GPU-Accelerated Computing with Python 3 and CUDA
Occupancy measures how effectively the streaming multiprocessors (SMs) are utilized. It is defined as the ratio of active warps on an SM to the maximum number of warps that SM can support. The maximum number of warps that may be active depends on the GPU architecture and can be found in Table 28 on this page of the official CUDA programming guide:
For example, the RTX 3080, based on the Ampere architecture with compute capability 8.6, supports a maximum of 48 active warps, or 1,536 active threads, per multiprocessor. With 68 SMs, that comes to 3,264 warps and 104,448 threads for the entire device. Notice that the maximum number of threads is an order of magnitude greater than the number of CUDA cores on the device. The reason for having more active threads than cores is to hide latency.
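The arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names `device_limits` and `occupancy` are hypothetical, the warp size of 32 threads is assumed, and the figure of 8,704 CUDA cores is not given in the text but taken from NVIDIA's published RTX 3080 specification. No GPU is required to run it.

```python
WARP_SIZE = 32  # assumed: threads per warp on current NVIDIA GPUs


def device_limits(max_warps_per_sm, num_sms, warp_size=WARP_SIZE):
    """Return per-SM and device-wide limits on active warps and threads."""
    threads_per_sm = max_warps_per_sm * warp_size
    return {
        "threads_per_sm": threads_per_sm,
        "warps_per_device": max_warps_per_sm * num_sms,
        "threads_per_device": threads_per_sm * num_sms,
    }


def occupancy(active_warps_per_sm, max_warps_per_sm):
    """Occupancy: active warps on an SM divided by the SM's maximum warps."""
    if not 0 <= active_warps_per_sm <= max_warps_per_sm:
        raise ValueError("active warps must be between 0 and the SM maximum")
    return active_warps_per_sm / max_warps_per_sm


# RTX 3080 figures from the text: 48 warps per SM, 68 SMs.
limits = device_limits(max_warps_per_sm=48, num_sms=68)
print(limits)

# Assumption: 8,704 CUDA cores (NVIDIA's published spec, not from the text).
CUDA_CORES = 8704
print("active threads per CUDA core:", limits["threads_per_device"] / CUDA_CORES)

# Hypothetical kernel that keeps 32 of the 48 possible warps active per SM.
print("occupancy:", occupancy(32, 48))
```

With these inputs, the device-wide limit works out to 12 active threads per CUDA core, which is the surplus the scheduler can draw on to hide latency.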
GPUs perform latency hiding by maximizing throughput. Let's break that down. A single arithmetic instruction can take anywhere from a few to tens of clock cycles to...