-
Book Overview & Buying
-
Table Of Contents
GPU-Accelerated Computing with Python 3 and CUDA
By :
Accessing global memory is slow, which is why there are two layers of cache in between to reduce the effective latency. L1 and L2 caches cannot be explicitly programmed; their behavior is controlled indirectly via data access patterns, through configuration, and by using specific load and store operations. However, some algorithms with predictable data access patterns and frequent data reuse benefit from low-latency memory that is explicitly controlled. This section explains two mechanisms that reduce global memory requests and cache pressure: shared memory and warp intrinsics.
Shared memory is memory that lives on the same hardware as the L1 cache. Each SM has its own L1 cache, with data access that is about 20-30 times faster than accessing VRAM.
As the name implies, shared memory is shared by all threads in a block. Therefore, it is useful as a programmable scratch space for keeping data that needs to be reused...