Shared memory has always played a vital role in the CUDA memory hierarchy as a user-managed cache. It gives users a mechanism to read data from global memory in a coalesced fashion and store it in on-chip memory that acts like a cache but is controlled by the user. In this section, we will not only go through the steps required to make use of shared memory, but also discuss how to load/store data from shared memory efficiently and how it is internally arranged in banks. Shared memory is visible only to threads in the same block, and all threads in a block see the same version of a shared variable.
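As a minimal sketch of these ideas, the hypothetical kernel below (the name, block size, and reversal task are illustrative assumptions, not from the text) declares a `__shared__` array, performs a coalesced load from global memory into it, and then lets threads read elements written by other threads in the same block:

```cuda
#define BLOCK_SIZE 256  // assumed block size for this illustration

// Hypothetical example: each block reverses its own slice of the input.
__global__ void reverseBlock(int *d_out, const int *d_in)
{
    // Visible to every thread in this block; each block gets its own copy.
    __shared__ int tile[BLOCK_SIZE];

    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;

    tile[t] = d_in[g];      // coalesced read from global memory
    __syncthreads();        // wait until all threads have filled tile[]

    // Each thread now reads an element another thread wrote,
    // which is only possible because shared memory is block-visible.
    d_out[g] = tile[BLOCK_SIZE - 1 - t];
}
```

The `__syncthreads()` barrier is essential here: without it, a thread could read `tile[]` before the thread responsible for that element has written it.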
Shared memory has similar benefits to a CPU cache; however, while a CPU cache cannot be explicitly managed, shared memory can. Shared memory has an order of magnitude lower latency than global memory and an order...