As mentioned earlier in this chapter, in the threads section, there is a special on-chip memory that (again, as of the time of writing) provides only 48 KB per block, visible to all the threads in that block. That 48 KB is enough memory to hold, say, an array of float values in which each thread stores the result of its own execution. That way, at the end of execution, one of the threads (typically thread 0) can walk over that shared array and sum all the values.
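To make this concrete, here is a minimal sketch of that pattern. The kernel name blockSum, the 256-thread block size, and the buffer names are illustrative assumptions, not code from the book:

```cpp
// Minimal sketch: each thread writes a partial value into shared memory,
// then thread 0 walks the shared array and sums the partials.
// Assumes blockDim.x <= 256 (hypothetical size), well within 48 KB.
__global__ void blockSum(const float* in, float* blockResults, int n)
{
    __shared__ float partials[256];   // one slot per thread

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Each thread stores its own partial result.
    partials[tid] = (i < n) ? in[i] : 0.0f;

    // Wait until every thread in the block has written its slot.
    __syncthreads();

    // Thread 0 accumulates the block's total.
    if (tid == 0) {
        float sum = 0.0f;
        for (int t = 0; t < blockDim.x; ++t)
            sum += partials[t];
        blockResults[blockIdx.x] = sum;
    }
}
```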
Another typical use for shared memory is pre-fetching data from global memory and processing it locally. Say we need to compute the dot product of one vector against many other vectors. We could load vector A into shared memory once and reuse it while accessing the elements of each row of matrix M from global memory. Although simple, this saves global memory accesses, since every thread reads A from the fast on-chip copy instead of fetching it from global memory again for each row.
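A sketch of this pre-fetch pattern follows, assuming the vector length K fits in shared memory (here, at most 1,024 floats, or 4 KB). The kernel name dotRows and the parameter names are assumptions for illustration:

```cpp
// Sketch: each block stages vector A in shared memory once, then each
// thread computes the dot product of A against one row of M.
__global__ void dotRows(const float* A, const float* M,
                        float* out, int numRows, int K)
{
    __shared__ float sA[1024];        // assumes K <= 1024 (4 KB of floats)

    // Cooperatively copy A from global to shared memory.
    for (int j = threadIdx.x; j < K; j += blockDim.x)
        sA[j] = A[j];
    __syncthreads();                  // A is now fully staged on-chip

    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float acc = 0.0f;
        const float* rowPtr = M + (size_t)row * K;
        for (int j = 0; j < K; ++j)
            acc += sA[j] * rowPtr[j]; // A comes from fast shared memory
        out[row] = acc;
    }
}
```

Each element of A is read from global memory only once per block, rather than once per row, which is exactly the saving the paragraph above describes.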