Reduction is a simple but useful algorithm to obtain a common parameter across many parameters. This task can be done in sequence or in parallel. When it comes to parallel processing to a parallel architecture, parallel reduction is the fastest way of getting a histogram, mean, or any other statistical values.
The following diagram shows the difference between sequential reduction and parallel reduction:
By having the reduction tasks in parallel, the parallel reduction algorithm can reduce the total steps at a log scale. Now, let's begin to implement this parallel reduction algorithm on the GPU. Firstly, we will implement this with a simple design using global memory. Then, we will implement another reduction version using the shared memory. By comparing the two implementations, we will discuss what brings a performance difference...