Prefix sum (scan) is used to obtain a cumulative number array from the given input numbers array. For example, we can make a prefix-sum sequence as follows:
Input numbers | 1 | 2 | 3 | 4 | 5 | 6 | ... |
Prefix sums | 1 | 3 | 6 | 10 | 15 | 21 | ... |
It differs from parallel reduction since reduction just generates the total operation output from the given input data. On the other hand, scan generates outputs from each operation. The easiest way to solve this problem is to iterate all the inputs to generate the output. However, it would take a long time and would be inefficient in GPUs. Hence, the mild approach can parallelize the prefix-sum operation, as follows:
In this approach, we can obtain the output using multiple CUDA cores. However, this method does not reduce the total number of iterations because the first input element should be added for all the outputs...