In this chapter, we covered how to configure and optimize CUDA parallel operations. To do this, we had to understand the relationship between CUDA's hierarchical thread architecture (thread blocks) and the streaming multiprocessors. Using performance models such as occupancy, performance limiter analysis, and the Roofline model, we were able to extract additional performance. Then, we covered a newer CUDA thread programmability feature, Cooperative Groups, and learned how it simplifies parallel programming. We optimized the parallel reduction problem down to 0.259 ms, a 17.8x speedup on the same GPU. Finally, we learned about CUDA's SIMD operations with half-precision (FP16) and INT8 values.
This chapter focused on programming at the GPU's parallel processing level. However, CUDA programming also includes system...