GPU-Accelerated Computing with Python 3 and CUDA
As discussed in Chapter 3, Numba-CUDA compiles kernels into a lower-level representation made up of instructions for the device. PTX (Parallel Thread Execution) assembly is the lowest-level human-readable representation that Numba-CUDA emits directly; the CUDA driver then translates it into machine code for the specific GPU. Although assembly code is much harder to read than a kernel written in a high-level language, it is a more transparent record of the work that will actually be performed. Rules of thumb are often used to explain how a change to a kernel affects its performance, but the most reliable way to understand what a kernel is doing is to inspect its assembly code.
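To make the idea of PTX as a record of the work more concrete, the following minimal sketch splits PTX instruction lines into a base opcode, its dot-separated modifiers (state space, type, and so on), and its operands, and then counts how often each opcode appears. The PTX fragment is hand-written for illustration and is not compiler output for any kernel in this book; the names `PTX_FRAGMENT`, `parse_instruction`, and `opcode_histogram` are hypothetical. The regular expression assumes simple single-line instructions with an optional predicate guard such as `@%p1`, and it does not handle vector operands like `{%f1, %f2}`.

```python
import re
from collections import Counter

# Hand-written PTX body for illustration only (not real compiler output).
PTX_FRAGMENT = """
    ld.param.u64        %rd1, [kernel_param_0];
    cvta.to.global.u64  %rd2, %rd1;
    mov.u32             %r1, %tid.x;
    mul.wide.s32        %rd3, %r1, 4;
    add.s64             %rd4, %rd2, %rd3;
    ld.global.f32       %f1, [%rd4];
    mul.f32             %f2, %f1, %f1;
    st.global.f32       [%rd4], %f2;
    ret;
"""

# Optional predicate guard (e.g. "@%p1"), then an opcode with dot-separated
# modifiers, then everything up to the terminating semicolon as operands.
INSTRUCTION = re.compile(r"^\s*(?:@!?%\w+\s+)?([a-z][\w.]*)\s*([^;]*);")


def parse_instruction(line):
    """Split one PTX instruction into (base opcode, modifiers, operands)."""
    m = INSTRUCTION.match(line)
    if m is None:
        return None  # directive, label, brace, comment or blank line
    base, *modifiers = m.group(1).split(".")
    operands = [op.strip() for op in m.group(2).split(",") if op.strip()]
    return base, modifiers, operands


def opcode_histogram(ptx):
    """Count base opcodes as a rough picture of the work a kernel performs."""
    counts = Counter()
    for line in ptx.splitlines():
        parsed = parse_instruction(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


if __name__ == "__main__":
    for opcode, n in opcode_histogram(PTX_FRAGMENT).most_common():
        print(f"{opcode:6} {n}")
```

For this fragment, the histogram contains two `ld` instructions (one parameter load and one global load), two `mul` instructions, and one each of `cvta`, `mov`, `add`, `st`, and `ret`. Grouping by the base opcode while keeping the modifiers separate mirrors how PTX itself is structured: the modifiers refine what a basic operation does rather than naming a different operation.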
This section gives a bird's-eye view of PTX assembly so that you become familiar with its syntax. The goal is to be able to read PTX, spot anomalies, and find clues about how to improve a kernel.
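As an illustration of the kind of anomalies worth looking for, the minimal sketch below flags two patterns with regular expressions. The first is any instruction operating on `.f64` values, which in a kernel intended to run in single precision usually signals an accidental promotion to double precision (for example, a float64 constant slipping into float32 arithmetic). The second is `ld.local`/`st.local` traffic, which often points to register spills or per-thread arrays placed in local memory, which resides in device memory. The check names, `SUSPICIOUS_PTX`, and `scan_ptx` are hypothetical, and the PTX fragment is hand-written rather than produced by a compiler.

```python
import re

# Hand-written PTX lines for illustration only (not real compiler output).
SUSPICIOUS_PTX = """
    ld.global.f32   %f1, [%rd4];
    cvt.f64.f32     %fd1, %f1;
    mul.f64         %fd2, %fd1, 0d4000000000000000;
    cvt.rn.f32.f64  %f2, %fd2;
    st.local.f32    [%rd9], %f2;
    ld.local.f32    %f3, [%rd9];
    st.global.f32   [%rd4], %f3;
"""

# Each pattern inspects only the opcode token (optionally preceded by a
# predicate guard such as "@%p1"), because [\w.]* cannot cross whitespace.
CHECKS = {
    # Hypothetical check: any .f64 instruction in a kernel meant to be float32.
    "double_precision": re.compile(r"^\s*(?:@!?%\w+\s+)?[a-z][\w.]*\.f64\b"),
    # Hypothetical check: local-memory loads/stores (spills or local arrays).
    "local_memory": re.compile(r"^\s*(?:@!?%\w+\s+)?(?:ld|st)\.local\b"),
}


def scan_ptx(ptx):
    """Return {check name: [1-based line numbers]} for lines that trip a check."""
    hits = {name: [] for name in CHECKS}
    for lineno, line in enumerate(ptx.splitlines(), start=1):
        for name, pattern in CHECKS.items():
            if pattern.match(line):
                hits[name].append(lineno)
    return hits


if __name__ == "__main__":
    for name, lines in scan_ptx(SUSPICIOUS_PTX).items():
        print(f"{name}: lines {lines}")
```

For this fragment, `scan_ptx` reports lines 3 to 5 under `double_precision` and lines 6 and 7 under `local_memory` (line 1 is the empty line at the start of the string). A hit is only a clue, not a verdict: some kernels use double precision or local memory on purpose.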
Let's have a look at the assembly representation of the sine kernel from the previous section. The PTX representation of the kernel can be accessed...