GPU-Accelerated Computing with Python 3 and CUDA
In Chapter 3, threads were assumed to run independently, and the importance of the warp was not considered. However, warp-level behavior significantly impacts performance.
Although the threads in a warp execute in lock-step, each thread can still follow its own path through the code. This works because the warp steps through every path that any of its threads takes, and each thread executes only the instructions on its own path, sitting idle for the rest. When kernels contain extensive control flow, threads may therefore spend considerable time waiting for others in the same warp, a phenomenon known as warp divergence.
A classic example of warp divergence occurs with conditional branches. If the threads in a warp split between the if and else branches, the warp executes both branches one after the other. While one branch runs, the threads that did not take it stall until it completes.
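The cost of this serialization can be reasoned about without a GPU. The following is a minimal sketch of a simplified cost model, not a measurement: it assumes a warp pays the full cost of every branch that at least one of its threads takes. The function name warp_cost and the branch costs are hypothetical and chosen only for illustration.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def warp_cost(branch_of_thread, branch_cost):
    """Modelled time for one warp under the simplified assumption that the
    warp pays for every branch taken by at least one of its threads."""
    taken = set(branch_of_thread)
    return sum(branch_cost[b] for b in taken)


# Hypothetical costs, in arbitrary time units, for the two sides of a conditional.
branch_cost = {"if": 10, "else": 4}

# Divergent warp: even-numbered threads take the if branch, odd-numbered the else.
divergent = ["if" if t % 2 == 0 else "else" for t in range(WARP_SIZE)]

# Uniform warp: every thread takes the if branch.
uniform = ["if"] * WARP_SIZE

print(warp_cost(divergent, branch_cost))  # 14: both branches run back to back
print(warp_cost(uniform, branch_cost))    # 10: only the if branch runs
```

In this model, the divergent warp is charged for both branches even though each thread needs only one of them, which is why branching on a value that is uniform across the warp avoids the penalty.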
This behavior can be demonstrated with the following kernel:
@cuda.jit
def warp_divergence(output):
    t_id = cuda.grid(1)
    if t_id % 2 == 0:
        val = 1.0
        cuda.nanosleep(1000)
    else:
        val =...