CUDA 9.0 introduces a new model of warp synchronous programming. This major change aims to stop CUDA code from relying on implicit warp synchronization and instead has the programmer state explicitly which threads are the targets of each synchronous operation. This helps to prevent unintended race conditions and deadlocks in warp-wide synchronous operations.
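To make the idea of explicit synchronization targets concrete, here is a minimal sketch of a warp-level sum built on `__shfl_down_sync()`, one of the warp primitives added in CUDA 9.0. Its first argument is a mask naming the lanes that take part in the operation. The function and kernel names (`warp_reduce_sum`, `warp_sum_kernel`, `warp_sum_on_gpu`) are hypothetical, the kernel assumes a single block of exactly 32 threads, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sums the values held by the 32 lanes of one warp.
// The mask passed to __shfl_down_sync names the participating lanes explicitly.
__inline__ __device__ int warp_reduce_sum(int value) {
    const unsigned full_mask = 0xffffffffu;  // all 32 lanes take part
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        // Each lane adds the value held by the lane `offset` positions above it.
        value += __shfl_down_sync(full_mask, value, offset);
    }
    return value;  // lane 0 now holds the total
}

// Assumption: launched with one block of exactly 32 threads (a single warp).
__global__ void warp_sum_kernel(const int *in, int *out) {
    int sum = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) {
        *out = sum;
    }
}

// Hypothetical host helper: sums 32 ints on the GPU using one warp.
// CUDA API return codes are not checked in this sketch.
int warp_sum_on_gpu(const int *host_in) {
    int *d_in = nullptr;
    int *d_out = nullptr;
    int result = 0;
    cudaMalloc(&d_in, 32 * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host_in, 32 * sizeof(int), cudaMemcpyHostToDevice);
    warp_sum_kernel<<<1, 32>>>(d_in, d_out);
    // This blocking copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main() {
    int data[32];
    for (int i = 0; i < 32; ++i) {
        data[i] = i + 1;  // values 1..32
    }
    printf("warp sum = %d\n", warp_sum_on_gpu(data));
    return 0;
}
```

Compiled with `nvcc` from CUDA 9.0 or later and run on a CUDA-capable GPU, the program adds the values 1 through 32 inside a single warp. The point of the sketch is the mask argument: the set of cooperating lanes is written in the code rather than assumed from lockstep execution.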
Historically, CUDA provided only one explicit synchronization API, __syncthreads(), which synchronizes the CUDA threads in a thread block; within a warp, it relied on implicit synchronization. The following figure shows the two levels of synchronization in a CUDA thread block's operation:
However, the latest GPU architectures (Volta and Turing) have an enhanced thread control model in which each thread can execute a different instruction, while the SIMT programming model is kept. The following diagram shows how it has changed:
Until the Pascal architecture (left...