-
Book Overview & Buying
-
Table Of Contents
GPU-Accelerated Computing with Python 3 and CUDA
By :
Synchronous copies such as cuda.to_device() and cuda.copy_to_host() block the CPU thread until the transfer is complete via enforcing implicit synchronization. This means that the CPU cannot schedule kernels or additional transfers while waiting, which prevents overlap between data movement and computation. To allow such overlap, we must instead use asynchronous copies, using cuda.to_device(..., stream=stream) or device_array.copy_to_host(stream=stream), which return control to the CPU immediately and let it continue scheduling kernels or further transfers while the data transfer proceeds in the background.
The following snippet of code illustrates how two asynchronous transfers can be issued on two separate streams:
device_array_1 = cuda.to_device(array_1, stream=stream1)
device_array_2 = cuda.to_device(array_2, stream=stream2)
On most GPUs, only one transfer in each direction (either H2D or D2H) can overlap with computation. More recent GPUs (e.g., A100)...