CUDA dynamic parallelism (CDP) is a device runtime feature that lets a kernel running on the GPU launch other kernels. Each nested launch creates a child grid with its own launch configuration, independent of the parent grid's. This is useful when the appropriate block size varies with the problem.
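As a minimal sketch of that idea, the following code has each parent block launch one child grid whose block size is chosen on the device. The names fill_child, fill_parent, and run_cdp_demo are hypothetical, and the rule for picking the block size (32, 64, or 128 depending on the parent block index) is an arbitrary assumption made only for illustration. It avoids device-side synchronization and relies on the fact that a parent grid is not considered complete until its child grids have finished.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Child grid: every thread writes `value` into one element of its segment.
__global__ void fill_child(int *segment, int n, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        segment[idx] = value;
}

// Parent grid: thread 0 of each block launches one child grid.
// The block-size rule below is an illustrative assumption, not a tuning recommendation.
__global__ void fill_parent(int *data, int segment_len)
{
    if (threadIdx.x == 0) {
        int block_size = 32 << (blockIdx.x % 3);  // 32, 64, or 128 threads
        int grid_size = (segment_len + block_size - 1) / block_size;
        fill_child<<<grid_size, block_size>>>(
            data + blockIdx.x * segment_len, segment_len, blockIdx.x + 1);
    }
}

// Hypothetical host helper: returns the number of wrong elements,
// or -1 if any CUDA call fails.
int run_cdp_demo(int num_segments, int segment_len)
{
    size_t count = static_cast<size_t>(num_segments) * segment_len;
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, count * sizeof(int)) != cudaSuccess)
        return -1;
    cudaMemset(d_data, 0, count * sizeof(int));

    // One parent block per segment; each parent block launches one child grid.
    fill_parent<<<num_segments, 32>>>(d_data, segment_len);
    cudaError_t err = cudaGetLastError();
    // The parent grid completes only after all of its child grids complete.
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    std::vector<int> host(count);
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), d_data, count * sizeof(int),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess)
        return -1;

    // Segment s should be filled with the value s + 1.
    int mismatches = 0;
    for (size_t i = 0; i < count; ++i)
        if (host[i] != static_cast<int>(i / segment_len) + 1)
            ++mismatches;
    return mismatches;
}
```

Calling run_cdp_demo(4, 1000) from a host main function should return 0 on a device that supports dynamic parallelism (compute capability 3.5 or higher). Device-side launches require relocatable device code, so the file would be built with something like nvcc -rdc=true; older toolkits may also need -lcudadevrt.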
CUDA dynamic parallelism
Understanding dynamic parallelism
Just as the host launches a kernel, a kernel running on the GPU can launch another kernel. The following sample code shows how it works:
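__global__ void child_kernel(int *data, int seed)
{
    // Each child thread adds seed to its own element of data.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&data[idx], seed);
}

__global__ void parent_kernel(int *data)
{
    // Only the first thread of each parent block launches a child grid.
    if (threadIdx.x == 0) {
        // BUF_SIZE divided evenly among the parent blocks.
        int child_size = BUF_SIZE/gridDim.x;
        child_kernel<<< child_size...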