With focused profiling, we can restrict profiling to a specific region of code by using cudaProfilerStart() and cudaProfilerStop(). However, this approach is limited when we want to analyze the performance of individual functional blocks in a complex application. For this situation, the CUDA profiler supports timeline annotations via the NVIDIA Tools Extension (NVTX).
Using NVTX, we can annotate the CUDA code. We can use the NVTX API as follows:
nvtxRangePushA("Annotation");
// ... range of GPU operations ...
cudaDeviceSynchronize(); // needed when the target code block consists only of kernel calls, which are asynchronous
nvtxRangePop();
As you can see, we can define a range around a block of code and annotate that range manually. The CUDA profiler then shows the annotation in its timeline trace, so we can measure the execution time of each annotated code block. One drawback of this is that the NVTX APIs are host functions...
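To make the push/pop pattern concrete, here is a minimal sketch of a complete program that wraps a host-to-device copy and a kernel launch in separate NVTX ranges. The kernel `scale`, the helper `run_annotated_scale`, and the range names are hypothetical and exist only for illustration. The sketch assumes the legacy NVTX header `nvToolsExt.h` and linking with `-lnvToolsExt`; newer toolkits ship a header-only variant under `nvtx3/nvToolsExt.h`, so the include may need adjusting for your installation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nvToolsExt.h>  // assumption: legacy NVTX header; link with -lnvToolsExt

// Hypothetical kernel for illustration: scales each element in place.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: runs the copy and the kernel inside NVTX ranges.
// Returns the number of elements with an unexpected value, or -1 on allocation failure.
int run_annotated_scale(int n)
{
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1;

    // Range 1: a blocking copy, so no extra synchronization is needed.
    nvtxRangePushA("H2D copy");
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    nvtxRangePop();

    // Range 2: a kernel launch returns immediately on the host, so we
    // synchronize before popping; otherwise the range would only cover
    // the launch overhead, not the kernel's execution.
    nvtxRangePushA("scale kernel");
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // Verify on the host: every element started at 1.0f and was doubled.
    int mismatches = 0;
    for (float v : host)
        if (v != 2.0f) ++mismatches;
    return mismatches;
}

int main()
{
    int mismatches = run_annotated_scale(1 << 20);
    std::printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Assuming the file is saved as nvtx_demo.cu, it could be built with something like `nvcc -o nvtx_demo nvtx_demo.cu -lnvToolsExt` and then run under the profiler, where the two named ranges should appear on the timeline alongside the copy and the kernel. Ranges can also be nested, since each nvtxRangePop() closes the most recently pushed range on the calling thread.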