Learn CUDA Programming

By: Jaegeun Han, Bharatkumar Sharma

Overview of this book

Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.

Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. You'll not only be guided through GPU features, tools, and APIs, you'll also learn how to analyze performance with sample parallel programming algorithms. This book will help you optimize the performance of your apps by giving you insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be generated using multiple GPUs in a box or in multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing in your applications.

Hello World from CUDA

CUDA is a heterogeneous programming model that includes provisions for both the CPU and the GPU. The CUDA C/C++ programming interface consists of C language extensions so that you can target portions of source code for parallel execution on the device (GPU). It is based on industry-standard C/C++ and provides a library of C functions that execute on the host (CPU) so that the host can interact with the device.

In CUDA, there are two processors that work with each other. The CPU is usually referred to as the host, while the GPU is referred to as the device. The host is responsible for calling the device functions. As we've already mentioned, the part of the code that runs on the GPU is called device code, while the serial code that runs on the CPU is called host code.

Let's start by writing our first CUDA code in C. The intention is to take a systematic step-wise approach, start with some sequential code, and convert it into CUDA-aware code by adding some additional keywords. As we mentioned earlier, there is no necessity to learn a new language—all we need to do is add some keywords to the existing language so that we can run it in a heterogeneous environment with CPU and GPU.

Let's take a look at our first piece of code. All this code does is print Hello World! from both the host and device:

#include <stdio.h>
#include <stdlib.h>

// Device code (kernel): prints the thread and block index of the calling thread
__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] From device\n",
           threadIdx.x, blockIdx.x);
}

int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1, 1>>>();   // launch the kernel with one block of one thread
    cudaDeviceSynchronize();      // wait for the device to finish before exiting
    return 0;
}


Let's try to compile and run the preceding snippet: 

  1. Compile the code: Place the preceding code into a file called hello_world.cu and compile it using the NVIDIA CUDA compiler (nvcc). Note that the extension of the file is .cu, which tells the compiler that this file has GPU code inside it:
$ nvcc -o hello_world hello_world.cu
  2. Execute the GPU code: Running the resulting binary should print messages from both the host and the device, as shown in the sample run below:
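The exact text comes from the printf calls in the preceding code, so treat this as an approximate transcript rather than captured tool output:

$ ./hello_world
Hello World from host!
Hello World! from thread [0,0] From device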

By now, you might have already observed that the CUDA C code isn't very different from ordinary C code; it only requires that we learn some additional constructs to tell the compiler which functions are GPU code and how to call a GPU function. It isn't like we need to learn a new language altogether.

In the preceding code, we added a few constructs and keywords, as follows:

  • __global__: This keyword, when added before a function, tells the compiler that the function will run on the device and not on the host. However, note that it is called by the host. Another important thing to note here is that the return type of a kernel function is always void. Data-parallel portions of an algorithm are executed on the device as kernels.
  • <<<,>>>: This syntax, known as the execution configuration, tells the compiler that this is a call to a device function and not a host function. Additionally, the 1,1 parameters dictate the number of blocks and threads to launch for the kernel. We will cover the parameters inside the angle brackets later. For now, 1,1 means we are launching the kernel with only one thread, that is, running the code sequentially with a single thread, since we are not doing anything important in the code apart from printing.
  • threadIdx.x, blockIdx.x: These are the indices of the thread within its block and of the block within the grid; together, they give each thread a unique ID. We will cover this topic in more detail in the next section.
  • cudaDeviceSynchronize(): All of the kernel calls in CUDA are asynchronous in nature. The host becomes free after calling the kernel and starts executing the next instruction afterward. This should come as no big surprise since this is a heterogeneous environment and hence both the host and device can run in parallel to make use of the types of processors that are available. In case the host needs to wait for the device to finish, APIs have been provided as part of CUDA programming that make the host code wait for the device function to finish. One such API is cudaDeviceSynchronize, which waits until all of the previous calls to the device have finished.
Try removing the cudaDeviceSynchronize() call and see whether the device output is still visible. Alternatively, try moving this call to before the host's printf statement.
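As a quick sketch of that experiment, here is the main() function of the preceding program with the synchronization call commented out; on most systems, the device output then never appears, because the host program exits before the kernel's printf output is flushed:

int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1, 1>>>();    // asynchronous: the host does not wait here
    // cudaDeviceSynchronize();    // without this, main() may return before the
                                   // device's printf output reaches the terminal
    return 0;
}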

Thread hierarchy

Now, let's start playing around with the two launch parameters inside the angle brackets and observe how threadIdx.x and blockIdx.x change.

Experiment 1: First, change the launch parameters from <<<1,1>>> to <<<2,1>>> and view the output. This launches two blocks with a single thread each; the output of the Hello World code should be as follows:
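With <<<2,1>>>, blockIdx.x takes the values 0 and 1 while threadIdx.x stays 0, so the output should look roughly like the following (the order in which the blocks print is not guaranteed):

$ ./hello_world
Hello World from host!
Hello World! from thread [0,0] From device
Hello World! from thread [0,1] From device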

As we can see, instead of one thread, we now have two threads printing the value. Note that their unique IDs are different.

Experiment 2: Now, instead of changing the first parameter, let's change the second one, that is, change <<<1,1>>> to <<<1,2>>>. This launches a single block with two threads. Observe the output of the Hello World code, as follows:
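With <<<1,2>>>, threadIdx.x takes the values 0 and 1 while blockIdx.x stays 0, so the output should look roughly like this:

$ ./hello_world
Hello World from host!
Hello World! from thread [0,0] From device
Hello World! from thread [1,0] From device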

As you can see, the total number of threads that were launched into the kernel is two, just like before; the only difference is that their IDs are different. So, what are these thread and block concepts? To answer this, let's dive a little deeper into the GPU architecture.

GPU architecture

One of the key reasons why CUDA became so popular is that the hardware and software have been designed together and tightly bound to get the best performance out of an application. Because of this, it is necessary to show the relationship between the CUDA programming concepts in software and the hardware design itself.

The two sides of CUDA, that is, how the CUDA software concepts map onto the GPU hardware, are summarized in the following table, which explains the software-to-hardware mapping in terms of the CUDA programming model:

Software        Executes on/as (hardware)
CUDA thread     CUDA core/SIMD code
CUDA block      Streaming Multiprocessor (SM)
GRID/kernel     GPU device
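To see the hardware side of this mapping on your own machine, you can query the device through the CUDA runtime API. The following is a minimal sketch (error checking omitted); compile it with nvcc just like the earlier example:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Device name       : %s\n", prop.name);
    printf("Number of SMs     : %d\n", prop.multiProcessorCount);
    printf("Max threads/block : %d\n", prop.maxThreadsPerBlock);
    return 0;
}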

Let's take a look at the preceding table's components in detail:

  • CUDA threads: CUDA threads execute on a CUDA core. CUDA threads are different from CPU threads: they are extremely lightweight and provide fast context switching. The reason for the fast context switching is the large register file available on a GPU and the hardware-based scheduler. The thread context stays resident in registers, whereas on a CPU the thread state lives lower in the memory hierarchy, such as in a cache. Hence, when one thread is idle or waiting, another thread that is ready can start executing with almost no delay. Each CUDA thread executes the same kernel and works independently on different data (SIMT).
  • CUDA blocks: CUDA threads are grouped together into a logical entity called a CUDA block. CUDA blocks execute on a single Streaming Multiprocessor (SM). One block runs on a single SM, that is, all of the threads within one block execute only on the cores of one SM and do not execute on the cores of other SMs. Each GPU may have one or more SMs, and hence, to make effective use of the whole GPU, the user needs to divide the parallel computation into blocks and threads.
  • GRID/kernel: CUDA blocks are grouped together into a logical entity called a CUDA GRID. A CUDA GRID is then executed on the device. The short sketch after this list illustrates this hierarchy in code.
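To make this hierarchy concrete, here is a minimal, hypothetical kernel (not part of the book's code) that combines the block and thread indices into a single global thread ID; the 4-block, 8-thread launch configuration is just an illustrative choice:

#include <stdio.h>

__global__ void print_global_id(void) {
    // Global ID = offset of this block + offset of this thread within the block
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global id %d\n",
           blockIdx.x, threadIdx.x, global_id);
}

int main(void) {
    print_global_id<<<4, 8>>>();   // a grid of 4 blocks, each with 8 threads
    cudaDeviceSynchronize();       // wait for all device output
    return 0;
}

Each of the 32 threads computes a distinct global ID between 0 and 31, which is the typical pattern used when mapping threads to elements of an array.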

This may sound somewhat complicated at first glance. In the next section, we'll take a look at an example of vector addition to explain this. Hopefully, things will become much clearer.