Learn CUDA Programming

By: Jaegeun Han, Bharatkumar Sharma

Overview of this book

Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.

Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. You'll not only be guided through GPU features, tools, and APIs, you'll also learn how to analyze performance with sample parallel programming algorithms. This book will help you optimize the performance of your apps by giving you insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be generated using multiple GPUs in a box or in multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing in your applications.

Hello World from CUDA

CUDA is a heterogeneous programming model that includes provisions for both the CPU and the GPU. The CUDA C/C++ programming interface consists of C language extensions so that you can target portions of source code for parallel execution on the device (GPU). It is based on industry-standard C/C++ and provides a library of C functions that execute on the host (CPU) so that the host can interact with the device.

In CUDA, there are two processors that work with each other. The CPU is usually referred to as the host, while the GPU is referred to as the device. The host is responsible for calling the device functions. As we've already mentioned, the part of the code that runs on the GPU is called device code, while the serial code that runs on the CPU is called host code.

Let's start by writing our first CUDA code in C. The intention is to take a systematic step-wise approach, start with some sequential code, and convert it into CUDA-aware code by adding some additional keywords. As we mentioned earlier, there is no necessity to learn a new language—all we need to do is add some keywords to the existing language so that we can run it in a heterogeneous environment with CPU and GPU.

Let's take a look at our first piece of code. All this code does is print Hello World! from both the host and device:

#include <stdio.h>
#include <stdlib.h>

// Device code (kernel): prints the thread and block index of the calling thread
__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] From device\n",
           threadIdx.x, blockIdx.x);
}

int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1, 1>>>();   // launch the kernel with one block of one thread
    cudaDeviceSynchronize();      // wait for the device to finish before exiting
    return 0;
}


Let's try to compile and run the preceding snippet: 

  1. Compile the code: Place the preceding code into a file called hello_world.cu and compile it using the NVIDIA CUDA compiler (nvcc). Note that the extension of the file is .cu, which tells the compiler that this file has GPU code inside it:
$ nvcc -o hello_world hello_world.cu
  2. Execute the GPU code: Running the resulting binary should print messages from both the host and the device, as shown in the sample run below:
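The exact text comes from the printf calls in the preceding code, so treat this as an approximate transcript rather than captured tool output:

$ ./hello_world
Hello World from host!
Hello World! from thread [0,0] From device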

By now, you might have already observed that the CUDA C code isn't very different from ordinary C code; it only requires that we learn some additional constructs to tell the compiler which functions are GPU code and how to call a GPU function. It isn't like we need to learn a new language altogether.

In the preceding code, we added a few constructs and keywords, as follows:

  • __global__: This keyword, when added before a function, tells the compiler that the function will run on the device and not on the host. However, note that it is called by the host. Another important thing to note here is that the return type of a kernel function is always void. Data-parallel portions of an algorithm are executed on the device as kernels.
  • <<<,>>>: This syntax, known as the execution configuration, tells the compiler that this is a call to a device function and not a host function. Additionally, the 1,1 parameters dictate the number of blocks and threads to launch for the kernel. We will cover the parameters inside the angle brackets later. For now, 1,1 means we are launching the kernel with only one thread, that is, running the code sequentially with a single thread, since we are not doing anything important in the code apart from printing.
  • threadIdx.x, blockIdx.x: These are the indices of the thread within its block and of the block within the grid; together, they give each thread a unique ID. We will cover this topic in more detail in the next section.
  • cudaDeviceSynchronize(): All of the kernel calls in CUDA are asynchronous in nature. The host becomes free after calling the kernel and starts executing the next instruction afterward. This should come as no big surprise since this is a heterogeneous environment and hence both the host and device can run in parallel to make use of the types of processors that are available. In case the host needs to wait for the device to finish, APIs have been provided as part of CUDA programming that make the host code wait for the device function to finish. One such API is cudaDeviceSynchronize, which waits until all of the previous calls to the device have finished.
Try removing the cudaDeviceSynchronize() call and see whether the device output is still visible. Alternatively, try moving this call to before the host's printf statement.
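As a quick sketch of that experiment, here is the main() function of the preceding program with the synchronization call commented out; on most systems, the device output then never appears, because the host program exits before the kernel's printf output is flushed:

int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1, 1>>>();    // asynchronous: the host does not wait here
    // cudaDeviceSynchronize();    // without this, main() may return before the
                                   // device's printf output reaches the terminal
    return 0;
}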

Thread hierarchy

Now, let's start playing around with the two launch parameters inside the angle brackets and observe how threadIdx.x and blockIdx.x change.

Experiment 1: First, change the launch parameters from <<<1,1>>> to <<<2,1>>> and view the output. This launches two blocks with a single thread each; the output of the Hello World code should be as follows:
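With <<<2,1>>>, blockIdx.x takes the values 0 and 1 while threadIdx.x stays 0, so the output should look roughly like the following (the order in which the blocks print is not guaranteed):

$ ./hello_world
Hello World from host!
Hello World! from thread [0,0] From device
Hello World! from thread [0,1] From device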

As we can see, instead of one thread, we now have two threads printing the value. Note that their unique IDs are different.

Experiment 2: Now, instead of changing the first parameter, let's change the second one, that is, change <<<1,1>>> to <<<1,2>>>. This launches a single block with two threads. Observe the output of the Hello World code, as follows:
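With <<<1,2>>>, threadIdx.x takes the values 0 and 1 while blockIdx.x stays 0, so the output should look roughly like this:

$ ./hello_world
Hello World from host!
Hello World! from thread [0,0] From device
Hello World! from thread [1,0] From device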

As you can see, the total number of threads that were launched into the kernel is two, just like before; the only difference is that their IDs are different. So, what are these thread and block concepts? To answer this, let's dive a little deeper into the GPU architecture.

GPU architecture

One of the key reasons why CUDA became so popular is that the hardware and software have been designed together and tightly bound to get the best performance out of an application. Because of this, it is necessary to show the relationship between the CUDA programming concepts in software and the hardware design itself.

The two sides of CUDA, that is, how the CUDA software concepts map onto the GPU hardware, are summarized in the following table, which explains the software-to-hardware mapping in terms of the CUDA programming model:

Software        Executes on/as (hardware)
CUDA thread     CUDA core/SIMD code
CUDA block      Streaming Multiprocessor (SM)
GRID/kernel     GPU device
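To see the hardware side of this mapping on your own machine, you can query the device through the CUDA runtime API. The following is a minimal sketch (error checking omitted); compile it with nvcc just like the earlier example:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Device name       : %s\n", prop.name);
    printf("Number of SMs     : %d\n", prop.multiProcessorCount);
    printf("Max threads/block : %d\n", prop.maxThreadsPerBlock);
    return 0;
}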

Let's take a look at the preceding table's components in detail:

  • CUDA threads: CUDA threads execute on a CUDA core. CUDA threads are different from CPU threads: they are extremely lightweight and provide fast context switching. The reason for the fast context switching is the large register file available on a GPU and the hardware-based scheduler. The thread context stays resident in registers, whereas on a CPU the thread state lives lower in the memory hierarchy, such as in a cache. Hence, when one thread is idle or waiting, another thread that is ready can start executing with almost no delay. Each CUDA thread executes the same kernel and works independently on different data (SIMT).
  • CUDA blocks: CUDA threads are grouped together into a logical entity called a CUDA block. CUDA blocks execute on a single Streaming Multiprocessor (SM). One block runs on a single SM, that is, all of the threads within one block execute only on the cores of one SM and do not execute on the cores of other SMs. Each GPU may have one or more SMs, and hence, to make effective use of the whole GPU, the user needs to divide the parallel computation into blocks and threads.
  • GRID/kernel: CUDA blocks are grouped together into a logical entity called a CUDA GRID. A CUDA GRID is then executed on the device. The short sketch after this list illustrates this hierarchy in code.
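To make this hierarchy concrete, here is a minimal, hypothetical kernel (not part of the book's code) that combines the block and thread indices into a single global thread ID; the 4-block, 8-thread launch configuration is just an illustrative choice:

#include <stdio.h>

__global__ void print_global_id(void) {
    // Global ID = offset of this block + offset of this thread within the block
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global id %d\n",
           blockIdx.x, threadIdx.x, global_id);
}

int main(void) {
    print_global_id<<<4, 8>>>();   // a grid of 4 blocks, each with 8 threads
    cudaDeviceSynchronize();       // wait for all device output
    return 0;
}

Each of the 32 threads computes a distinct global ID between 0 and 31, which is the typical pattern used when mapping threads to elements of an array.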

This may sound somewhat complicated at first glance. In the next section, we'll take a look at an example of vector addition to explain this. Hopefully, things will become much clearer.