Learn CUDA Programming

By: Jaegeun Han, Bharatkumar Sharma

Overview of this book

Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in science, healthcare, and deep learning.

Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. Not only will you be guided through GPU features, tools, and APIs, but you'll also learn how to analyze performance with sample parallel programming algorithms. This book will help you optimize the performance of your apps by giving insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be generated using multiple GPUs in a box or in multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing into your applications.
Table of Contents (18 chapters)

To get the most out of this book

This book is designed for complete beginners and those who have just started to learn parallel computing. No specific knowledge is required beyond the basics of computer architecture, though experience with C/C++ programming is assumed. For deep learning enthusiasts, Chapter 10, Deep Learning Acceleration with CUDA, also provides Python-based sample code, so some Python knowledge is expected for that chapter specifically.

The code for this book is primarily developed and tested in a Linux environment, so familiarity with Linux is helpful. Any recent Linux distribution, such as CentOS or Ubuntu, will work. The code can be compiled using either a makefile or the command line. The book primarily uses a free software stack, so there is no need to buy any software licenses. Two key pieces of software used throughout are the CUDA Toolkit and the PGI Community Edition.

Since the book primarily covers the latest GPU features using CUDA 10.x, a recent GPU architecture (Pascal onward) is beneficial for fully exploiting all the training material. While not every chapter requires the latest GPU, having one will help you reproduce the results presented in the book. Each chapter lists the preferred or required GPU architecture in its Technical requirements section.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at www.packt.com.
  2. Select the Support tab.
  3. Click on Code Downloads.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learn-CUDA-Programming. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Note that there is an asynchronous alternative to cudaMemcpy."
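For reference, the asynchronous alternative mentioned in that example is cudaMemcpyAsync, which enqueues the copy on a CUDA stream and returns immediately rather than blocking the host. A minimal sketch (buffer name and size are illustrative; pinned host memory is used because it is required for truly asynchronous transfers):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    const int n = 1 << 20;
    float *h_buf, *d_buf;

    // Pinned (page-locked) host memory enables genuinely asynchronous copies
    cudaMallocHost(&h_buf, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous alternative to cudaMemcpy: the call returns immediately,
    // and the copy is performed on 'stream'
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // The host must synchronize before relying on the copy's completion
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Unlike cudaMemcpy, which blocks the host until the transfer finishes, this pattern lets the host overlap other work with the copy between the cudaMemcpyAsync call and the synchronization point.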

A block of code is set as follows:

#include <stdio.h>
#include <stdlib.h>

__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] From device\n",
           threadIdx.x, blockIdx.x);
}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Any command-line input or output is written as follows:

$ nvcc -o hello_world hello_world.cu

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "For Windows users, in the VS project properties dialog, you can specify your GPU's compute capability at CUDA C/C++ | Device Code Generation."

Warnings or important notes appear like this.
Tips and tricks appear like this.