Learn CUDA Programming

By: Jaegeun Han, Bharatkumar Sharma

Overview of this book

Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.

Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. You'll not only be guided through GPU features, tools, and APIs, you'll also learn how to analyze performance with sample parallel programming algorithms. This book will help you optimize the performance of your apps by giving insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be generated using multiple GPUs in a box or in multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing in your applications.

Prefix sum (scan)

Prefix sum (scan) computes an array of cumulative sums from an input array of numbers: each output element is the sum of all input elements up to and including that position. For example, we can build a prefix-sum sequence as follows:

Input numbers	1	2	3	4	5	6	...
Prefix sums	1	3	6	10	15	21	...
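The running totals in the table above can be reproduced with a short sequential loop. The following is a minimal sketch in Python rather than CUDA, intended only to pin down what the operation computes; the function name `inclusive_scan` is a hypothetical choice for illustration.

```python
def inclusive_scan(values):
    """Return the running totals of `values` (inclusive prefix sum)."""
    totals = []
    running = 0
    for v in values:
        running += v            # fold the current input into the running total
        totals.append(running)  # every intermediate total is kept as an output
    return totals


print(inclusive_scan([1, 2, 3, 4, 5, 6]))  # [1, 3, 6, 10, 15, 21]
```

Note that the last output, 21, is exactly what a reduction over the same input would return, while the scan also keeps every intermediate total.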

Scan differs from parallel reduction: reduction produces only the single, final result of applying the operation across the input data, whereas scan also produces the intermediate result at every position. The easiest way to solve this problem is to iterate over all the inputs sequentially to generate the output. However, this takes a long time and makes poor use of a GPU's parallelism. Hence, a simple first approach is to parallelize the prefix-sum operation so that the output elements are computed concurrently.
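One way to picture such a simple parallel approach is sketched here, simulated on the CPU in Python rather than written as a CUDA kernel. It assumes one logical thread per output element, with each thread independently summing every input up to its own index. The function name `naive_parallel_scan` and the variable names are hypothetical, and this may differ in detail from the scheme the text has in mind.

```python
def naive_parallel_scan(values):
    """Simulate one logical thread per output element.

    Thread `idx` sums values[0..idx] on its own, so no thread depends on
    another thread's result. On a GPU the outer loop would be replaced by
    concurrently running threads; here it simply runs sequentially.
    """
    n = len(values)
    output = [0] * n
    for idx in range(n):                 # stands in for the thread index
        element = 0
        for offset in range(idx + 1):    # walk back from idx to element 0
            element += values[idx - offset]
        output[idx] = element
    return output


print(naive_parallel_scan([1, 2, 3, 4, 5, 6]))  # [1, 3, 6, 10, 15, 21]
```

Because each output is computed independently, the iterations of the outer loop could all run at the same time, which is what makes this formulation easy to map onto GPU threads.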

In this approach, multiple CUDA cores work together to produce the output. However, this method does not reduce the total number of iterations, because the first input element still has to be added into every output...
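One reading of this cost argument can be checked with simple arithmetic. The sketch below assumes that the thread for output `idx` accumulates all `idx + 1` inputs up to its position, and counts accumulation steps in Python; the helper names `naive_scan_cost` and `sequential_scan_cost` are hypothetical.

```python
def naive_scan_cost(n):
    """Count accumulation steps when output idx sums inputs 0..idx.

    Returns (total_steps, longest_thread_steps).
    """
    per_thread = [idx + 1 for idx in range(n)]   # steps taken by each logical thread
    total = sum(per_thread)                      # n * (n + 1) / 2 overall
    longest = max(per_thread, default=0)         # the last output touches all n inputs
    return total, longest


def sequential_scan_cost(n):
    """A single running total needs one accumulation step per input."""
    return n


for n in (6, 1024):
    print(n, naive_scan_cost(n), sequential_scan_cost(n))
# 6 (21, 6) 6
# 1024 (524800, 1024) 1024
```

Under these assumptions, the longest-running thread performs as many steps as the sequential loop, and the total work grows quadratically, so the outputs are spread across cores without shortening the critical path.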