Learn CUDA Programming

By: Jaegeun Han, Bharatkumar Sharma
Overview of this book

Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.

Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. You'll not only be guided through GPU features, tools, and APIs, you'll also learn how to analyze performance with sample parallel programming algorithms. This book will help you optimize the performance of your apps by giving insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be generated using multiple GPUs in a box or in multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing in your applications.

Shared memory

Shared memory has always played a vital role in the CUDA memory hierarchy, where it is known as the user-managed cache. It gives users a mechanism to read data from global memory in a coalesced fashion and store it in fast on-chip memory that acts like a cache but is controlled explicitly by the user. In this section, we will not only go through the steps needed to make use of shared memory but also discuss how to load and store data from it efficiently and how it is internally arranged in banks. Shared memory is visible only to threads in the same block, and all of the threads in a block see the same version of a shared variable.
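The staging pattern described above can be sketched with a minimal kernel. Each block loads a tile of the input from global memory into shared memory with coalesced accesses, synchronizes, and then reads the tile back in a different order. The kernel and macro names (`reverse_in_block`, `BLOCK_SIZE`) are illustrative, not from the book, and for simplicity the sketch assumes the input length is a multiple of the block size:

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // illustrative block size; assumed, not from the book

// Each block copies BLOCK_SIZE elements into shared memory, then writes
// them back to global memory reversed within the block.
__global__ void reverse_in_block(int *d_out, const int *d_in, int n)
{
    // Statically allocated shared memory, visible to every thread in this block.
    __shared__ int tile[BLOCK_SIZE];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    if (gid < n)
        tile[tid] = d_in[gid];   // coalesced load from global memory

    // Barrier: all loads into the tile must finish before any thread
    // reads another thread's element.
    __syncthreads();

    if (gid < n)
        d_out[gid] = tile[blockDim.x - 1 - tid];  // read from shared memory
}
```

The `__syncthreads()` barrier is essential here: without it, a thread could read a tile slot that another thread has not yet filled, since threads in a block do not execute in lockstep across the whole block.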

Shared memory has similar benefits to a CPU cache; however, while a CPU cache cannot be explicitly managed, shared memory can. Shared memory has an order of magnitude lower latency than global memory and an order...