Hands-On GPU Computing with Python

By: Avimanyu Bandyopadhyay

Overview of this book

GPUs are proving to be excellent general-purpose parallel computing solutions for high-performance tasks such as deep learning and scientific computing. This book will be your guide to getting started with GPU computing. It begins by introducing GPU computing and explaining the GPU architecture and programming models. You will learn, by example, how to perform GPU programming with Python, and look at using integrations such as PyCUDA, PyOpenCL, CuPy, and Numba with Anaconda for various tasks such as machine learning and data mining. In addition, you will get to grips with GPU workflows, management, and deployment using modern containerization solutions. Toward the end of the book, you will become familiar with the principles of distributed computing for training machine learning models and enhancing efficiency and performance. By the end of this book, you will be able to set up a GPU ecosystem for running complex applications and data models that demand great processing capabilities, and to manage memory efficiently so that your applications run quickly and effectively.
Table of Contents (17 chapters)
Section 1: Computing with GPUs Introduction, Fundamental Concepts, and Hardware
Section 2: Hands-On Development with GPU Programming
Section 3: Containerization and Machine Learning with GPU-Powered Python

Understanding how CUDA-C/C++ works via a simple example

By now, you should be aware of the computational advantages of CUDA C/C++ from our earlier discussions. C/C++ coupled with CUDA lets you modify parts of your source code so that they run on the GPU, which speeds up your computations. We will explore the primary steps needed to implement CUDA code through a GPU program.

Please type the code used in this book into your IDE manually from this point onward. Copying and pasting directly from the PDF will break the code's indentation and leave it unable to run as is.

First, let's look at the following conventional C++ program, which multiplies the elements of two arrays using double precision. We'll run the computation on 500 million elements on the CPU. All the elements of the p and q arrays are set to 24 and 12, respectively.

The following is the C++ program we've just described ...
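
As a rough illustration only, and not the book's own listing, a CPU-only element-wise multiplication along the lines described above might be sketched as follows. The function names `multiply` and `run_multiplication`, the choice to store the product back into `q`, and the use of `std::vector` are all assumptions made for this sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the book): element-wise product on the CPU.
// Each q[i] is overwritten with p[i] * q[i].
void multiply(std::size_t n, const double* p, double* q) {
    for (std::size_t i = 0; i < n; ++i) {
        q[i] = p[i] * q[i];
    }
}

// Hypothetical driver: fills p with 24.0 and q with 12.0 as in the text,
// multiplies them, and returns the result so the caller can inspect it.
std::vector<double> run_multiplication(std::size_t n) {
    std::vector<double> p(n, 24.0);
    std::vector<double> q(n, 12.0);
    multiply(n, p.data(), q.data());
    return q;
}
```

Calling `run_multiplication(500000000)` would match the size used in the text, but the two arrays alone need roughly 8 GB of memory (2 × 500 million × 8 bytes), so a smaller `n` is a safer first run. Every element of the returned array should equal 288 (24 × 12).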