GPU-Accelerated Computing with Python 3 and CUDA
General-purpose GPU computing (GPGPU) has revolutionized scientific and engineering fields, driving advancements in physics, chemistry, machine learning, and beyond. CUDA, the leading GPGPU framework, underpins these breakthroughs—including the AI revolution and the rise of large language models.
Until recently, GPU programming remained in the domain of professional software engineers with deep expertise in C/C++ and GPU hardware. However, NVIDIA's push to integrate low-level CUDA primitives directly into Python, culminating in the 2025 CUDA Python project, is challenging this paradigm. Combined with the RAPIDS ecosystem, which offers GPU-accelerated alternatives to familiar Python libraries, GPU programming is now accessible to data scientists, researchers, and Python developers without requiring them to leave the comfort of Python.
This book is about bridging the gap between low-level GPU programming and high-level Python tools. The first half focuses on CUDA fundamentals using Numba-CUDA, emphasizing performance and profiling. The second half builds on these foundations, exploring high-level libraries in the RAPIDS and JAX ecosystems. The final chapters bring everything together through practical applications across various disciplines, demonstrating both the power and limitations of GPGPU.
Our goal is to make GPU computing accessible to every Python programmer, beyond just using a high-level machine learning framework, empowering you to leverage its potential in your own work.
This book is for Python developers, data scientists, engineers, and researchers who want to accelerate numerical computations using GPUs, without needing to master C/C++. If you're a domain expert who relies on Python for scientific computing or data-intensive tasks and need finer control than high-level frameworks provide, this book will help you unlock GPU performance and understand the fundamentals of hardware acceleration.
You'll get the most out of this book if the following apply:
This book bridges the gap between high-level tools and low-level control, empowering you to write efficient, high-performance GPU code, all while staying in Python.
Chapter 1, Why GPU Programming with CUDA in Python 3?, introduces GPGPU and how GPUs achieve impressive computing speed through massive parallelization. The rest of the chapter explores the benefits and limits of parallelization using theory (e.g., Amdahl's law) and practice (profiling).
Chapter 2, Setting Up a GPU Programming Environment Locally and in the Cloud, guides you through the process of setting up a development environment with Pixi for working with CUDA in Python on your local machine (Windows, WSL, or Linux) or in the cloud (Google Colab, Lambda Labs, or an EC2 virtual machine on AWS).
Chapter 3, Writing and Executing CUDA Kernels with Numba-CUDA, covers CUDA kernel development using the CUDA backend to Numba, a Python JIT compiler. All the basic elements of the CUDA programming model are covered, including kernels, kernel launch configurations (grids, blocks, and threads), device functions, host-device data transfers, thread synchronization, and atomics. In addition, Numba-CUDA-specific features, such as the vectorize decorator, are also covered.
Chapter 4, Profiling and Debugging CUDA Code, introduces several profilers to investigate the performance of CUDA code in Python, including Python's built-in timing functionality, Scalene, Nsight Systems for application- and system-level profiling, and Nsight Compute for detailed kernel-level profiling. In addition, debugging techniques such as printing from a CUDA kernel and emulating GPU execution on the CPU are covered.
Chapter 5, Optimizing the Performance of CUDA Code, explores GPU hardware and the CUDA execution model, laying the groundwork for identifying common bottlenecks in CUDA applications. Core principles for performance optimization, including maximizing occupancy, hiding latency through parallelism, and designing efficient memory access patterns, are covered. Kernel profiling is used to demonstrate their impact.
Chapter 6, Enabling Concurrency Using CUDA Streams, shows how CUDA streams can be used with Numba-CUDA to perform data transfers and computation concurrently, thereby improving the performance of some applications. Several pitfalls related to implicit synchronization are demonstrated. Finally, cross-stream synchronization via CUDA events is also demonstrated.
Chapter 7, Scaling to Multiple GPUs, introduces multi-GPU computing and explains the key use cases. First, the principle is demonstrated at a low level by explicitly managing multiple devices with Numba-CUDA. Then, practical multi-GPU workflows in Python are illustrated using Dask-CUDA and JAX.
Chapter 8, Bringing NumPy and SciPy to the GPU with CuPy, shows how to execute high-level NumPy-like and SciPy-like operations on the GPU, using the CuPy library. The API of CuPy is covered, as well as how CuPy interoperates with other libraries, such as Numba-CUDA. Performance tips are covered, as well as some tricks to write GPU-agnostic code.
Chapter 9, Bringing pandas and scikit-learn to the GPU with RAPIDS, introduces the cuDF and cuML libraries from the RAPIDS ecosystem to perform data science on the GPU using familiar pandas and scikit-learn workflows. The chapter covers the essentials of the cuDF and cuML API and works out an example data science regression use case.
Chapter 10, Solving Optimization Problems on the GPU with JAX, introduces JAX as a powerful framework for optimization, highlighting its key features: JIT compilation, automatic differentiation, and vectorization. JAX is demonstrated by solving a linear regression problem, building a neural network from scratch, and modeling an electrical circuit using a physics-informed neural network.
Chapter 11, Solving the Heat Equation on the GPU, is an application-focused chapter that tackles the Laplace heat equation, a fundamental second-order partial differential equation, using the finite difference method. The chapter covers discretization and boundary conditions and progresses from a CPU implementation to a GPU-accelerated version with Numba-CUDA. The chapter concludes by profiling the GPU kernel to pinpoint performance bottlenecks.
Chapter 12, Image Processing and Computer Vision on the GPU, explores the fundamentals of image processing on the GPU, using both Numba-CUDA kernels and high-level libraries such as cuCIM (RAPIDS). A spatial convolution filter is implemented for blurring and edge detection and then profiled. Then, a real-world object detection task is tackled, using several techniques for segmenting and classifying objects in an image. Three classification approaches are compared: shape-based description, template matching, and a convolutional neural network built with JAX.
Chapter 13, Simulating Atomic Interactions on the GPU, introduces the principles of molecular dynamics (MD) simulations and demonstrates how to implement and run them on the GPU. The theory of interatomic potentials and time integration is covered, culminating in a simulation of a monoatomic gas. The chapter concludes with profiling the code to identify bottlenecks and optimization opportunities.
Chapter 14, Implementing Your Own Transformer-Based Language Model, is the final application chapter that walks through the process of creating a language model from scratch using JAX, Flax, and Optax. The transformer architecture and the attention mechanism are explained and implemented, and the model is trained on the IMDb dataset. In the process, the text preprocessing steps, such as tokenization, are also explained.
Chapter 15, Expanding and Deepening Your GPU Programming Knowledge, rounds off the book by exploring and providing references for several advanced low-level topics (such as interfacing CUDA-C with Numba-CUDA), specialized applications (such as graph analytics), other heterogeneous computing paradigms (such as OpenCL), and graphics APIs (such as Vulkan).
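To make the Amdahl's law argument mentioned for Chapter 1 concrete, here is a minimal sketch in plain Python. It is not taken from the book; the function names amdahl_speedup and max_speedup are hypothetical, and the model assumes the parallel part scales perfectly with the number of workers, which real GPU code rarely achieves.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup under Amdahl's law.

    parallel_fraction: share of the original runtime that can be parallelized (0..1).
    workers: number of parallel workers (e.g., GPU threads), assumed to scale ideally.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; the parallel part shrinks by a factor of `workers`.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound on the speedup as the number of workers grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Illustrative value only: assume 90% of the runtime is parallelizable.
    p = 0.9
    for n in (1, 8, 64, 4096):
        print(f"{n:>5} workers: {amdahl_speedup(p, n):.2f}x")
    print(f"limit: {max_speedup(p):.2f}x")
```

The point of the sketch is that the serial fraction sets a hard ceiling: with 10% of the runtime left serial, no number of GPU threads can push the overall speedup past roughly 10x, which is why measuring that fraction through profiling comes first.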
You will need an NVIDIA GPU with compute capability 7.5 or later, a recent version of Python, and the CUDA Toolkit, as well as several Python packages. It is recommended to create the development environment using the Pixi package manager, using the pyproject.toml and pixi.lock files, which can be found in the associated GitHub repository (see the next section). If you don't have access to a CUDA-enabled GPU on your local system, Chapter 2 explains how to rent one in the cloud and set up the environment there.
| Software/hardware covered in the book | Operating system requirements |
| Python 3.12 | Windows or Linux if developing locally. Any OS if developing in the cloud. |
| CUDA Python (Numba-CUDA) | Internet connectivity |
| RAPIDS (cuDF, cuML, cuCIM) | |
| JAX, Flax, Optax | |
| Nsight Systems and Nsight Compute | |
Installation instructions are detailed in Chapter 2 of the book. Any additional dependencies are listed at the start of the chapter that requires them.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/GPU-Accelerated-Computing-with-Python-3-and-CUDA. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781803245423.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: "We can time all these different parts using time.perf_counter to estimate how much our code would benefit from parallelization."
A block of code is set as follows:
import time
start = time.perf_counter()
# code we want to time here
end = time.perf_counter()
print(end - start)
Any command-line input or output is written as follows:
python -m cProfile -s cumtime script.py
Bold: Indicates a new term, an important word, or words that you see on the screen. For instance, words in menus or dialog boxes appear in the text like this. For example: "However, many problems are memory-bound (i.e., the bottleneck is the speed at which cores can get the data they need)."
Warnings or important notes appear like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book or have any general feedback, please email us at customercare@packt.com and mention the book's title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you reported this to us. Please visit http://www.packt.com/submit-errata, click Submit Errata, and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packt.com/.