Chapter 3: Writing and Executing CUDA Kernels with Numba-CUDA | GPU-Accelerated Computing with Python 3 and CUDA

GPU-Accelerated Computing with Python 3 and CUDA

By: Niels Cautaerts, Hossein Ghorbanfekr

Overview of this book

Writing high-performance Python code doesn’t have to mean switching to C++. This book shows you how to accelerate Python applications using NVIDIA’s CUDA platform and a modern ecosystem of Python tools and libraries. Aimed at researchers, engineers, and data scientists, it offers a practical yet deep understanding of GPU programming and how to fully exploit modern GPU hardware. You’ll begin with the fundamentals of CUDA programming in Python using Numba-CUDA, learning how GPUs work and how to write, execute, and debug custom GPU kernels. Building on this foundation, the book explores memory access optimization, asynchronous execution with CUDA streams, and multi-GPU scaling using Dask-CUDA. Performance analysis and tuning are emphasized throughout, using NVIDIA Nsight profilers. You’ll also learn to use high-level GPU libraries such as JAX, CuPy, and RAPIDS to accelerate numerical Python workflows with minimal code changes. These techniques are applied to real-world examples, including PDE solvers, image processing, physical simulations, and transformer models. Written by experienced GPU practitioners, this hands-on guide emphasizes reproducible workflows using Python 3.10+, CUDA 12.3+, and tools like the Pixi package manager. By the end, you’ll have future-ready skills for building scalable GPU applications in Python.

Preface

Free benefits with your book

Part 1: Fundamentals of GPU programming with CUDA in Python 3

Chapter 1: Why GPU Programming with CUDA in Python 3?

Technical requirements

What is GPGPU?

Estimating the benefits of parallelization

Understanding the factors that limit parallelism

Measuring the benefits of parallelization empirically

Summary

Questions

Answers

Chapter 2: Setting Up a GPU Programming Environment Locally and in the Cloud

Technical requirements

Setting up a local development environment

Setting up a remote development environment

Summary

Questions

Answers

Chapter 3: Writing and Executing CUDA Kernels with Numba-CUDA

Technical requirements

Writing and executing a CUDA kernel

Dealing with a mismatched grid and problem size

Making kernels modular with device functions

Avoiding race conditions in CUDA kernels

Language features that can be used in Numba-CUDA kernels

Alternative methods for kernel definition and invocation

Summary

Questions

Answers

Chapter 4: Profiling and Debugging CUDA Code

Technical requirements

Why profiling and debugging matter

Challenges in GPU profiling and debugging

Key performance aspects in profiling CUDA applications

Basic GPU profiling tools

NVIDIA Nsight profiling tools

Debugging Numba-CUDA kernels

Summary

Questions

Answers

Part 2: Performance Optimization and Advanced CUDA Topics

Chapter 5: Optimizing the Performance of CUDA Code

How a GPU executes kernels

Maximizing occupancy

Improving performance by inspecting PTX assembly code

Maximizing the use of instruction-level parallelism with loop unrolling

Avoiding warp divergence

Efficient global memory access patterns

Avoiding frequent global memory data access

Using intrinsic and libdevice functions

Using cooperative groups instead of multiple kernel launches

Summary

Questions

Answers

Chapter 6: Enabling Concurrency Using CUDA Streams

Technical requirements

Why CUDA streams are needed

Key concepts in CUDA streams

Creating and managing streams

Asynchronous data transfers

Speeding up image processing with CUDA streams

CUDA events

Multiple CPU threads with CUDA streams

Summary

Questions

Answers

Chapter 7: Scaling to Multiple GPUs

Technical requirements

Introduction to multi-GPU computing

Multi-GPU techniques with Numba

Distributed computing with Dask

Multi-GPU computing with JAX

Summary

Questions

Answers

Part 3: Using High-Level Python Libraries for GPU Computation

Chapter 8: Bringing NumPy and SciPy to the GPU with CuPy

Technical requirements

What CuPy offers

How to use CuPy arrays

How to use CuPy with other libraries

When to use CuPy

How to write GPU-agnostic code

Performance tips

Summary

Questions

Answers

Chapter 9: Bringing pandas and scikit-learn to the GPU with RAPIDS

Technical requirements

Accelerating data science on the GPU with RAPIDS

How to use cuDF: pandas on the GPU

When to use cuDF

How to write GPU-agnostic DataFrame manipulation code

Alternatives to cuDF

How to use cuML: scikit-learn on the GPU

When to use cuML

How to write GPU-agnostic cuML code

Summary

Questions

Answers

Chapter 10: Solving Optimization Problems on the GPU with JAX

Introduction to JAX's key features

Building a linear regression model with JAX

Building a neural network

Training a physics-informed neural network

Summary

Questions

Answers

Part 4: Real-World Example Applications

Chapter 11: Solving the Heat Equation on the GPU

Technical requirements

The heat equation

Solving the equation

GPU implementation

Performance analysis

Summary

Questions

Answers

Chapter 12: Image Processing and Computer Vision on the GPU

Technical requirements

How a computer represents images

The basics of image processing

Implementing a convolutional filter from scratch

Using convolutional filters from high-level libraries

Case study: detecting and classifying objects in a noisy image

Summary

Questions

Answers

Chapter 13: Simulating Atomic Interactions on the GPU

Technical requirements

How MD simulations work

System initialization

Atomic interactions

Time integration

Data collection

Running the simulation

Performance analysis

Benchmarks

Summary

Questions

Answers

Chapter 14: Implementing Your Own Transformer-Based Language Model

Technical requirements

Introduction to language models

Initializing constant parameters

Understanding the attention mechanism

Implementing the transformer block

Building the language model

Loading data and tokenization

Training the language model

Summary

Questions

Answers

Part 5: Beyond This Book

Chapter 15: Expanding and Deepening Your GPU Programming Knowledge

Technical requirements

Advanced low-level features

Specialized applications

Other GPU programming platforms

Graphics APIs

Summary

Questions

Answers

Chapter 16: Unlock Your Exclusive Benefits

Unlock this Book's Free Benefits in 3 Easy Steps

Other Books You May Enjoy

Subscribe to Deep Engineering

Index

Making kernels modular with device functions
