Hands-On High Performance with Go

By: Bob Strecansky
Overview of this book

Go is an easy-to-write language that is popular among developers thanks to features such as concurrency, portability, and its ability to reduce complexity. This Golang book will teach you how to construct idiomatic Go code that is reusable and highly performant. Starting with an introduction to performance concepts, you'll understand the ideology behind Go's performance. You'll then learn how to effectively implement Go data structures and algorithms, and explore data manipulation and organization so that you can write programs for scalable software. This book covers channels and goroutines for parallelism and concurrency to write high-performance code for distributed systems. As you advance, you'll learn how to manage memory effectively. You'll explore the Compute Unified Device Architecture (CUDA) application programming interface (API), use containers to build Go code, and work with the Go build cache for quicker compilation. You'll also get to grips with profiling and tracing Go code to detect bottlenecks in your system. Finally, you'll evaluate clusters and job queues for performance optimization and monitor the application for performance regression. By the end of this Go programming book, you'll be able to improve existing code and fulfill customer requirements by writing efficient programs.
Table of Contents (20 chapters)
Section 1: Learning about Performance in Go
Section 2: Applying Performance Concepts in Go
Section 3: Deploying, Monitoring, and Iterating on Go Programs with Performance in Mind

CUDA – powering the program

Once we have all of our CUDA dependencies installed and running, we can start with a simple CUDA C++ program:

  1. First, we'll include all of our necessary header files and define the number of elements we'd like to process. 1 << 20 is 1,048,576, which is more than enough elements to make an adequate GPU test. You can change the shift amount if you'd like to see the difference in processing time:
#include <cstdlib>
#include <iostream>

const int ELEMENTS = 1 << 20;

Our multiply function is marked with the __global__ specifier. This tells nvcc, the CUDA-specific C++ compiler, that the function is a kernel to be executed on the GPU. This multiply function is relatively straightforward: it takes the a and b arrays, multiplies them together element by element on the GPU, and stores the results in the c array:

__global__ void multiply(int j, float...
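
The listing above is truncated, but a kernel matching this description could be sketched as follows. The parameter names after the first argument and the grid-stride loop are my assumption, not necessarily the book's exact listing:

```cuda
// Sketch of an element-wise multiply kernel. Each thread computes its
// global index from its block and thread coordinates, then strides
// across the array so any grid size covers all n elements.
__global__ void multiply(int n, float *a, float *b, float *c) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) {
        c[i] = a[i] * b[i];
    }
}
```

A kernel like this would typically be launched from host code with something like `multiply<<<numBlocks, blockSize>>>(ELEMENTS, a, b, c);`, where the block size (commonly 256 threads) and block count are tuning parameters.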