Book Image

Hands-On Concurrency with Rust

By : Brian L. Troutwine
Book Image

Hands-On Concurrency with Rust

By: Brian L. Troutwine

Overview of this book

Most programming languages can really complicate things, especially with regard to unsafe memory access. The burden on you, the programmer, lies across two domains: understanding the modern machine and your language's pain-points. This book will teach you to how to manage program performance on modern machines and build fast, memory-safe, and concurrent software in Rust. It starts with the fundamentals of Rust and discusses machine architecture concepts. You will be taken through ways to measure and improve the performance of Rust code systematically and how to write collections with confidence. You will learn about the Sync and Send traits applied to threads, and coordinate thread execution with locks, atomic primitives, data-parallelism, and more. The book will show you how to efficiently embed Rust in C++ code and explore the functionalities of various crates for multithreaded applications. It explores implementations in depth. You will know how a mutex works and build several yourself. You will master radically different approaches that exist in the ecosystem for structuring and managing high-scale systems. By the end of the book, you will feel comfortable with designing safe, consistent, parallel, and high-performance applications in Rust.
Table of Contents (18 chapters)
Title Page
Copyright and Credits
Dedication
Packt Upsell
Contributors
Preface
Index

The machine


In this book, independent of the specific Rust techniques, we'll attempt to teach a kind of mechanical sympathy with the modern parallel computer. There are two model kinds of parallelism we'll touch on—concurrent memory operations and data parallelism. We'll spend most of our time in this book on concurrent memory operations, the kind of parallelism in which multiple CPUs contend to manipulate a shared, addressable memory. Data parallelism, where the CPU is able to operate with a single or multiple instructions on multiple words at a time concurrently, will be touched on, but the details are CPU specific, and the necessary intrinsics are only now becoming available in the base language as this book goes to press. Fortunately, Rust, as a systems language with modern library management, will easily allow us to pull in an appropriate library and emit the correct instructions, or we could inline the assembly ourselves.

Literature on the abstract construction of algorithms for parallel machines must choose a machine model to operate under. The parallel random access machine (PRAM) is common in the literature. In this book, we will focus on two concrete machine architectures:

  • x86
  • ARM

These machines were chosen because they are common and because they each have specific properties that will be important when we get to Chapter 6, Atomics – the Primitives of Synchronization. Actual machines deviate from the the PRAM model in important ways. Most obviously, actual machines have a limited number of CPUs and a bounded amount of RAM. Memory locations are not uniformly accessible from each CPU; in fact, cache hierarchies have a significant impact on the performance of computer programs. None of this is to say that PRAM is an absurd simplification, nor is this true of any other model you'll find in literature. What should be understood is that as we work, we'll need to draw lines of abstraction out of necessity, where further detail does not improve our ability to solve problems well. We also have to understand how our abstractions, suited to our own work, relate to the abstractions of others so that we can learn and share. In this book, we will concern ourselves with empirical methods for understanding our machines, involving careful measurement, examination of assembly code, and experimentation with alternative implementations. This will be combined with an abstract model of our machines, more specific to today's machines than PRAM, but still, in the details, focusing on total cache layers, cache sizes, bus speeds, microcode versions, and so forth. The reader is encouraged to add more specificity should the need arise and should they feel so emboldened.

The CPU

The CPU is a device that interprets a stream of instructions, manipulating storage and other devices connected to it in the process. The simplest model of a CPU, and the one that is often introduced first in undergrad computer science, is that a CPU receives an instruction from some nebulous place, performs the interpretation, receives the next instruction, interprets that, and so forth. The CPU maintains an internal pace via an oscillator circuit, and all instructions take a defined number of oscillator pulses—or clock cycles—to execute. In some CPU models, every instruction will take the same number of clock cycles to execute, and in others the cycle count of instructions will vary. Some CPU instructions modify registers, have very low latency with specialized but exceedingly finite memory locations built into the CPU. Other CPU instructions modify the main memory, RAM. Other instructions move or copy information between registers and RAM or vice versa. RAM—whose read/write latency is much higher than that of registers but is much more plentiful—and other storage devices, such as SSDs, are connected to the CPU by specialized buses.

The exact nature of these buses, their bandwidth and transmission latency, varies between machine architectures. On some systems, every location in the RAM is addressable—meaning that it can be read or written to—in constant time from the available CPUs. In other systems, this is not the case—some RAM is CPU-local and some is CPU-remote. Some instructions control special hardware interrupts that cause memory ranges to be written to other bus-connected storage devices. Mostly, these devices are exceedingly slow, compared to the RAM, which is itself slow compared to the registers.

All of this is to explain that in the simplest model of a CPU, where instructions are executed serially, instructions may well end up stalling for many clock cycles while waiting for reads or writes to be executed. To that end, it's important to understand that almost all CPUs—and especially the CPUs we'll concern ourselves with in this book—perform out-of-order executions of their instructions. Just so long as a CPU can prove that two sequences of instructions access the memory distinctly—that is, they do not interfere with each other—then the CPU is free and will probably reorder instructions. This is one of the things that makes C's undefined behavior concerning uninitialized memory so interesting. Perhaps your program's future has already filled in the memory, or perhaps not. Out-of-order execution makes reasoning about a processor's behavior difficult, but its benefit is that CPUs can execute programs much faster by deferring a sequence of instructions that is stalled on some kind of memory access.

In the same spirit, most modern CPUs—and especially the CPUs we'll concern ourselves with in this book—perform branch prediction. Say that branches in our programs tend to branch the same way at execution time—for example, say we have a feature-flag test that is configured to be enabled for the lifetime of a program. CPUs that perform branch prediction will speculatively execute one side of a branch that tends to branch in a certain way while they wait on other stalled instruction pipelines' instructions. When the branch instruction sequence catches up to its branch test, and if the test goes the predicted way, there's already been a great deal of work done and the instruction sequence can skip well ahead. Unfortunately, if the branch was mispredicted, then all this prior work must be torn down and thrown away, and the correct branch will have to be computed, which is quite expensive. It's for this reason that you'll find that programmers who worry about the nitty-gritty performance characteristics of their programs will tend to fret about branches and try to remove them.

All of this is quite power-hungry, reordering instructions to avoid stalls or racing ahead to perform computations that may be thrown away. Power-hungry implies hot, which implies cooling, which implies more power expenditure. All of which is not necessarily great for the sustainability of technological civilization, depending on how the electricity for all this is generated and where the waste heat is dumped. To that end, many modern CPUs integrate some kind of power-scaling features. Some CPUs will lengthen the time between their clock pulses, meaning they execute fewer instructions in a certain span of time than they might otherwise have. Other CPUs race ahead as fast as they normally would and then shut themselves off for a spell, drawing minimal electricity and cooling in the meantime. The exact method by which to build and run a power-efficient CPU is well beyond the scope of this book. What's important to understand is that, as a result of all this, your program's execution speed will vary from run to run, all other things being equal, as the CPU decides whether or not it's time to save power. We'll see this in the next chapter when we manually set power-saving settings.

Memory and caches

The memory storage of a CPU, alluded to in the last section, is fairly minimal. It is limited to the handful of words in general-purpose registers, plus the special-purpose registers in some limited cases. Registers are very fast, owing to their construction and on-die location, but they are not suited for storage. Modern machines connect CPUs over a bus or buses to the main memory, a very large block of randomly addressable bytes. This random addressability is important as it means that, unlike other kinds of storage, the cost to retrieve the 0th byte from RAM is not distinct from retrieving the 1,000,000,000th byte. We programmers don't have to do any goofy trickery to ensure that our structures appear in the front of the RAM in order to be faster to retrieve or modify, whereas physical location in storage is a pressing concern for spinning hard disks and tape drives. Exactly how our CPUs interact with the memory varies between platforms, and the discussion that follows is heavily indebted to Mark Batty's description in his 2014 book, The C11 and C++11Concurrency Model.

In a machine that exposes a sequentially consistent model of memory access, every memory load or store must be made in lockstep with one another, including in systems with multiple CPUs or multiple threads of execution per CPU. This limits important optimizations—consider the challenge of working around memory stalls in a sequentially consistent model—and so neither of the processors we'll be considering in this book offer this model. It will show up in literature by name because of the ease of reasoning about this model, and is worth knowing about, especially when studying atomics literature.

The x86 platform behaves as if it were sequentially consistent, excepting that every thread of execution maintains a FIFO buffer for writes prior to their being flushed to the main memory. In addition, there is a global lock for the coordination of atomic reads and writes between threads of execution. There's a lot to unpack here. Firstly, I use the words load and store interchangeably with read and write, as does most literature. There is also a variant distinction between plain load/store and atomic load/store. Atomic loads/stores are special in that their effects can be coordinated between threads of execution, allowing for coordination with varying degrees of guarantees. The x86 processor platform provides fence instructions that force the flush of write buffers, stalling other threads of execution attempting to access the written range of main memory until the flush is completed. This is the purpose of the global lock. Without atomicity, writes will be flushed willy-nilly. On x86 platforms, writes to an addressable location in the memory are coherent, meaning they are globally ordered, and reads will see values from that location in the order of the writes. Compared to ARM, the way this works on x86 is very simple—writes happen directly to main memory.

Let's look at an example. Taking inspiration from Batty's excellent dissertation, consider a setup where a parent thread sets two variables, x and y, to 0, then spawns two threads called, say, A and B. Thread A is responsible for setting x and then y to 1, whereas thread B is responsible for loading the value of x into a thread-local variable called thr_x and y into a thread-local variable called thr_y. This looks something like the following:

WRITE x := 0
WRITE y := 0
[A] WRITE x := 1
[A] WRITE y := 1
FLUSH A
[B] READ x
[B] WRITE thr_x := x
[B] READ y
[B] WRITE thr_y := y

In this specific example, thr_x == 1 and thr_y == 1. Had the flushes been ordered differently by the CPU, this outcome would have been different. For instance, look at the following:

WRITE x := 0
WRITE y := 0
[A] WRITE x := 1
[A] WRITE y := 1
[B] READ x
[B] WRITE thr_x := x
FLUSH A
[B] READ y
[B] WRITE thr_y := y

The consequence of this is that thr_x == 0 and thr_y == 1. Without any other coordination, the only other valid interleaving is thr_x == 0 and  thr_y == 0. That is, as a result of the write buffer on x86, the write of x in thread A can never be reordered to occur after the write of y: thr_x == 1 and thr_y == 0. This kind of stinks, unless you enjoy this little program as a parlor trick. We want determinism out of our programs. To that end, x86 provides different fence and lock instructions that control how and when threads flush their local write buffers, and how and when threads may read byte ranges from the main memory. The exact interplay here is… complicated. We'll come back to it in great detail in Chapter 3The Rust Memory Model – Ownership, References, and Manipulation and Chapter 6, Atomics – the Primitives of Synchronization. Suffice it to say that, for now, there's an SFENCE instruction available that forces a sequential consistency. We can employ this instruction as follows:

WRITE x := 0
WRITE y := 0
[A] WRITE x := 1
[A] WRITE y := 1
[A] SFENCE        [B] SFENCE
                  [B] READ x
                  [B] WRITE thr_x := x
                  [B] READ y
                  [B] WRITE thr_y := y

From this, we get thr_x == 1 and thr_y == 1. The ARM processor is a little different—1/0 is a valid interleaving. There is no global lock to the ARM, no fences, and no write buffer. In ARM, a string of instructions—called a propagation list, for reasons that will soon be clear—is maintained in an uncommitted state, meaning that the thread that originated the instructions will see them as in-flight, but they will not have been propagated to other threads. These uncommitted instructions may have been performed—resulting in side-effects in the memory—or not, allowing for speculative execution and the other performance tricks discussed in the previous section. Specifically, reads may be satisfied from a thread's local propagation list, but not writes. Branch instructions cause the propagation list that led to the branch instruction to be committed, potentially out of the order from that specified by the programmer. The memory subsystem keeps track of which propagation list has been sent to which thread, meaning that it is possible for a thread's private loads and stores to be out of order and for committed loads and stores to appear out of order between threads. Coherency is maintained on ARM with more active participation from the memory subsystem.

Whereas on x86, the main memory is something that has actions done to it, on ARM, the memory subsystem responds to requests and may invalidate previous, uncommitted requests. A write request involves a read-response event, by which it is uniquely identified, and a read request must reference a write.

This sets up a data-dependency chain. Responses to reads may be invalidated at a later time, but not writes. Each location receives a coherence-commitment ordering, which records a global order of writes, built up per-thread as propagation lists are propagated to threads.

This is very complicated. The end result is that writes to a thread's view of the memory may be done out of programmer order, writes committed to main memory may also be done out of order, and reads can be requested out of programmer order. Therefore, the following code is perfectly valid, owing to the lack of branches in our example: 

WRITE x := 0
WRITE y := 0
[A] WRITE x := 1
[B] READ x
[A] WRITE y := 1
[B] WRITE thr_y := y
[B] WRITE thr_x := x
[B] READ y

ARM provides instructions for control of dependencies, called barriers. There are three types of dependency in ARM. The address dependency means a load is used to compute the address for access to the memory. The control dependency means that the program flow that leads to memory access depends on a load. We're already familiar with the data dependency.

While there is a lot of main memory available, it's not particularly fast. Well, let's clarify that. Fast here is a relative term. A photon in a vacuum will go about 30.5 centimeters in 1 nanosecond, meaning that in ideal circumstances, I, on the west coast of the United States, ought to be able to send a message to the east coast of the United States and receive a response in about 80 milliseconds. Of course, I'm ignoring request-processing time, the realities of the internet, and other factors; we're just dealing with ballpark figures, here. Consider that 80 milliseconds is 80,000,000 nanoseconds. A read access from a CPU to the main memory is around 100 nanoseconds, give or take your specific computer architecture, the details of your chips, and other factors. All this is to clarify that, when we say, not particularly fast, we're working outside normal human time scales.

It is difficult, sometimes, to keep from misjudging things as fast enough. Say, we have a 4 GHz processor. How many clock cycles do we get per nanosecond? Turns out, it's 4. Say, we have an instruction that needs to access main memory—which, remember, takes 100 nanoseconds—and happens to be able to do its work in exactly 4 cycles, or one nanosecond. That instruction will then be stalled for 99 nanoseconds while it waits, meaning we're potentially losing out on 99 instructions that could have been executed by the CPU. The CPU will make up some of that loss with its optimization tricks. These only go so far, unless our computation is very, very lucky.

In an effort to avoid the performance impact of the main memory on computation, processor manufacturers introduced caching between the processor and the main memory. Many machines these days have three layers of data cache, as well as an instruction cache, called dCACHE and iCACHE, which are tools we'll be using later. You will see dCACHE often consists of three layers these days, each layer being successively larger than the last, but also slower for cost or power concerns. The lowest, smallest layer is called L1, the next L2, and so forth. CPUs read into caches from the main memory in working blocks—or simply blocks—that are the size of the cache being read into. Memory access will preferentially be done on the L1 cache, then L2, then L3, and so forth, with time penalties at each level for missing and block reads. Cache hits, however, are significantly faster than going directly to the main memory. L1 dCACHE references are 0.5 nanoseconds, or fast enough that our hypothetical 4-cycle instruction is 8 times slower than the memory access it requires. L2 dCACHE references are 7 nanoseconds, still a fair sight better than the main memory. Of course, the exact numbers will vary from system to system, and we'll do quite a bit in the next chapter to measure them directly. Cache coherency—maintaining a high ratio of hits to misses—is a significant component of building fast software on modern machines. We'll never get away from the CPU cache in this book.

Memory model

The details of a processor's handling of memory is both complicated and very specific to that processor. Programming languages—and Rust is no exception here—invent a memory model to paper over the details of all supported processors while, ideally, leaving the programmer enough freedom to exploit the specifics of each processor. Systems languages also tend to allow absolute freedom in the form of escape hatches from the language memory model, which is exactly what Rust has in terms of unsafe.

With regard to its memory model, Rust is very much inspired by C++. The atomic orderings exposed in Rust are those of LLVM's, which are those of C++11. This is a fine thing—any literature to do with either C++ or LLVM will be immediately applicable to Rust. Memory order is a complex topic, and it's often quite helpful to lean on material written for C++ when learning. This is especially important when studying up on lock-free/wait-free structures—which we'll see later in this book—as literature on those topics often deals with it in terms of C++. Literature written with C11 in mind is also suitable, if maybe a little less straightforward to translate.

Now, this is not to say that the C++ programmer will be immediately comfortable in concurrent Rust. The digs will be familiar, but not quite right. This is because Rust's memory model also includes a notion of reference consumption. In Rust, a thing in memory must be used zero or one times, but no more. This, incidentally, is an application of a version of linear-type theory called affine typing, if you'd like to read up more on the subject. Now, the consequence of restricting memory access in this way is that Rust is able to guarantee at compile-time safe memory access—threads cannot reference the same location in memory at the same time without coordination; out-of-order access in the same thread are not allowed, and so forth. Rust code is memory safe without relying on a garbage collector. In this book's estimation, memory safety is a good win, even though the restrictions that are introduced do complicate implementing certain kinds of structures that are more straightforward to build in C++ or similar languages.

This is a topic we'll cover in much greater detail in Chapter 3, The Rust Memory Model – Ownership, References and Manipulation.