
Translation issues


When bringing your C/C++/Java mindset to Rust, or directly porting a project, you might find yourself writing code similar to what you wrote in your original language, but if you have tried it, you might have noticed that it performs poorly, or at least much worse than your old code. This is especially noticeable coming from C/C++; in the case of Java, the relative loss is much smaller, since a Java application already carries the high memory and computation footprint of its Java Virtual Machine and garbage collector.

But why does a direct translation hurt the performance? We will see in this section how Rust's guarantees can sometimes create unnecessary boilerplate instructions, and we will learn how to bypass them by using safe and efficient code. Of course, in some performance-critical situations, unsafe scopes might be needed, but in general, that's not the case.

Indexing degradations

Let's start with a simple example. This Rust code will perform poorly:

    let arr = ['a', 'b', 'c', 'd', 'e', 'f'];

    for i in 0..arr.len() {
      println!("{}", arr[i]);
    }

This will, of course, work, and it's perfectly safe. We create an index that goes from 0 to the length of the array (6 in this case), excluding the upper bound, so the i binding will take the values 0, 1, 2, 3, 4, and 5. For each of them, it will get the element at that index in the array and print it on a new line. There is one problem with this approach, though. In C/C++, equivalent code would simply add the size of the element to a pointer into the array to get the next element, but that sometimes causes issues. Look at this code:

    let arr = ['a', 'b', 'c', 'd', 'e', 'f'];

    for i in 0..arr.len() + 1 {
      println!("{}", arr[i]);
    }

In this case, we are iterating until the array length + 1, and since ranges are exclusive at the end, the last index will be 6. This means it will try to get the seventh element in the array, but there is no seventh element. In C/C++, this would be a buffer overflow: the program would read whatever is next in memory. If that memory is outside the program's address space, you get a segmentation fault, but if it's part of the program, it will print whatever was at that position, potentially leaking information. Of course, that is not possible in Rust, since Rust is a memory-safe language, so what will happen?

Well, the answer is surprising—it will panic the program, unwind the stack (call all the destructors of all the variables in the stack), and exit the program safely without trying to access invalid memory. Depending on your perspective, you might think great, I will no longer have buffer overflows, or you might think oh my God, the whole server will go down to prevent a buffer overflow. Well, the second can be mitigated by stopping the panic and recovering the proper server state, already implemented in most frameworks, so it's mostly a win-win.
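
As a minimal sketch of that kind of mitigation (just an illustration, not how any particular framework implements it), the standard library's std::panic::catch_unwind() can stop an unwinding panic at a boundary and turn it into a recoverable error:

    use std::panic;

    fn print_at(arr: &[char], index: usize) {
        println!("{}", arr[index]);
    }

    let arr = ['a', 'b', 'c', 'd', 'e', 'f'];

    // Index 6 is out of bounds: the indexing panics, but catch_unwind()
    // stops the unwinding here instead of taking the whole program down.
    let result = panic::catch_unwind(|| print_at(&arr, 6));

    if result.is_err() {
        println!("recovered from an out-of-bounds access");
    }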

But is it really a win-win? How does Rust know whether the index is out of bounds? In this simple example, the compiler could know that the arr variable has only six elements, so trying to access the seventh would violate memory constraints. But what about this more complex program:

    fn print_request(req: Request) {
      for i in 0..req.content_length {
        println!("{}", req.data[i]);
      }
    }

Here I'm receiving an HTTP request (very naively represented) that has at least a content_length attribute and a data attribute. The first should contain the length of the data field, in bytes, while the second is a vector of bytes. Let's suppose we don't have the len() function on that data field, and that we trust the content_length attribute. What if somebody were to send us an invalid request with a content_length bigger than the actual length of the content? The compiler cannot know this in advance, because the request originates at runtime from a TCP connection, but again, Rust must always be memory-safe (unless we are working in an unsafe scope, which we are not).

Well, what happens is that the index operation has two parts. First, it checks the bounds of the slice; if the index is fine, it returns the element, and if not, it panics. And yes, it does this for every indexing operation. So in this case, if the request is a valid request with a supposed 1 million bytes (1 MB), it will compare the index to the length of the vector 1 million times. That means at least 2 million extra instructions (a comparison and a branch for each access), which makes it much less efficient than the equivalent C/C++ code.
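
To make those two parts explicit, here is a rough sketch of what every checked index does under the hood. This is only an illustration, not the actual standard library implementation (it uses get_unchecked(), which we will meet shortly):

    fn checked_index(data: &[u8], i: usize) -> u8 {
        if i < data.len() {
            // We just checked the bounds, so the raw read cannot go out of the slice.
            unsafe { *data.get_unchecked(i) }
        } else {
            panic!("index out of bounds: the len is {} but the index is {}",
                   data.len(), i)
        }
    }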

Using iterators

There is a way around this, though, that gives the same effect as the C/C++ code: using iterators. The previous code can be converted into the following:

    let arr = ['a', 'b', 'c', 'd', 'e', 'f'];

    for c in &arr {
      println!("{}", c);
    }

This will compile roughly to the same machine code as the C/C++ variant since it won't check the bounds of the slice more than once, and it will then use the same pointer arithmetic. This is great when iterating through a slice, but in the case of a direct lookup, it can be an issue. Suppose we will receive thousands of 100-element slices, and we are supposed to get the last element of each and print it. In this case, iterating through all 100 elements of each array just to get the last one is a bad idea, as it would be more efficient to bounds check just the last element. There are a couple of ways of doing this.

The first one is straightforward:

    for arr in array_of_arrays {
      let last_index = arr.len() - 1;
      println!("{}", arr[last_index]);
    }

In this concrete case, where we want to get the last element, we can do something like this:

    for arr in array_of_arrays {
      if let Some(elt) = arr.iter().rev().next() {
        println!("{}", elt);
      }
    }

This will reverse the iterator with the call to rev() and then get the next element (the last one). If it exists, it will print it. But if we have to get a number that is not close to the end or to the beginning of the slice, the best way is to use the get() method:

    for arr in array_of_arrays {
      if let Some(elt) = arr.get(125) {
        println!("{}", elt);
      }
    }

This last one performs two checks, though. It will first check whether the index is in bounds, to return either Some(elt) or None, and then the if let checks whether the returned value is Some or None. If we know for sure, and I mean 100% sure, that the index is always inside the slice, we can use get_unchecked() to get the element. This is the exact equivalent of the C/C++ indexing operation, so it will not do bounds checking, allowing for better performance, but it is unsafe to use. In the HTTP example before, an attacker would be able to get whatever was stored at that index, even if it was a memory address outside the slice. You will need to use an unsafe scope, of course:

    for arr in array_of_arrays {
      println!("{}", unsafe { arr.get_unchecked(125) });
    }

The get_unchecked() function will always return something, even if it has to read invalid memory (which is undefined behavior, typically ending in a segmentation fault), so there is no need to check for Some or None. Remember also that on such a crash the program will not panic and no destructors will be called. It should only be used if a safe alternative cannot meet the performance requirements and the bounds of the slice are known in advance.

In most cases, you will want to use an iterator. Iterators allow for precise iteration over elements: filtering them, skipping some, taking a maximum number of them, and finally collecting them into a collection. They can even be extended or joined with other iterators to allow for any kind of solution. Everything is managed by the std::iter::Iterator trait. We will now go through the most-used methods of the trait, and I leave the rest for you to research in the standard library documentation.

It's important to properly use and understand iterators since they will be very useful for doing really fast loops. Iterators are cost-free abstractions that work the same way as indexing, but will not require bounds checking, making them ideal for efficiency improvements.

Iterator adaptors

Let's start with the simplest method, the one all the others build upon: the next() method. This function returns either the next element in the iteration or None once the iterator has been consumed. It can be used to get the next element manually, or to emulate a for loop using a while, for example:

let arr = [10u8, 14, 5, 76, 84];
let mut iter = arr.iter();

while let Some(elm) = iter.next() {
    println!("{}", elm);
}

That would be the same as this:

let arr = [10u8, 14, 5, 76, 84];

for elm in &arr {
    println!("{}", elm);
}

Note

Note the & before the array variable in the for. This is because the basic array type does not implement the Iterator trait, but a reference to the array is a slice, and slices implement the IntoIterator trait, which makes it usable as an iterator.
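
Equivalently, you can call iter() on the array explicitly, which yields the same iterator of references:

let arr = [10u8, 14, 5, 76, 84];

// arr.iter() and (&arr).into_iter() produce the same iterator of &u8 items.
for elm in arr.iter() {
    println!("{}", elm);
}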

The next two methods you should know about are skip() and take(). These make it easy to get only the elements you want from an ordered iterator. For example, let's say we want to take from the third to the tenth element of an iterator of unknown length (at least 10 elements). The best approach is to skip the first two, take the next eight, and collect them in a vector. Note that the iterator will not run until you call the collect() method or use it in a loop; those are the moments at which the next() method actually gets executed:

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let collection: Vec<_> = arr.iter().cloned().skip(2).take(8).collect();

for elm in collection {
    println!("{}", elm);
}

This will start iterating through the array, cloning each element first. That's because, by default, an iterator yields references to the elements, and in the case of u8 it's better to copy them than to reference them, as we will see at the end of the chapter. The skip() method will call next() twice and discard what it returns; after that, it simply passes each element along. The take() method will return elements until it has been called eight times, and then it will return None. The collect() method will create an empty vector, push elements into it while next() returns Some, and then return the vector.

Note

Note that the collect() method requires a type hint, as it can return any kind of collection—actually, any type that implements the FromIterator trait. We simply tell it that it will be a standard library Vec, and we let the compiler infer the type of element the vector will hold.
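
As a minimal sketch of that flexibility, the same chain could be collected into any other FromIterator implementor, such as a HashSet, where duplicate values would be dropped:

use std::collections::HashSet;

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

// Same chain as before, but collected into a HashSet instead of a Vec.
let collection: HashSet<u8> = arr.iter().cloned().skip(2).take(8).collect();

println!("{:?}", collection);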

There are also a couple of functions that are a generalization of the previous ones, skip_while() and take_while(). These two will skip or take elements, respectively, while the closure they run returns true. Let's see an example:

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let collection: Vec<_> = arr.iter()
    .cloned()
    .skip_while(|&elm| elm < 25)
    .take_while(|&elm| elm <= 100)
    .collect();

for elm in collection {
    println!("{}", elm);
}

In this case, the skip_while() method will run next() until it finds an element greater than or equal to 25; here, that is the fourth element (index 3), the number 76. The take_while() method then starts calling next() and returning elements while they are less than or equal to 100. When it finds 143, it returns None. The collect() method will then gather all those elements, from 76 to 100, both included, into a vector and return it. Note that the 23 between them is also added to the final result: even though it's lower than 25, once skip_while() stops skipping, it never skips again.

To fine-tune the filtering of the elements in the iteration, some other very interesting methods are filter() and its companion, map(). The first lets you filter the elements of an iterator based on a closure, while the second lets you map each element to a different one. Let's explore this with a simple chain that keeps only the elements at odd indices of the iterator and collects them into a vector:

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let collection: Vec<_> = arr.iter()
    .enumerate()
    .filter(|&(i, _)| i % 2 != 0)
    .map(|(_, elm)| elm)
    .collect();

for elm in collection {
    println!("{}", elm);
}

In this case, we enumerate the iterator by calling enumerate(). That will yield a tuple with the index and the element for each next() call. This then gets filtered by checking the index: if the index is odd, the tuple is passed along; if it's not, the filter calls next() again. Since the filter still yields the tuple, the map() function then takes only the element, discards the index, and returns it.

The filter and map functions can be reduced by using the helpful filter_map() function, which combines the two of them:

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let collection: Vec<_> = arr.iter()
    .enumerate()
    .filter_map(|(i, elm)| if i % 2 != 0 { Some(elm) } else { None })
    .collect();

for elm in collection {
    println!("{}", elm);
}

The filter_map() adaptor expects a closure that returns Some(element) when it should yield the element, and None when it should skip it and call next() again. This avoids some extra code. In this concrete case, you could also use the step_by() method, which returns only one element out of every n. Note that step_by(2) on its own starts at the first element and therefore yields the even indices; combined with skip(1), it has the same effect as the filter above.
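
A minimal sketch of that alternative, assuming the same array:

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

// skip(1) drops the element at index 0, and step_by(2) then yields every
// second element, that is, the elements at the odd indices.
let collection: Vec<_> = arr.iter().skip(1).step_by(2).collect();

for elm in collection {
    println!("{}", elm);
}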

When doing calculations with iterators, instead of using a for loop, we can use the great fold() method. This holds an accumulator variable between calls to next() that you can update. That way, you can sum, multiply, or perform any other cumulative operation over the iterator. Let's, for example, compute the sum of all the elements of the iterator:

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let sum = arr.iter().fold(0u32, |acc, elm| acc + elm);
println!("{}", sum);

This will print 985, without needing a loop. Of course, this will be implemented with a loop under the hood, but for the programmer, it's a zero-cost abstraction that helps a lot in terms of simplifying the code.

Real-life example

As a real-life example, here is the VSOP87 algorithm's variable function implemented with a fold() method. The VSOP87 algorithm is used to find planets and moons in the sky with really good accuracy, useful for simulators and telescope star finders, for example:

fn calculate_var(t: f64, var: &[(f64, f64, f64)]) -> f64 {
    var.iter()
       .fold(0_f64, |term, &(a, b, c)| term + a * (b + c * t).cos())
}

This is equivalent to this other code:

fn calculate_var(t: f64, var: &[(f64, f64, f64)]) -> f64 {
    let mut term = 0_f64;
    for &(a, b, c) in var {
        term += a * (b + c * t).cos();
    }
    term
}

In C/C++, this would probably also require a structure to hold the tuple. Five lines reduced to one, and, as we discussed, with no extra cost: both versions compile to the same machine code.

Specialized adaptors

In the case of a summation or a multiplication, there are specialized methods: sum() and product(). These do the same as a fold() that adds all the numbers in an iterator or multiplies all its items, respectively. The example we saw before can be reduced to this:

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let sum: u32 = arr.iter().sum();
println!("{}", sum);

Type annotations are required for now, but the code looks much simpler. You can also use the product() function in the same way, and it will be equivalent to this code:

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let prod = arr.iter().fold(1u32, |acc, elm| acc * elm);
println!("{}", prod);

Interaction between adaptors

There are also some functions to control how the iterators interact with other iterators or even themselves. For example, the cycle() function will make the iterator start again from the beginning once it gets to the end of the iterator. This is useful to create an infinite loop with an iterator. There are also a couple of functions that help you deal with multiple iterators at the same time. Let's suppose that you have two slices of the same length and want to generate a new vector with that same length, but with each element being the sum of the elements with the same index in the slices:

let arr1 = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];
let arr2 = [25u32, 12, 73, 2, 98, 122, 213, 22, 39, 300, 144, 163, 127, 3, 56];

let collection: Vec<_> = arr1.iter()
    .zip(arr2.iter())
    .map(|(elm1, elm2)| elm1 + elm2)
    .collect();
println!("{:?}", collection);

In this case, we have used the zip() function, which yields a tuple containing the next element of each iterator. We can also chain iterators with the chain() function, which generates a new iterator that, once the first one starts yielding None, starts yielding elements from the second iterator. There are many more iteration functions, but we will leave the standard library here for now and focus on external crates.
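
Before we do, here is a minimal sketch of chain() and cycle(), since neither appears in a snippet above (using two small arrays for brevity):

let arr1 = [10u32, 14, 5];
let arr2 = [25u32, 12, 73];

// chain() yields all of arr1's elements and then all of arr2's.
let chained: Vec<_> = arr1.iter().chain(arr2.iter()).collect();
println!("{:?}", chained); // [10, 14, 5, 25, 12, 73]

// cycle() repeats the iterator forever, so it must be bounded, here with take().
let cycled: Vec<_> = arr1.iter().cycle().take(7).collect();
println!("{:?}", cycled); // [10, 14, 5, 10, 14, 5, 10]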

Itertools

There is one external crate that can make working with iterators much easier and give you superpowers. Remember the idea that these iterators allow you to perform the same operations you would do in C with indexing, but with complete memory safety and zero-cost abstractions? They also make the code much easier to understand. In terms of iterator capabilities, the most important crate is the itertools crate. This crate provides a new trait, the Itertools trait, which gives iterators many new methods and functions that make the life of the developer much easier, while staying true to Rust's core value of performance thanks to zero-cost abstractions. You can add it to your project by adding it to the [dependencies] section of your Cargo.toml file.

Let's explore some of its iterator adapters. We start with a simple one that helps us create batches or chunks of the given iterator, the batching() function. Let's say that we want to use an iterator over one of the previous arrays and we want to make it return elements in groups of three. It's as simple as using that method and creating a closure that directly calls the next() method and returns the required tuple:

// Remember
extern crate itertools;
use itertools::Itertools;

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

for tuple in arr.iter().batching(|it| match it.next() {
    None => None,
    Some(x) => {
        match it.next() {
            None => None,
            Some(y) => {
                match it.next() {
                    None => None,
                    Some(z) => Some((x, y, z)),
                }
            }
        }
    }
})
{
    println!("{:?}", tuple);
}

This will print the array in groups of three elements, in order:

(10, 14, 5)
(76, 84, 35)
(23, 94, 100)
(143, 23, 200)
(12, 94, 72)

A similar operation can be accomplished by using the chunks() function. We can say that the batching() adaptor is a generalization of the chunks() adaptor, since it lets you define the internal grouping logic yourself. In the case of chunks(), it only receives the number of elements per chunk as a parameter, and it returns slices to those chunks.
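
Here is a minimal sketch of that behavior using the slice chunks() method from the standard library, assuming the same array as in the previous examples:

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

// chunks(3) yields &[u32] slices of three elements each; the last chunk
// would be shorter if the length were not a multiple of three.
for chunk in arr.chunks(3) {
    println!("{:?}", chunk);
}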

A really similar operation will be performed with the tuples() method. As you can see, the batching() method is a complete generalization in terms of how you create batches or chunks of an iterator. Let's see the same example we saw previously using the tuples() method:

// Remember
extern crate itertools;
use itertools::Itertools;

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

for tuple in arr.iter().tuples::<(_, _, _)>() {
    println!("{:?}", tuple);
}

Much less boilerplate code, right? In this case, we are required to specify the number of elements in a tuple, but if we used type inference in the for, we could avoid it:

// Remember
extern crate itertools;
use itertools::Itertools;

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

for (a, b, c) in arr.iter().tuples() {
    println!("({}, {}, {})", a, b, c);
}

Of course, in this case, we would be pattern-assigning the variables. There is also another interesting function that creates the cartesian product of two iterators. Unsurprisingly, it's called cartesian_product(), and it creates a new iterator with all possible combinations of the previous two:

// Remember
extern crate itertools;
use itertools::Itertools;

let arr1 = [10u32, 14, 5];
let arr2 = [192u32, 73, 44];

for row in arr1.iter().cartesian_product(arr2.iter()) {
    print!("{:?}, ", row);
}

This will print the following:

(10, 192), (10, 73), (10, 44), (14, 192), (14, 73), (14, 44), (5, 192), (5, 73), (5, 44),

There are many other methods in the Itertools trait, and I invite you to check the official documentation, since it's very detailed and has many examples. For now, these common methods should help you do any operation you need to perform with slices in a much more efficient way.

Borrowing degradations

Iterations are not the only place where translation degradations occur. There are also a couple of extra points where you can sometimes see that the same code performs much worse in Rust than in C/C++. One of these points is reference handling. Due to borrow checker constraints, you can do three things with variables when passing them to a function: send a reference (borrow), give the new function control of the variable (own), or copy/clone the variable to send it to a function. It seems easy to decide, right? If you do not require the variable anymore, let the function own your variable. If you require it, send the reference, and if you require it and the API only accepts ownership, clone it.

Well, it actually isn't so simple. For example, integers are faster to copy than to reference, and so are small structures. The rule of thumb is: if the type is smaller than or equal to the size of usize, always copy. If it's somewhere between that size and about 10 times it, it's probably better to copy. If it's bigger, it's probably better to reference. And if the structure owns a heap allocation (such as a Vec or a Box), it's usually better to send a reference.
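
As an illustration of that rule of thumb (the types and functions here are made up for the example):

// A small, Copy type fits in a couple of registers, so passing it by value
// is cheaper than passing a pointer and dereferencing it.
#[derive(Clone, Copy)]
struct Point {
    x: i32,
    y: i32,
}

fn distance_sq(a: Point, b: Point) -> i64 {
    let dx = (a.x - b.x) as i64;
    let dy = (a.y - b.y) as i64;
    dx * dx + dy * dy
}

// A Vec owns a heap allocation, so we borrow it instead of cloning it.
fn checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, &b| acc.wrapping_add(u32::from(b)))
}

let a = Point { x: 1, y: 2 };
let b = Point { x: 4, y: 6 };
println!("{}", distance_sq(a, b)); // a and b were copied and are still usable

let buffer = vec![10u8, 20, 30];
println!("{}", checksum(&buffer)); // buffer is only borrowed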

There are some cases, though, when you cannot decide what happens to the variable. In a macro, for example, the variable is passed as is, and the macro decides what to do with it. For example, the println! macro gets all elements by reference, since it does not require more. The problem is that if you are trying to print an integer, for example, a bottleneck appears. That's what happened some time ago to Robert Grosse, and he wrote an article about it.

Long story short, he had to force the copying of the integer. How did he do that? Well, it's as simple as creating a scope that will return that integer. Since integers implement Copy, the integer will be copied to the scope and then returned, effectively copying it to the macro:

let my_int = 76_u32;
println!("{}", {my_int});

For normal prints, this is not usually necessary, but if you need to quickly print thousands or millions of integers, you will not be able to avoid the I/O overhead, but you can at least avoid this bottleneck.

Cyclomatic complexity

Another possible bottleneck is the cyclomatic complexity of functions. While not directly related to translating code from other languages, it's true that Rust can sometimes increase the cyclomatic complexity of the code, since it forces you to check optional (nullable) results and encourages complex iterator chains, functional programming, and so on. This is great for code safety, but sometimes the compiler has trouble properly optimizing the code we write.

The only way to avoid this is to separate the code into smaller units, which helps the compiler optimize unit by unit. One way of doing that is by creating smaller functions, with no more than 20–25 branches each. A branch is a place where, depending on a variable, the program will run one piece of code or another. The simplest branch is a conditional, an if. There are many others, such as loops (especially when the loop contains returns) or the ? operator, which creates two branches, one for each variant of the result: one returns early from the function, while the other assigns the value to the variable.
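
As a rough sketch of why the ? operator counts as two branches (the file name here is just for the example, and the real desugaring also converts the error type through the From trait):

use std::fs::File;
use std::io;

// With the ? operator: one visible line, but two branches.
fn open_config() -> Result<File, io::Error> {
    let file = File::open("config.toml")?;
    Ok(file)
}

// Roughly what it expands to: a match with one branch per Result variant.
fn open_config_expanded() -> Result<File, io::Error> {
    let file = match File::open("config.toml") {
        Ok(f) => f,              // one branch assigns the value
        Err(e) => return Err(e), // the other returns early from the function
    };
    Ok(file)
}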

Nested loops and conditionals make the number of branches grow even larger, and the branches themselves become more and more complex, so you should try to split deeply nested conditionals into new functions; it's even considered good practice. As you will see in the Tools section, there are tools that will help you find these bottlenecks.