Book Image

Protocol Buffers Handbook

By : Clément Jean
Book Image

Protocol Buffers Handbook

By: Clément Jean

Overview of this book

Explore how Protocol Buffers (Protobuf) serialize structured data and provides a language-neutral, platform-neutral, and extensible solution. With this guide to mastering Protobuf, you'll build your skills to effectively serialize, transmit, and manage data across diverse platforms and languages. This book will help you enter the world of Protocol Buffers by unraveling the intricate nuances of Protobuf syntax and showing you how to define complex data structures. As you progress, you’ll learn schema evolution, ensuring seamless compatibility as your projects evolve. The book also covers advanced topics such as custom options and plugins, allowing you to tailor validation processes to your specific requirements. You’ll understand how to automate project builds using cutting-edge tools such as Buf and Bazel, streamlining your development workflow. With hands-on projects in Go and Python programming, you’ll learn how to practically apply Protobuf concepts. Later chapters will show you how to integrate data interchange capabilities across different programming languages, enabling efficient collaboration and system interoperability. By the end of this book, you’ll have a solid understanding of Protobuf internals, enabling you to discern when and how to use and redefine your approach to data serialization.
Table of Contents (13 chapters)

What about Protobuf?

So far, we have talked only a little bit about Protobuf. In this section, we are going to start explaining the advantages and disadvantages of Protobuf, and we are going to base the analysis on the criteria we saw in the previous section.

Serialized data size

Due to a lot of optimization and its binary format, Protobuf is one of the best formats in the area. This is especially true for serializing numbers. Let us see why this is the case.

The main reason why Protobuf is good at serializing data in a small number of bytes is that it is a binary format. In Protobuf, additional structural elements such as curly braces, colons, and brackets, typically found in formats such as JSON, XML, and YAML, are absent. Instead, Protobuf data is represented simply as raw bytes.

On top of that, we have optimization such as bitpacking and the use of varints, which help a lot. It makes serializing the same data to a smaller number of bytes easier than it would be if we naively serialized the binary representation of the data as it is in memory.

Bitpacking is a method that compresses data into as few bits as possible. Notice that we are talking about bits and not bytes here. You can think of bitpacking as trying to put multiple pieces of data in the same number of bytes. Here is an example of bitpacking in Protobuf:

0a  -> 0101 00000
    -> 010 is 2 (length-delimited type like string)
    -> 1 is the field tag

Do not worry too much about what this data is representing just yet. We are going to see that in Chapter 5, Serialization, when we talk about the internals of serialization. The most important thing to understand is that we have one byte containing two pieces of metadata. At scale, this reduces the payload size quite a lot.

Most of the integers in Protobuf are varints. Varints is short for variable-size integers and they map integers to different numbers of bytes. How this works is that the smaller values will get mapped to a smaller number of bytes and the bigger ones will get translated to a larger number of bytes. Let us see an example (decimal -> byte(s) in hexadecimal):

1 -> 01
128 -> 80 01
16,384 -> 80 80 01
...

As you can see, compared to fixed-size integers (four or eight bytes), it saves space in a lot of cases. However, this also can result in non-efficient compression of the data. We can also store a 32-bit number (normally 4 bytes) in 5 bytes. Here is an example of such a case:

2,147,483,647 (max 32-bit number value) -> ff ff ff ff 07

We have five bytes instead of four if we serialize with a fixed-size integer.

While we did not get into too much detail about the internals of serialization that make the serialized data size smaller, we learned about the two main optimizations that are used in Protobuf. We saw that it uses bitpacking to compress data into smaller amounts of bits. This is what Protobuf does to limit the impact of metadata on serialized data size. Then, we saw that Protobuf also uses varints. They map smaller integers to smaller numbers of bytes. This is how most integers are serialized in Protobuf.

Availability of data

Deserializing data in Protobuf is fast because we are parsing binary instead of text. Now, this is hard to demonstrate because it is highly dependent on the programming language you are using and the data you have. However, a study conducted by Auth0 (https://auth0.com/blog/beating-json-performance-with-protobuf/) showed that Protobuf, in their use case, has better availability than JSON.

What is interesting in this study is that they found that Protobuf messages were available in their JavaScript code in 4% less time than JSON. Now, this might seem like a marginal improvement, but you have to consider the fact that JSON is JavaScript object literal format. This means that by deserializing Protobuf in JavaScript, we have to convert binary to JSON in order to access the data in our code. And even with that conversion overhead, Protobuf manages to be faster than parsing JSON directly.

This shows that, in JavaScript (JS), even with the overhead of having to transform binary to JSON, Protobuf performs better than JSON itself.

Be aware that, for every implementation, you will have different numbers. Some are more optimized than others. However, a lot of implementations are just wrappers around the C implementation, which means you will get quasi-consistent deserialization.

Finally, it is important to mention that serialization and deserialization speeds depend a lot on the underlying data. Some data formats perform better on certain kinds of data. So, if you are considering using any data schema, I urge you to do some benchmarking first. It will prevent you from being surprised.

Readability of data

As you might expect, readability of data is not where Protobuf shines. As the serialized data is in binary, it requires more effort or tools for humans to be able to understand it. Let’s take a look at the following hexadecimal:

0a 07 43 6c 65 6d 65 6e 74 10 64 1a 06 0a 04 4d 61 72 6b 1a 06 0a 04 4a 6f 68 6e

We have no way to guess that the preceding hexadecimal represents a Person object.

Now, while the default way of serializing data is not human-readable, Protobuf provides libraries to serialize data into JSON or the Protobuf text format. We can take the previous data that we had in hexadecimal and get text output similar to that in the following (Protobuf text format) code block:

name: "Clément"
age: 100
friends: {
  name: "Mark"
}
friends: {
  name: "John"
}

Furthermore, this text format can also be used to create binary. This means that we could have a configuration file written in text and still use the data with Protobuf.

So, as we saw, Protobuf serialized data is by default not human-readable. It is binary and it would take some extra effort or more tools to read. However, since Protobuf provides a way to serialize data to its own text format or to JSON, we can still have human-readable serialized data. This is nice for configuration files, test data, and so on.

Type safety

An advantage that schema-backed data formats have is that the schemas are generally defined with types in mind. Protobuf can define objects called messages that contain fields that are themselves typed. We saw a few examples previously in this chapter. This means that we will be able to check at compile time that the values provided for certain fields are correct.

On top of that, since Protobuf has an extensive set of types, there is the possibility to optimize for specific use cases. An example of this is the type for the age field that we saw in Person. In JSON Schema, for example, we would have a type called integer; however, this does not ensure that we do not provide negative numbers. In Protobuf, we could use uint32 (unsigned 32-bit integer) or a uint64 (usigned 64-bit integer) and thus we would, at compile time, ensure that we do not get negative numbers.

Finally, Protobuf also provides the possibility to nest types and reference these types by their fully qualified name. This basically means that the types have scopes. Let us take an example to make things clearer. Let us say that we have the following definitions (note that this is simplified):

message PhoneNumber {
  enum Type {
    Mobile,
    Work,
    Fax
  }
  Type type;
  string number;
}
message Email {
  enum Type {
    Personal,
    Work
  }
  Type type;
  string email;
}

As we can see, we defined two enums called Type. It is necessary because PhoneNumber supports Fax and Mobile but Email does not. Now, here when we are dealing with the Type enum in PhoneNumber, we are going to refer to it as PhoneNumber.Type, and when we deal with the one in Email, we will refer to it as Email.Type. These names are the fully qualified names of the types and they differentiate the two types, which have the same name, by providing a scope.

Now, let us think about what would happen if we had the following definitions:

enum Type {
  Mobile,
  Work,
  Fax,
  Personal
}
message PhoneNumber {
  Type type;
  string number;
}
message Email {
  Type type;
  string email;
}

We could still create valid Email instances and PhoneNumber instances because we have all the types in the Type enum. However, in this case, we could also create invalid Emails and PhoneNumbers. For example, it is possible to create an Email type with the Mobile type. This means that by providing scoping for types, we can have type safety without having to create two enums:

enum PhoneType { /*...*/ }
enum EmailType { /*...*/ }

This is verbose and unnecessary since Protobuf lets us create types that will be called Phone.Type and Email.Type.

To summarize, Protobuf lets us use types in a very explicit and safe way. We have an extensive set of types that we can use and that will let us ensure that our data is correct at compile time. Finally, we saw that by providing nested types referenced by their fully qualified names, we can differentiate types with the same names and ensure that only certain correct values get set to fields.

Readability of schema

Finally, Protobuf has readable schemas because it was designed as self-documenting and as a language. We can find a lot of concepts that we are already familiar with such as type, comments, and imports. All of this makes onboarding new developers and using Protobuf-backed APIs easy.

The first thing worth mentioning is that we can have comments in our schemas to describe what the fields and types are doing. One of the best examples of that is in the proto files provided in Protobuf itself. Let’s look at duration.proto (simplified for brevity) in the following code block:

// A Duration represents a signed, fixed-length span of time represented
// as a count of seconds and fractions of seconds at nanosecond
// resolution. It is independent of any calendar and concepts like "day" or "month".
message Duration {
  // Signed seconds of the span of time. Must be from -315,576,000,000
  // to +315,576,000,000 inclusive. Note: these bounds are computed from:
  // 60 sec/min * 60 min/hr * 24 hr/day * 365.25 days/year * 10000 
  years
  int64 seconds;
  // Signed fractions of a second at nanosecond resolution of the span
  // of time. Durations less than one second are represented with a 0
  // `seconds` field and a positive or negative `nanos` field. For 
  durations
  // of one second or more, a non-zero value for the `nanos` field 
  must be
  // of the same sign as the `seconds` field. Must be from 
  -999,999,999
  // to +999,999,999 inclusive.
  int32 nanos;
}

We can see that we have a comment explaining the purpose of Duration and two others explaining what the fields represent, what their possible values are, and so on. This is important because it helps users use the types properly and it can let us make assumptions.

Next, we can import some other schema files to reuse types. This helps a lot, especially when the project becomes bigger. Instead of having to duplicate code or write everything in one file, we can separate by concern and reuse definitions in multiple places. Furthermore, since imports use the path of the schema file to access the definition in it, we can look at the definition. Now, this requires knowing a little bit more about the compiler and its options, but other than that, we can access the code and start poking around.

Finally, as we mentioned earlier, Protobuf has explicit typing. By looking at a field, we know exactly what kind of value we need to set and what the boundaries are (if any). Furthermore, after learning a little bit about the internals of serialization in Protobuf, you will be able to optimize the types that you use in a granular way.