Mastering Distributed Tracing

By: Yuri Shkuro

Overview of this book

Mastering Distributed Tracing will equip you to operate and enhance your own tracing infrastructure. Through practical exercises and code examples, you will learn how end-to-end tracing can be used as a powerful application performance management and comprehension tool. The rise of Internet-scale companies, like Google and Amazon, ushered in a new era of distributed systems operating on thousands of nodes across multiple data centers. Microservices increased that complexity, often exponentially. It is harder to debug these systems, track down failures, detect bottlenecks, or even simply understand what is going on. Distributed tracing focuses on solving these problems for complex distributed systems. Today, tracing standards have developed and we have much faster systems, making instrumentation less intrusive and data more valuable. Yuri Shkuro, the creator of Jaeger, a popular open-source distributed tracing system, delivers end-to-end coverage of the field in Mastering Distributed Tracing. Review the history and theoretical foundations of tracing; solve the data gathering problem through code instrumentation, with open standards like OpenTracing, W3C Trace Context, and OpenCensus; and discuss the benefits and applications of a distributed tracing infrastructure for understanding, and profiling, complex systems.

My experience with tracing


My first experience with distributed tracing was somewhere around 2010, even though we did not use that term at the time. I was working on a trade capture and trade processing system at Morgan Stanley. It was built as a service-oriented architecture (SOA), and the whole system contained more than a dozen different components deployed as independent Java applications. The system was used for over-the-counter interest rate derivatives products (like swaps and options), which had high complexity but not a huge trading volume, so most of the system components were deployed as a single instance, with the exception of the stateless pricers that were deployed as a cluster.

One of the observability challenges with the system was that each trade had to go through a complicated sequence of additional changes, matching, and confirmation flows, implemented by the different components of the system.

To give us visibility into the various state transitions of the individual trades, we used an APM vendor (now defunct) that was essentially implementing a distributed tracing platform. Unfortunately, our experience with that technology was not particularly stellar. The main challenge was the difficulty of instrumenting our applications for tracing, which involved writing aspect-oriented programming (AOP)-style instructions in XML files and trying to match on the signatures of internal APIs. The approach was very fragile, as changes to the internal APIs would cause the instrumentation to become ineffective, without good facilities to enforce it via unit testing. Getting instrumentation into existing applications is one of the main difficulties in adopting distributed tracing, as we will discuss in this book.

When I joined Uber in mid-2015, the engineering team in New York had only a handful of engineers, and many of them were working on the metrics system that later became known as M3. At the time, Uber was just starting its journey towards breaking up the existing monolith and replacing it with microservices. The Python monolith, appropriately called "API", was already instrumented with a home-grown tracing-like system called Merckx.

The major shortcoming of Merckx was that it was designed for the days of a monolithic application and lacked any concept of distributed context propagation. It recorded SQL queries, Redis calls, and even calls to other services, but there was no way to go more than one level deep. It also stored the existing in-process context in global, thread-local storage, and when many new Python microservices at Uber began adopting Tornado, an event-loop-based framework, the propagation mechanism in Merckx was unable to represent the state of many concurrent requests running on the same thread. By the time I joined Uber, Merckx was in maintenance mode, with hardly anyone working on it, even though it had active users. Given the new observability theme of the New York engineering team, I, along with another engineer, Onwukike Ibe, took up the mantle of building a fully-fledged distributed tracing platform.
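To illustrate the problem, here is a minimal sketch (using asyncio rather than Tornado, and not Merckx's actual code, which was internal to Uber) of why thread-local context propagation breaks down once an event loop multiplexes concurrent requests onto a single thread:

```python
import asyncio
import threading

# A single per-thread slot for the "current request" context, the way a
# monolith-era, thread-per-request tracer might store it.
_context = threading.local()

async def handle_request(request_id: str) -> None:
    _context.request_id = request_id  # each coroutine overwrites the slot
    await asyncio.sleep(0.01)         # yield to the event loop (e.g. an I/O call)
    # By the time we resume, another request may have clobbered the context.
    print(f"handling {request_id}, context says {_context.request_id}")

async def main() -> None:
    # Two concurrent requests multiplexed onto the same thread.
    await asyncio.gather(handle_request("req-1"), handle_request("req-2"))

asyncio.run(main())
# Both lines print "context says req-2": thread-local storage cannot tell
# concurrent requests on one thread apart, which is why this style of
# propagation fails under event-loop frameworks like Tornado.
```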

I had no prior experience building such systems, but after reading Google's Dapper paper, it seemed straightforward enough. Plus, there was already an open source clone of Dapper, the Zipkin project, originally built by Twitter. Unfortunately, Zipkin did not work for us out of the box.

In 2014, Uber started building its own RPC framework called TChannel. It did not really become popular in the open source world, but when I was just getting started with tracing, many services at Uber were already using that framework for inter-process communication. The framework came with tracing instrumentation built in, natively supported in the binary protocol format. So, we already had traces being generated in production; only nothing was gathering and storing them.

I wrote a simple collector in Go that received traces generated by TChannel in a custom Thrift format and stored them in a Cassandra database, in the same format that the Zipkin project used. This allowed us to deploy the collectors alongside the Zipkin UI, and that's how Jaeger was born. You can read more about this in a post on the Uber Engineering blog [9].

Having a working tracing backend, however, was only half of the battle. Although TChannel was actively used by some of the newer services, many more existing services were using plain JSON over HTTP, utilizing many different HTTP frameworks in different programming languages. In some languages, such as Java, TChannel wasn't even available or mature enough. So, we needed to solve the same problem that made our tracing experiment at Morgan Stanley fizzle out: how to get tracing instrumentation into hundreds of existing services implemented with different technology stacks.

As luck would have it, I was attending one of the Zipkin Practitioners workshops organized by Adrian Cole from Pivotal, the lead maintainer of the Zipkin project, and that exact problem was on everyone's mind. Ben Sigelman, who had founded his own observability company, Lightstep, earlier that year, was at the workshop too, and he proposed creating a standardized tracing API that could be implemented by different tracing vendors independently and used to build completely vendor-neutral, open source, reusable tracing instrumentation for many existing frameworks and drivers. We brainstormed the initial design of the API, which later became the OpenTracing project [10] (more on that in Chapter 6, Tracing Standards and Ecosystem). All examples in this book use the OpenTracing APIs for instrumentation.
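To give a flavor of what vendor-neutral instrumentation looks like, here is a minimal sketch written purely against the OpenTracing Python API; the service and operation names are made up for illustration, and the Jaeger binding shown in the trailing comment is just one possible tracer implementation:

```python
import opentracing

def fetch_account(account_id):
    # The instrumentation depends only on the vendor-neutral OpenTracing API;
    # with no tracer registered, it falls back to OpenTracing's no-op tracer.
    tracer = opentracing.global_tracer()
    with tracer.start_active_span("fetch-account") as scope:
        scope.span.set_tag("account.id", account_id)
        # ... business logic and downstream calls would go here ...
        return {"id": account_id}

# At application startup, any OpenTracing-compatible tracer can be bound as
# the global tracer without touching the instrumentation above, for example
# with the jaeger_client package:
#
#   from jaeger_client import Config
#   tracer = Config(config={"sampler": {"type": "const", "param": 1}},
#                   service_name="account-service").initialize_tracer()
#   opentracing.set_global_tracer(tracer)

print(fetch_account("42"))
```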

The evolution of the OpenTracing APIs, which is still ongoing, is a topic for another story. Yet even the initial versions of OpenTracing gave us the peace of mind that if we started adopting it on a large scale at Uber, we were not going to lock ourselves into a single implementation. Having different vendors and open source projects participating in the development of OpenTracing was very encouraging. We implemented Jaeger-specific, fully OpenTracing-compatible tracing libraries in several languages (Java, Go, Python, and Node.js), and started rolling them out to Uber microservices. Last time I checked, we had close to 2,400 microservices instrumented with Jaeger.

I have been working in the area of distributed tracing ever since. The Jaeger project has grown and matured. Eventually, we replaced the Zipkin UI with Jaeger's own, more modern UI built with React, and in April 2017, we open sourced all of Jaeger, from client libraries to the backend components.

By supporting OpenTracing, we were able to rely on the ever-growing ecosystem of open source instrumentation hosted in the opentracing-contrib organization on GitHub [11], instead of writing our own the way some other projects have done. This freed the Jaeger developers to focus on building a best-in-class tracing backend with data analysis and visualization features. Many other tracing solutions have borrowed features first introduced in Jaeger, just as Jaeger borrowed its initial feature set from Zipkin.

In the fall of 2017, Jaeger was accepted as an incubating project into the CNCF, following in the footsteps of the OpenTracing project. Both projects are very active, with hundreds of contributors, and are used by many organizations around the world. The Chinese giant Alibaba even offers hosted Jaeger as part of its Alibaba Cloud services [12]. I probably spend 30-50% of my time at work collaborating with contributors to both projects, including reviewing pull requests and designing new features.