Book Image

Mastering Distributed Tracing

By : Yuri Shkuro
Book Image

Mastering Distributed Tracing

By: Yuri Shkuro

Overview of this book

Mastering Distributed Tracing will equip you to operate and enhance your own tracing infrastructure. Through practical exercises and code examples, you will learn how end-to-end tracing can be used as a powerful application performance management and comprehension tool. The rise of Internet-scale companies, like Google and Amazon, ushered in a new era of distributed systems operating on thousands of nodes across multiple data centers. Microservices increased that complexity, often exponentially. It is harder to debug these systems, track down failures, detect bottlenecks, or even simply understand what is going on. Distributed tracing focuses on solving these problems for complex distributed systems. Today, tracing standards have developed and we have much faster systems, making instrumentation less intrusive and data more valuable. Yuri Shkuro, the creator of Jaeger, a popular open-source distributed tracing system, delivers end-to-end coverage of the field in Mastering Distributed Tracing. Review the history and theoretical foundations of tracing; solve the data gathering problem through code instrumentation, with open standards like OpenTracing, W3C Trace Context, and OpenCensus; and discuss the benefits and applications of a distributed tracing infrastructure for understanding, and profiling, complex systems.
Table of Contents (21 chapters)
Mastering Distributed Tracing
Contributors
Preface
Other Books You May Enjoy
Leave a review - let other readers know what you think
15
Afterword
Index

What is observability?


The term "observability" in control theory states that the system is observable if the internal states of the system and, accordingly, its behavior, can be determined by only looking at its inputs and outputs. At the 2018 Observability Practitioners Summit [4], Bryan Cantrill, the CTO of Joyent and one of the creators of the tool dtrace, argued that this definition is not practical to apply to software systems because they are so complex that we can never know their complete internal state, and therefore the control theory's binary measure of observability is always zero (I highly recommend watching his talk on YouTube: https://youtu.be/U4E0QxzswQc). Instead, a more useful definition of observability for a software system is its "capability to allow a human to ask and answer questions". The more questions we can ask and answer about the system, the more observable it is.

Figure 1.2: The Twitter debate

There are also many debates and Twitter zingers about the difference between monitoring and observability. Traditionally, the term monitoring was used to describe metrics collection and alerting. Sometimes it is used more generally to include other tools, such as "using distributed tracing to monitor distributed transactions." The definition by Oxford dictionaries of the verb "monitor" is "to observe and check the progress or quality of (something) over a period of time; keep under systematic review." However, it is better thought of as the process of observing certain a priori defined performance indicators of our software system, such as those measuring an impact on the end user experience, like latency or error counts, and using their values to alert us when these signals indicate an abnormal behavior of the system. Metrics, logs, and traces can all be used as a means to extract those signals from the application. We can then reserve the term "observability" for situations when we have a human operator proactively asking questions that were not predefined. As Brian Cantrill put it in his talk, this process is debugging, and we need to "use our brains when debugging." Monitoring does not require a human operator; it can and should be fully automated.

"If you want to talk about (metrics, logs, and traces) as pillars of observability–great.

The human is the foundation of observability!"

-- Brian Cantrill

In the end, the so-called "three pillars of observability" (metrics, logs, and traces) are just tools, or more precisely, different ways of extracting sensor data from the applications. Even with metrics, the modern time series solutions like Prometheus, InfluxDB, or Uber's M3 are capable of capturing the time series with many labels, such as which host emitted a particular value of a counter. Not all labels may be useful for monitoring, since a single misbehaving service instance in a cluster of thousands does not warrant an alert that wakes up an engineer. But when we are investigating an outage and trying to narrow down the scope of the problem, the labels can be very useful as observability signals.