
Mastering Distributed Tracing

By: Yuri Shkuro

Overview of this book

Mastering Distributed Tracing will equip you to operate and enhance your own tracing infrastructure. Through practical exercises and code examples, you will learn how end-to-end tracing can be used as a powerful tool for application performance management and system comprehension.

The rise of Internet-scale companies, like Google and Amazon, ushered in a new era of distributed systems operating on thousands of nodes across multiple data centers. Microservices increased that complexity, often exponentially. These systems are harder to debug: tracking down failures, detecting bottlenecks, or even simply understanding what is going on becomes a challenge. Distributed tracing focuses on solving these problems for complex distributed systems. Today, tracing standards have matured and systems have become much faster, making instrumentation less intrusive and the collected data more valuable.

Yuri Shkuro, the creator of Jaeger, a popular open-source distributed tracing system, delivers end-to-end coverage of the field in Mastering Distributed Tracing. The book reviews the history and theoretical foundations of tracing; solves the data gathering problem through code instrumentation, with open standards like OpenTracing, W3C Trace Context, and OpenCensus; and discusses the benefits and applications of a distributed tracing infrastructure for understanding and profiling complex systems.

Tail-based consistent sampling


Clearly, head-based sampling has its benefits and its challenges. It is fairly simple to implement, yet far from simple to manage at scale. There is one other drawback of head-based sampling that we have not discussed yet: its inability to tune the sampling decision to the behavior of the system captured in the traces. Let's assume our metrics system tells us that the 99.9th percentile of the request latency in a service is very high. That means that, on average, only one in a thousand requests exhibits the anomalous behavior. If we are tracing with head-based sampling at a probability of 0.001, then any given request has only a one-in-a-million chance of being both anomalous and sampled, meaning we are very unlikely to capture a trace that might explain the anomaly.
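The arithmetic behind this one-in-a-million figure can be sketched in a few lines. The probabilities below come from the example in the text; the traffic volume is a purely hypothetical figure added for illustration:

```python
# Head-based sampling decides up front, independently of request latency,
# so the two events (request is anomalous, request is sampled) are independent.
p_anomalous = 1 / 1000  # request latency exceeds the 99.9th percentile
p_sampled = 1 / 1000    # head-based sampling probability from the text

# Probability that a given request is both anomalous AND sampled:
p_capture = p_anomalous * p_sampled
print(p_capture)  # ~1e-06, i.e. one in a million

# With a hypothetical volume of 10 million requests per day, the expected
# number of anomalous traces actually captured per day is still small:
requests_per_day = 10_000_000
print(requests_per_day * p_capture)  # ~10 useful traces per day
```

Note that tail-based sampling sidesteps this multiplication entirely: because the decision is deferred until the request has completed, an anomalously slow request can be sampled with probability close to 1 rather than 0.001.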

Although we could say that one in a million is not that low, given how much traffic goes through modern cloud-native applications, and we probably will capture some of those interesting traces, it also means that the remaining 999 traces out...