Mastering Distributed Tracing

By: Yuri Shkuro

Overview of this book

Mastering Distributed Tracing will equip you to operate and enhance your own tracing infrastructure. Through practical exercises and code examples, you will learn how end-to-end tracing can be used as a powerful application performance management and comprehension tool. The rise of Internet-scale companies, like Google and Amazon, ushered in a new era of distributed systems operating on thousands of nodes across multiple data centers. Microservices increased that complexity, often exponentially. It is harder to debug these systems, track down failures, detect bottlenecks, or even simply understand what is going on. Distributed tracing focuses on solving these problems for complex distributed systems. Today, tracing standards have developed and we have much faster systems, making instrumentation less intrusive and data more valuable. Yuri Shkuro, the creator of Jaeger, a popular open-source distributed tracing system, delivers end-to-end coverage of the field in Mastering Distributed Tracing. Review the history and theoretical foundations of tracing; solve the data gathering problem through code instrumentation, with open standards like OpenTracing, W3C Trace Context, and OpenCensus; and discuss the benefits and applications of a distributed tracing infrastructure for understanding, and profiling, complex systems.

What is observability?

In control theory, a system is said to be observable if its internal states and, accordingly, its behavior can be determined by looking only at its inputs and outputs. At the 2018 Observability Practitioners Summit [4], Bryan Cantrill, the CTO of Joyent and one of the creators of the DTrace tool, argued that this definition is not practical to apply to software systems: they are so complex that we can never know their complete internal state, and therefore control theory's binary measure of observability is always zero (I highly recommend watching his talk on YouTube: https://youtu.be/U4E0QxzswQc). Instead, a more useful definition of observability for a software system is its "capability to allow a human to ask and answer questions". The more questions we can ask and answer about the system, the more observable it is.

Figure 1.2: The Twitter debate

There are also many debates and Twitter zingers about the difference between monitoring and observability. Traditionally, the term monitoring was used to describe metrics collection and alerting. Sometimes it is used more generally to include other tools, as in "using distributed tracing to monitor distributed transactions." Oxford Dictionaries defines the verb "monitor" as "to observe and check the progress or quality of (something) over a period of time; keep under systematic review." However, monitoring is better thought of as the process of observing certain a priori defined performance indicators of our software system, such as those measuring the impact on the end user experience, like latency or error counts, and using their values to alert us when these signals indicate abnormal behavior of the system. Metrics, logs, and traces can all be used as a means to extract those signals from the application. We can then reserve the term "observability" for situations when we have a human operator proactively asking questions that were not predefined. As Bryan Cantrill put it in his talk, this process is debugging, and we need to "use our brains when debugging." Monitoring does not require a human operator; it can and should be fully automated.

"If you want to talk about (metrics, logs, and traces) as pillars of observability–great.
The human is the foundation of observability!"
-- Bryan Cantrill
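
To make the monitoring side of this distinction concrete, here is a minimal Python sketch of an automated check over indicators that were defined in advance. The indicator names, thresholds, and the `evaluate` helper are all hypothetical and chosen only for illustration; they are not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Indicator:
    """A performance indicator defined a priori (hypothetical example type)."""
    name: str
    threshold: float  # alert when the observed value exceeds this


def evaluate(indicators: List[Indicator], observed: Dict[str, float]) -> List[str]:
    """Return one alert message per indicator whose observed value is too high."""
    alerts = []
    for indicator in indicators:
        value = observed.get(indicator.name)
        if value is not None and value > indicator.threshold:
            alerts.append(f"{indicator.name}={value} exceeds {indicator.threshold}")
    return alerts


# Assumed indicators: both the names and the limits are illustrative only.
INDICATORS = [
    Indicator("p99_latency_ms", 500.0),
    Indicator("error_rate", 0.01),
]

alerts = evaluate(INDICATORS, {"p99_latency_ms": 730.0, "error_rate": 0.002})
print(alerts)
```

Nothing in this loop needs a human: the questions ("is latency too high?", "is the error rate too high?") were fixed before the data arrived. Observability, in the sense used here, begins when someone asks a question that is not in the `INDICATORS` list.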

In the end, the so-called "three pillars of observability" (metrics, logs, and traces) are just tools, or more precisely, different ways of extracting sensor data from the applications. Even with metrics, modern time series solutions like Prometheus, InfluxDB, or Uber's M3 are capable of capturing time series with many labels, such as which host emitted a particular value of a counter. Not all labels may be useful for monitoring, since a single misbehaving service instance in a cluster of thousands does not warrant an alert that wakes up an engineer. But when we are investigating an outage and trying to narrow down the scope of the problem, the labels can be very useful as observability signals.
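
As a rough illustration of how labels turn a metric into an observability signal, the following Python sketch keeps a counter broken down by label set. The `LabeledCounter` class, the label names, and the host and service values are hypothetical; this is not the API of Prometheus, InfluxDB, or M3, only a toy model of the idea.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]


class LabeledCounter:
    """A toy counter that keeps one value per distinct set of labels."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._values: Dict[Labels, float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Each unique combination of labels is its own time series.
        self._values[frozenset(labels.items())] += amount

    def total(self) -> float:
        """Aggregate view: the kind of value an alert might be based on."""
        return sum(self._values.values())

    def by_label(self, key: str) -> Dict[str, float]:
        """Break the total down by one label, e.g. to find a misbehaving host."""
        breakdown: Dict[str, float] = defaultdict(float)
        for labels, value in self._values.items():
            breakdown[dict(labels).get(key, "")] += value
        return dict(breakdown)


errors = LabeledCounter("request_errors")
errors.inc(host="host-1", service="billing")
errors.inc(host="host-2", service="billing")
errors.inc(amount=40, host="host-7", service="billing")

by_host = errors.by_label("host")
worst_host = max(by_host, key=by_host.get)
print(errors.total(), by_host, worst_host)
```

The aggregate from `total()` is what a monitoring rule would typically watch, while the per-host breakdown is what an engineer would query during an investigation to narrow the problem down to a single instance.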
