Mastering Distributed Tracing

By: Yuri Shkuro
Overview of this book

Mastering Distributed Tracing will equip you to operate and enhance your own tracing infrastructure. Through practical exercises and code examples, you will learn how end-to-end tracing can be used as a powerful application performance management and comprehension tool. The rise of Internet-scale companies, like Google and Amazon, ushered in a new era of distributed systems operating on thousands of nodes across multiple data centers. Microservices increased that complexity, often exponentially. It is harder to debug these systems, track down failures, detect bottlenecks, or even simply understand what is going on. Distributed tracing focuses on solving these problems for complex distributed systems. Today, tracing standards have developed and we have much faster systems, making instrumentation less intrusive and data more valuable. Yuri Shkuro, the creator of Jaeger, a popular open-source distributed tracing system, delivers end-to-end coverage of the field in Mastering Distributed Tracing. Review the history and theoretical foundations of tracing; solve the data-gathering problem through code instrumentation, with open standards like OpenTracing, W3C Trace Context, and OpenCensus; and discuss the benefits and applications of a distributed tracing infrastructure for understanding and profiling complex systems.

Traditional monitoring tools


Traditional monitoring tools were designed for monolithic systems, observing the health and behavior of a single application instance. They may be able to tell us a story about that single instance, but they know almost nothing about the distributed transaction that passed through it. These tools lack the context of the request.

Metrics

It goes like this: "Once upon a time…something bad happened. The end." How do you like this story? This is what the chart in Figure 1.5 tells us. It's not completely useless; we do see a spike and we could define an alert to fire when this happens. But can we explain or troubleshoot the problem?

Figure 1.5: A graph of two time series representing (hypothetically) the volume of traffic to a service

Metrics, or stats, are numerical measures recorded by the application, such as counters, gauges, or timers. Metrics are very cheap to collect, since numeric values can be easily aggregated to reduce the overhead of transmitting that data to the monitoring system. They are also fairly accurate, which is why they are very useful for the actual monitoring (as the dictionary defines it) and alerting.

Yet the same capacity for aggregation is what makes metrics ill-suited for explaining the pathological behavior of the application. By aggregating data, we are throwing away all the context we had about the individual transactions.
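To make this concrete, here is a minimal sketch (in Python, using the prometheus_client library; the metric names and labels are hypothetical and not taken from the book's examples) of the kind of instrumentation a service typically emits:

import time

from prometheus_client import Counter, Histogram

# Hypothetical metric names; counters and histograms are aggregated
# before they ever reach the monitoring system.
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total number of HTTP requests received",
    ["endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["endpoint"],
)

def handle_request(endpoint):
    start = time.time()
    status = "200"  # ...actual request handling would happen here...
    REQUEST_COUNT.labels(endpoint=endpoint, status=status).inc()
    REQUEST_LATENCY.labels(endpoint=endpoint).observe(time.time() - start)

The histogram can tell us that the 99th percentile latency spiked, but not which request, which user, or which downstream call was responsible; that per-request context is gone by the time the values are aggregated.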

In Chapter 11, Integration with Metrics and Logs, we will talk about how integration with tracing and context propagation can make metrics more useful by providing them with the lost context. Out of the box, however, metrics are a poor tool to troubleshoot problems within microservices-based applications.

Logs

Logging is an even more basic observability tool than metrics. Every programmer learns their first programming language by writing a program that prints (that is, logs) "Hello, World!" Similar to metrics, logs struggle with microservices because each log stream only tells us about a single instance of a service. However, evolving programming paradigms create other problems for logs as a debugging tool. Ben Sigelman, who built Google's distributed tracing system Dapper [7], explained it in his KubeCon 2016 keynote talk [8] in terms of four types of concurrency (Figure 1.6):

Figure 1.6: Evolution of concurrency

Years ago, applications like early versions of Apache HTTP Server handled concurrency by forking child processes and having each process handle a single request at a time. Logs collected from that single process could do a good job of describing what happened inside the application.

Then came multi-threaded applications and basic concurrency. A single request would typically be executed by a single thread sequentially, so as long as we included the thread name in the logs and filtered by that name, we could still get a reasonably accurate picture of the request execution.
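As a minimal illustration (a Python sketch, not code from the book), tagging logs with the thread name only requires adding it to the log format:

import logging
import threading

# Include the thread name in every log line so that a single request,
# handled entirely by one thread, can be reconstructed by filtering.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(threadName)s] %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)

def handle_request(request_id):
    log.info("started handling request %s", request_id)
    # ...business logic runs sequentially on this one thread...
    log.info("finished handling request %s", request_id)

threads = [
    threading.Thread(target=handle_request, args=(i,), name=f"worker-{i}")
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

Filtering the output on [worker-1] recovers that request's timeline, but only for as long as the request never leaves its thread.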

Then came asynchronous concurrency, with asynchronous and actor-based programming, executor pools, futures, promises, and event-loop-based frameworks. The execution of a single request may start on one thread, continue on another, and finish on a third. In event-loop systems like Node.js, all requests are processed on a single thread: when the execution needs to perform I/O, it is put into a wait state, and once the I/O completes, it resumes after waiting its turn in the queue.

Both of these asynchronous concurrency models result in each thread switching between multiple different requests that are all in flight. Observing the behavior of such a system from the logs is very difficult, unless we annotate all logs with some kind of unique id representing the request rather than the thread, a technique that actually gets us close to how distributed tracing works.
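One way to do that (a sketch in Python, assuming asyncio and contextvars; none of these names come from the book) is to store a request id in task-local context and attach it to every log record:

import asyncio
import contextvars
import logging

# The request id follows the asyncio task, even though many requests
# share one event-loop thread.
request_id = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id.get()
        return True

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [req=%(request_id)s] %(message)s",
)
log = logging.getLogger("app")
log.addFilter(RequestIdFilter())

async def handle_request(rid):
    request_id.set(rid)        # bound to this task's context only
    log.info("started")
    await asyncio.sleep(0.01)  # yields the thread to other requests
    log.info("finished")

async def main():
    await asyncio.gather(*(handle_request(f"req-{i}") for i in range(3)))

asyncio.run(main())

Grouping the output by req=req-1 now reconstructs one request's logs regardless of which thread or task emitted them, which is essentially the core idea behind trace context propagation.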

Finally, microservices introduced what we can call "distributed concurrency." Not only can the execution of a single request jump between threads, but it can also jump between processes, when one microservice makes a network call to another. Trying to troubleshoot request execution from such logs is like debugging without a stack trace: we get small pieces, but no big picture.

In order to reconstruct the flight of the request from the many log streams, we need powerful log aggregation technology and a distributed context propagation capability to tag all those logs in different processes with a unique request id that we can use to stitch them back together. We might as well be using a real distributed tracing infrastructure at this point! Yet even after tagging the logs with a unique request id, we still cannot assemble them into an accurate sequence, because the timestamps from different servers are generally not comparable due to clock skew. In Chapter 11, Integration with Metrics and Logs, we will see how tracing infrastructure can be used to provide the missing context to the logs.
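As a rough sketch of what that tagging involves (Python; the X-Request-ID header is a common convention used here for illustration, not something mandated by the book, and real tracing systems propagate richer context such as W3C Trace Context headers), every service has to extract the id from incoming requests and inject it into outgoing ones:

import logging
import uuid

log = logging.getLogger(__name__)

REQUEST_ID_HEADER = "X-Request-ID"

def on_inbound_request(headers: dict) -> str:
    # Reuse the caller's request id if present, otherwise start a new one.
    rid = headers.get(REQUEST_ID_HEADER) or uuid.uuid4().hex
    log.info("[req=%s] handling inbound request", rid)
    return rid

def on_outbound_request(rid: str, headers: dict) -> dict:
    # Pass the same id to the downstream service so its logs carry it too.
    headers[REQUEST_ID_HEADER] = rid
    log.info("[req=%s] calling downstream service", rid)
    return headers

Every hop in the call chain has to repeat this extract-and-inject step, and even then the stitched logs can only be grouped by request, not reliably ordered, because of the clock skew described above.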