
The observability challenge of microservices


By adopting microservices architectures, organizations are expecting to reap many benefits, from better scalability of components to higher developer productivity. There are many books, articles, and blog posts written on this topic, so I will not go into that. Despite the benefits and eager adoption by companies large and small, microservices come with their own challenges and complexity. Companies like Twitter and Netflix were successful in adopting microservices because they found efficient ways of managing that complexity. Vijay Gill, Senior VP of Engineering at Databricks, goes as far as saying that the only good reason to adopt microservices is to be able to scale your engineering organization and to "ship the org chart" [2].

Vijay Gill's opinion may not be a popular one yet. A 2018 "Global Microservices Trends" study [6] by Dimensional Research® found that over 91% of interviewed professionals are using or have plans to use microservices in their systems. At the same time, 56% say each additional microservice "increases operational challenges," and 73% find "troubleshooting is harder" in a microservices environment. There is even a famous tweet about adopting microservices:

Figure 1.3: The tweet in question

Consider Figure 1.4, which gives a visual representation of a subset of Uber's microservices architecture, rendered by Uber's distributed tracing platform, Jaeger. It is often called a service dependency graph or a topology map. The circles (nodes in the graph) represent different microservices. The edges are drawn between nodes that communicate with each other. The diameter of a node is proportional to the number of other microservices connecting to it, and the width of an edge is proportional to the volume of traffic going through that edge.

The picture is already so complex that we don't even have space to include the names of the services (in the real Jaeger UI, you can see them by hovering the mouse over the nodes). Every time a user takes an action in the mobile app, a request is executed by the architecture, and it may require dozens of different services to participate in order to produce a response. Let's call the path of this request a distributed transaction.

Figure 1.4: A visual representation of a subset of Uber's microservices architecture and a hypothetical transaction

So, what are the challenges of this design? There are quite a few:

  • In order to run these microservices in production, we need an advanced orchestration platform that can schedule resources, deploy containers, auto-scale, and so on. Operating an architecture of this scale manually is simply not feasible, which is why projects like Kubernetes became so popular.

  • In order to communicate, microservices need to know how to find each other on the network, how to route around problematic areas, how to perform load balancing, how to apply rate limiting, and so on. These functions are delegated to advanced RPC frameworks or external components like network proxies and service meshes.

  • Splitting a monolith into many microservices may actually decrease reliability. Suppose we have 20 components in the application and all of them are required to produce a response to a single request. When we run them in a monolith, our failure modes are restricted to bugs and potentially a crash of the whole server running the monolith. But if we run the same components as microservices, on different hosts and separated by a network, we introduce many more potential failure points, from network hiccups to resource constraints due to noisy neighbors. Even if each microservice succeeds in 99.9% of cases, the whole application, which requires all of them to work for a given request, can only succeed 0.999²⁰ ≈ 98.0% of the time. Distributed, microservices-based applications must become more complicated, for example by implementing retries or opportunistic parallel reads, in order to maintain the same level of availability (this calculation and the one in the next bullet are worked through in the short sketch after this list).

  • The latency may also increase. Assume each microservice has 1 ms average latency, but the 99th percentile is 1 s. A transaction touching just one of these services has a 1% chance to take ≥ 1 s. A transaction touching 100 of these services has a 1 − (1 − 0.01)¹⁰⁰ ≈ 63% chance to take ≥ 1 s.

  • Finally, the observability of the system is dramatically reduced if we try to use traditional monitoring tools.
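The availability and latency figures quoted above are easy to verify with a back-of-the-envelope calculation. The following minimal Python sketch (the function names are mine, and it assumes that failures and slow responses are independent across services) reproduces both numbers:

```python
# Quick numerical check of the two probability claims above,
# assuming failures and slow responses are independent across services.

def overall_availability(per_service_success: float, num_services: int) -> float:
    """Probability that every service in the request path succeeds."""
    return per_service_success ** num_services

def chance_of_hitting_tail(per_service_tail_prob: float, num_services: int) -> float:
    """Probability that at least one service responds at its tail latency."""
    return 1 - (1 - per_service_tail_prob) ** num_services

if __name__ == "__main__":
    # 20 services, each 99.9% reliable -> ~98.0% end-to-end availability
    print(f"end-to-end availability: {overall_availability(0.999, 20):.3f}")
    # 100 services, each with a 1% chance of a >= 1 s response -> ~63% chance overall
    print(f"chance of hitting a tail latency: {chance_of_hitting_tail(0.01, 100):.2f}")
```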

When we see that some requests to our system are failing or slow, we want our observability tools to tell us the story of what happened to those requests. We want to be able to ask questions like these:

  • Which services did a request go through?

  • What did every microservice do when processing the request?

  • If the request was slow, where were the bottlenecks?

  • If the request failed, where did the error happen?

  • How different was the execution of the request from the normal behavior of the system?

    • Were the differences structural, that is, some new services were called, or vice versa, some usual services were not called?

    • Were the differences related to performance, that is, some service calls took a longer or shorter time than usual?

  • What was the critical path of the request?

  • And perhaps most importantly, if selfishly, who should be paged?

Unfortunately, traditional monitoring tools are ill-equipped to answer these questions for microservices architectures.