Now that you know the core concepts around distributed tracing, let’s see how we can use the observability stack to investigate common distributed system problems.
Before we talk about problems, let’s establish a baseline representing the behavior of a healthy system. We also need it to make data-driven decisions that support common design and development tasks.
Generic indicators that describe the performance of each service include latency, throughput, and error rate.
Your system might need other indicators to measure durability or data correctness.
Each of these signals is useful when it includes an API route, a status code, and other context properties. For example, the error rate could be low overall but high for specific users or API routes.
Measuring signals on the server and client sides, whenever possible, gives you a better picture. For example, you can detect network failures and avoid “it works on my machine” situations when clients see issues and servers don’t.
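To make these dimensions available, record them when you emit the signal. Here is a minimal sketch using System.Diagnostics.Metrics; the meter name, instrument name, and tag keys are illustrative (the tag keys loosely follow OpenTelemetry HTTP semantic conventions), not something this chapter prescribes.

using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class RequestMetrics
{
    // Hypothetical meter name; register it with your metrics pipeline.
    private static readonly Meter Meter = new("MemeService.Frontend");

    private static readonly Histogram<double> Duration =
        Meter.CreateHistogram<double>("http.server.request.duration", unit: "ms");

    // One measurement per request, tagged with the dimensions you want to slice by.
    public static void Record(double elapsedMs, string route, int statusCode) =>
        Duration.Record(elapsedMs,
            new KeyValuePair<string, object?>("http.route", route),
            new KeyValuePair<string, object?>("http.response.status_code", statusCode));
}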
Let’s divide performance issues into two overlapping categories: individual requests that take too long, and widespread degradations that affect many requests at once.
Figure 1.12 – Azure Monitor latency distribution visualization, with a median request (the 50th percentile) taking around 80 ms and the 95th percentile around 250 ms
Individual issues can be caused by an unfortunate chain of events – transient network issues, high contention in optimistic concurrency algorithms, hardware failures, and so on.
Distributed tracing is an excellent tool for investigating such issues. If you have a bug report, you might already have the trace context for the problematic operation. To make that possible, show the traceparent value on the web page, return traceresponse or include it in a document users can record, or log traceresponse when sending requests to your service.
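One way to do that in ASP.NET Core (a sketch, not the book’s sample code) is to return the W3C-formatted Activity.Current identifier in a response header; the header name and route below are hypothetical.

using System.Diagnostics;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.Use(async (context, next) =>
{
    // ASP.NET Core populates Activity.Current for each incoming request;
    // Activity.Id is the W3C traceparent-formatted identifier.
    if (Activity.Current is { } activity)
    {
        context.Response.Headers["trace-id"] = activity.Id;
    }
    await next();
});

app.MapGet("/memes/{id}", (string id) => Results.Ok(new { id }));
app.Run();

Showing the same value on an error page gives users something concrete to attach to a bug report.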
So, if you know the trace context, you can start by checking the trace view. Figure 1.13 shows a long request caused by transient network issues.
Figure 1.13 – A request with high latency caused by transient network issues and retries
The frontend request took about 2.6 seconds and the time was spent on the storage service downloading meme content. We see three tries of Azure.Core.Http.Request, each of which was fast, and the time between them corresponds to the back-off interval. The last try was successful.
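If the storage calls go through an Azure SDK client, as the Azure.Core.Http.Request spans suggest, the back-off behavior comes from the client’s retry options, which you can tune. A hedged sketch with purely illustrative values:

using Azure.Core;
using Azure.Storage.Blobs;

var options = new BlobClientOptions();
options.Retry.Mode = RetryMode.Exponential;            // back-off grows between tries
options.Retry.MaxRetries = 3;
options.Retry.Delay = TimeSpan.FromMilliseconds(400);  // initial back-off
options.Retry.MaxDelay = TimeSpan.FromSeconds(5);

var storage = new BlobServiceClient(
    Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING"), options);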
If you don’t have a trace-id, or perhaps the trace was sampled out, you might still be able to find similar operations by filtering on context attributes and high latency.
For example, in Jaeger, you can filter spans based on the service, span name, attributes, and duration, which helps you to find a needle in a haystack.
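Filtering only works if spans carry the attributes you care about. Below is a sketch of enriching spans via System.Diagnostics.ActivitySource; the source name, span name, and tags are made up for illustration.

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

internal static class Tracing
{
    // Register this source with your tracing pipeline (e.g., AddSource in OpenTelemetry).
    public static readonly ActivitySource Source = new("MemeService.Storage");
}

public class MemeStore
{
    public async Task<byte[]> DownloadAsync(string memeId, CancellationToken ct)
    {
        using var activity = Tracing.Source.StartActivity("DownloadMeme");
        activity?.SetTag("meme.id", memeId);
        activity?.SetTag("storage.account", "memes-prod-01");

        // ... the actual download is elided ...
        await Task.Delay(10, ct);
        return Array.Empty<byte>();
    }
}

In Jaeger, you could then combine a tag filter such as storage.account=memes-prod-01 with a minimum duration to narrow the search.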
In some cases, you will end up with mysterious gaps – the service was up and running but spent significant time doing nothing, as shown in Figure 1.14:
Figure 1.14 – A request with high latency and gaps in spans
If you don’t get enough data from traces, check whether there are any logs available in the scope of this span.
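For that, log records need to carry the trace context. A minimal sketch with Microsoft.Extensions.Logging follows; OpenTelemetry’s logging integration achieves the same correlation, and the logger category and message here are illustrative.

using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(logging =>
{
    // Stamp every log record with the current trace and span IDs.
    logging.Configure(options => options.ActivityTrackingOptions =
        ActivityTrackingOptions.TraceId | ActivityTrackingOptions.SpanId);
    logging.AddJsonConsole();
});

var logger = loggerFactory.CreateLogger("MemeService.Storage");
logger.LogInformation("Meme {MemeId} download started", "42");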
You might also check resource utilization metrics – was there a CPU spike, or maybe a garbage collection pause at this moment? You might find some correlation using timestamps and context, but it’s impossible to tell whether this was a root cause or a coincidence.
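Exporting runtime metrics makes these questions easier to answer after the fact. A hedged sketch with the OpenTelemetry runtime instrumentation package (the package and exporter choice are assumptions; any pipeline that captures CPU and GC counters works):

using OpenTelemetry;
using OpenTelemetry.Metrics;

// Collects GC, thread pool, and other CLR counters alongside your custom meters.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddRuntimeInstrumentation()        // requires OpenTelemetry.Instrumentation.Runtime
    .AddMeter("MemeService.Frontend")   // the custom meter from the earlier sketch
    .AddOtlpExporter()
    .Build();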
If you have a continuous profiler that correlates profiles to traces (yes, they can do it with Activity.Current), you can check whether there are profiles available for this or similar operations.
We’ll see how to investigate this further with .NET diagnostics tools in Chapter 4, Low-Level Performance Analysis with Diagnostic Tools, but if you’re curious about what happened in Figure 1.14, the service read a network stream that was not instrumented.
Even though we talk about individual performance issues, in many cases we don’t know how widespread they are, especially when we’re at the beginning of an incident. Metrics and rich queries across traces can be used to find out how common a problem is. If you’re on call, checking whether an issue is widespread or becoming more frequent is usually more urgent than finding the root cause.
Note
Long-tail latency requests are inevitable in distributed systems, but there are always opportunities for optimization, with caching, collocation, adjusting timeouts and the retry policy, and so on. Monitoring P95 latency and analyzing traces for long-tail issues helps you find such areas for improvement.
Performance problems manifest as latency or throughput degradation beyond usual variations. Assuming you fail fast or rate-limit incoming calls, you might also see an increase in the error rate for 408, 429, or 503 HTTP status codes.
Such issues can start as a slight decrease in dependency availability, causing a service to retry. With outgoing requests taking more resources than usual, other operations slow down, and the time to process client requests grows, along with the number of active requests and connections.
It could be challenging to understand what happened first; you might see high CPU usage and a relatively high GC rate – all symptoms you would usually see on an overloaded system, but nothing that stands out. Assuming you measure the dependency throughput and error rate, you could see the anomaly there, but it might be difficult to tell whether it’s a cause or effect.
Individual distributed traces are rarely useful in such cases – each operation takes longer, and there are more transient errors, but traces may look normal otherwise.
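Failing fast at least makes the problem visible: rejected requests show up as 429s in the error-rate signal instead of hiding inside a slow, cascading degradation. A minimal sketch using ASP.NET Core’s built-in rate limiting (available since .NET 7; the policy name and limits are illustrative):

using System.Threading.RateLimiting;
using Microsoft.AspNetCore.RateLimiting;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
    options.AddFixedWindowLimiter("per-second", limiter =>
    {
        limiter.PermitLimit = 100;               // requests allowed per window
        limiter.Window = TimeSpan.FromSeconds(1);
        limiter.QueueLimit = 0;                  // reject instead of queueing
    });
});

var app = builder.Build();
app.UseRateLimiter();
app.MapGet("/memes/{id}", (string id) => Results.Ok(new { id }))
   .RequireRateLimiting("per-second");
app.Run();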
Here’s a list of trivial things to check first; they also serve as a foundation for more advanced analysis. Did a new deployment roll out? The service.version resource attribute lets you narrow a regression down to a specific version. If you include feature flags on your traces or events, you can query them to check whether the degradation is limited to (or started from) the requests with a new feature enabled. Attribute analysis can help here as well – assuming just one of your cloud storage accounts or database partitions is misbehaving, you will see it.
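A hedged sketch of stamping telemetry with service.version using the OpenTelemetry .NET SDK (the service name, version, and exporter choice are placeholders):

using OpenTelemetry;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .ConfigureResource(resource => resource.AddService(
        serviceName: "meme-frontend",
        serviceVersion: "1.4.2"))       // emitted as the service.version resource attribute
    .AddAspNetCoreInstrumentation()     // requires OpenTelemetry.Instrumentation.AspNetCore
    .AddOtlpExporter()
    .Build();

Feature flags can be recorded the same way, as span attributes or events (for example, with Activity.SetTag), so you can slice telemetry by them later.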
There are more questions to ask about infrastructure, the cloud provider, and other aspects. The point of this exercise is to narrow down and understand the problem as much as possible. If the problem is not in your code, investigation helps to find a better way to handle problems like these in the future and gives you an opportunity to fill the gaps in your telemetry, so next time something similar happens, you can identify it faster.
If you suspect a problem in your code, .NET provides a set of signals and tools that help you investigate high CPU usage, memory leaks, deadlocks, and thread pool starvation, and profile your code, as we’ll see in Chapter 4, Low-Level Performance Analysis with Diagnostic Tools.