Observability with Grafana

By: Rob Chapman, Peter Holmes

Overview of this book

To overcome application monitoring and observability challenges, Grafana Labs offers a modern, highly scalable, cost-effective Loki, Grafana, Tempo, and Mimir (LGTM) stack along with Prometheus for the collection, visualization, and storage of telemetry data. Beginning with an overview of observability concepts, this book teaches you how to instrument code and monitor systems in practice using standard protocols and Grafana libraries. As you progress, you’ll create a free Grafana Cloud instance and deploy a demo application to a Kubernetes cluster to delve into the implementation of the LGTM stack. You’ll learn how to connect Grafana Cloud to AWS, GCP, and Azure to collect infrastructure data, build interactive dashboards, make use of service level indicators and objectives to produce great alerts, and leverage the AI and ML capabilities to keep your systems healthy. You’ll also explore real user monitoring with Faro and performance monitoring with Pyroscope and k6. Advanced concepts like architecting a Grafana installation, using automation and infrastructure as code tools for DevOps processes, troubleshooting strategies, and best practices to avoid common pitfalls will also be covered. After reading this book, you’ll be able to use the Grafana stack to deliver amazing operational results for the systems your organization uses.


Telemetry types and technologies

The boring but important part of observability tools is telemetry – capturing data that is useful, shipping it from place to place, and producing visualizations, alerts, and reports that offer value to the organization.

Three main types of telemetry are used to build monitoring and observability systems – metrics, logs, and distributed traces. Other telemetry types may be used by some vendors and in particular circumstances. We will touch on these here, but they will be explored in more detail in Chapters 12 and 13 of this book.

Metrics

Metrics can be thought of as numeric data that is recorded at a point in time and enriched with labels or dimensions to enable analysis. Metrics are frequently generated and are easy to search, making them ideal for determining whether something is wrong or unusual. Let’s look at an example of metrics showing temporal changes:

Figure 1.3 – Metrics showing changes over time

Taking our example of the Panama Canal, we could represent the water level in each lock as a metric, to be measured at regular intervals. To be able to use the data effectively, we might add some of these labels:

The lock name: Agua Clara
The lock chamber: Lower lock
The canal: Panama Canal
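
As a rough illustration of a labeled metric, the following Python sketch models one water-level reading together with the three labels listed above. The class name, the metric name, the label keys, the extra "Middle lock" chamber, and the readings themselves are all hypothetical choices made for this example; it is not how any particular metrics library stores data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricSample:
    """One numeric reading taken at a point in time, plus descriptive labels."""
    name: str
    value: float
    timestamp: datetime
    labels: tuple  # sorted (key, value) pairs


def make_sample(name, value, timestamp, **labels):
    """Build a sample, storing labels in a stable, sorted order."""
    return MetricSample(name, value, timestamp, tuple(sorted(labels.items())))


def select(samples, **wanted):
    """Return the samples whose labels include every wanted key/value pair."""
    return [s for s in samples if set(wanted.items()) <= set(s.labels)]


# Illustrative readings only; the values are made up for this sketch.
now = datetime(2016, 6, 26, 20, 30, tzinfo=timezone.utc)
samples = [
    make_sample("water_level_metres", 8.2, now,
                lock_name="Agua Clara", lock_chamber="Lower lock",
                canal="Panama Canal"),
    make_sample("water_level_metres", 17.5, now,
                lock_name="Agua Clara", lock_chamber="Middle lock",
                canal="Panama Canal"),
]

lower_lock = select(samples, lock_chamber="Lower lock")
print(lower_lock[0].value)
```

The labels are what make the numbers useful: the same metric name can be filtered down to a single chamber or widened to the whole canal without changing how the reading is recorded.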

Logs

Logs are generally treated as unstructured string data. They are recorded at a point in time and usually contain a large amount of information about what is happening. While logs can be structured, there is no guarantee that the structure will persist, because the log producer controls the structure of the log. Let’s look at an example:

Jun 26 2016 20:31:01 pc-ac-g1 gate-events no obstructions seen
Jun 26 2016 20:32:01 pc-ac-g1 gate-events starting motors
Jun 26 2016 20:32:30 pc-ac-g1 gate-events motors engaged successfully
Jun 26 2016 20:35:30 pc-ac-g1 gate-events stopping motors
Jun 26 2016 20:35:30 pc-ac-g1 gate-events gate open complete

In our example, the various operations involved in opening or closing a lock gate could be represented as logs.
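
To show how fields might be pulled out of lines like these, here is a minimal Python sketch that parses the five example entries with a regular expression. The field names (timestamp, host, source, message) are an interpretation of the example format, not something the log format itself guarantees, which is exactly the fragility described above.

```python
import re
from datetime import datetime

LOG_LINES = [
    "Jun 26 2016 20:31:01 pc-ac-g1 gate-events no obstructions seen",
    "Jun 26 2016 20:32:01 pc-ac-g1 gate-events starting motors",
    "Jun 26 2016 20:32:30 pc-ac-g1 gate-events motors engaged successfully",
    "Jun 26 2016 20:35:30 pc-ac-g1 gate-events stopping motors",
    "Jun 26 2016 20:35:30 pc-ac-g1 gate-events gate open complete",
]

# Assumed layout: "<Mon DD YYYY HH:MM:SS> <host> <source> <free-text message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<source>\S+) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # %b expects English month abbreviations under the default C locale.
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%b %d %Y %H:%M:%S"
    )
    return fields


parsed = [parse_line(line) for line in LOG_LINES]
for entry in parsed:
    print(entry["timestamp"].isoformat(), entry["host"], entry["message"])
```

If the producer changed the timestamp format or added a field, `parse_line` would start returning `None`, so a parser like this has to be maintained alongside the system that writes the logs.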

Almost every system produces logs, and they often give very detailed information. This is great for understanding what happened. However, the volume of data presents two problems:

Searching can be inefficient and slow.
As the data is in text format, knowing what to search for can be difficult. For example, “error occurred”, “process failed”, and “action did not complete successfully” could all be used to describe a failure, but they share no common string to search for.
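
The second problem can be sketched in a few lines of Python. The three failure messages come from the text above; the fourth message is taken from the earlier gate example, and the list of failure phrases is a hypothetical one that somebody would have to curate by hand.

```python
MESSAGES = [
    "error occurred",
    "process failed",
    "action did not complete successfully",
    "gate open complete",
]


def search(messages, term):
    """Plain substring search, similar to a naive text search over logs."""
    return [m for m in messages if term in m.lower()]


# A single search term finds only one of the three failures.
naive_hits = search(MESSAGES, "error")

# A hand-maintained phrase list (an assumption for this sketch) finds all three.
FAILURE_PHRASES = ("error", "failed", "did not complete")


def looks_like_failure(message):
    """True if the message contains any of the known failure phrases."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in FAILURE_PHRASES)


broad_hits = [m for m in MESSAGES if looks_like_failure(m)]

print(naive_hits)
print(broad_hits)
```

Note that searching for "success" would match the third failure message, so free-text searches can produce false positives as well as misses.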

Let’s consider a real log entry from a computer system to see how log data is usually represented:

Figure 1.4 – Logs showing discrete events in time

We can clearly see that we have a number of fields that have been extracted from the log entry by the system. These fields detail where the log entry originated from, what time it occurred, and various other items.

Distributed traces

Distributed traces show the end-to-end journey of an action. They are captured from every step that is taken to complete the action. Let’s imagine a trace that covers the passage of a ship through the lock system. We will be interested in the time a ship enters and leaves each lock, and we will want to be able to compare different ships using the system. A full passage can be given an identifier, usually called a trace ID. Traces are made up of spans. In our example, a span would cover the entry and exit for each individual lock. These spans are given a second identifier, called a span ID. To tie these two together, each span in a trace references the trace ID for the whole trace. The following screenshot shows an example of how a distributed trace is represented for a computer application:

Figure 1.5 – Traces showing the relationship of actions over time

Now that we have introduced metrics, logs, and traces, let’s consider a more detailed example of a ship passing through the locks, and how each telemetry type would be produced in this process:

1. Ship enters the first lock:
- Span ID created
- Trace ID created
- Contextual information is added to the span, for example, a ship identification
- Key events are recorded in the span with time stamps, for example, gates are opened and closed
2. Ship exits the first lock:
- Span closed and submitted to the recording system
- Second lock notified of trace ID and span ID
3. Ship enters the second lock:
- Span ID created
- Trace ID added to span
- Contextual information is added to the span
- Key events recorded in the span with time stamps
4. Ship exits the second lock:
- Span closed and submitted to the recording system
- Third lock notified of trace ID and span ID
5. Ship enters the third lock:
- Repeat step 3
6. Ship exits the third lock:
- Span closed and submitted to the recording system
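
The passage described in these steps can be sketched in Python as follows. This is a toy model rather than a real tracing library: the `Span` class, the function names, the lock names, the ship identifier, and the times (in minutes) are all hypothetical. Storing the span ID passed on by the previous lock as a parent reference is also an assumption, chosen because it mirrors how tracing systems commonly link spans.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    parent_span_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    start: Optional[float] = None
    end: Optional[float] = None

    def add_event(self, name, at):
        """Record a key event with its time stamp."""
        self.events.append((at, name))


def new_id():
    return uuid.uuid4().hex


def enter_lock(lock_name, ship_id, entered_at, context=None):
    """Open a span. The first lock creates the trace ID; later locks reuse it."""
    trace_id, parent = context if context else (new_id(), None)
    return Span(trace_id=trace_id, span_id=new_id(), name=lock_name,
                parent_span_id=parent, attributes={"ship.id": ship_id},
                start=entered_at)


def exit_lock(span, exited_at, recorder):
    """Close the span, submit it, and return the context for the next lock."""
    span.end = exited_at
    recorder.append(span)
    return (span.trace_id, span.span_id)


# Hypothetical passage: (lock name, entry time, exit time) in minutes.
PASSAGE = [("lock-1", 0, 30), ("lock-2", 35, 65), ("lock-3", 70, 100)]

recorder = []   # stands in for the recording system
context = None  # no trace exists before the first lock
for lock_name, entered_at, exited_at in PASSAGE:
    span = enter_lock(lock_name, "ship-001", entered_at, context)
    span.add_event("gates opened", entered_at)
    span.add_event("gates closed", exited_at)
    context = exit_lock(span, exited_at, recorder)

for span in recorder:
    print(span.name, span.trace_id == recorder[0].trace_id, span.end - span.start)
```

Because every span carries the same trace ID, the recording system can reassemble the whole passage later, and comparing different ships becomes a matter of comparing traces.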

Now let’s look at some other telemetry types.

Other telemetry types

Metrics, logs, and traces are often called the three pillars or the golden triangle of observability. As we outlined earlier, observability is the ability to understand a system. While metrics, logs, and traces give us a very good ability to understand a system, they are not the only signals we might need; which signals matter depends on the abstraction layer at which we need to observe the system. For instance, when looking at a very detailed level, we may be very interested in the stack trace of an application’s activity at the CPU and RAM level. Conversely, if we are interested in the execution of a CI/CD pipeline, we may just be interested in whether a deployment occurred and nothing more.

Profiling data (stack traces) can give us a very detailed technical view of the system’s use of resources such as CPU cycles or memory. With cloud services often charged per hour for these resources, this kind of detailed analysis can easily create cost savings.

Similarly, events can be consumed from a platform, such as CI/CD. These can offer a huge amount of insight that can reduce the Mean Time to Recovery (MTTR). Imagine responding to an out-of-hours alert and seeing that a new version of a service was deployed immediately before the issues started occurring. Even better, imagine not having to wake up because the deployment process could check for failures and roll back automatically. Events differ from logs only in that an event represents a whole action. In our earlier example in the Logs section, we created five logs, but all of these referred to stages of the same event (opening the lock gate). Because event is a relatively generic term, it is also used with other meanings.
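
The difference between logs and an event can be sketched by collapsing the five gate logs from the Logs section into a single record. The time stamps and messages come from that example; the function name and the shape of the resulting event are assumptions made for illustration.

```python
from datetime import datetime

# The five log entries from the gate-opening example.
GATE_LOGS = [
    (datetime(2016, 6, 26, 20, 31, 1), "no obstructions seen"),
    (datetime(2016, 6, 26, 20, 32, 1), "starting motors"),
    (datetime(2016, 6, 26, 20, 32, 30), "motors engaged successfully"),
    (datetime(2016, 6, 26, 20, 35, 30), "stopping motors"),
    (datetime(2016, 6, 26, 20, 35, 30), "gate open complete"),
]


def summarize_event(name, logs):
    """Collapse the logs for one action into a single event record."""
    timestamps = [ts for ts, _ in logs]
    start, end = min(timestamps), max(timestamps)
    return {
        "event": name,
        "start": start,
        "end": end,
        "duration_seconds": (end - start).total_seconds(),
        "log_count": len(logs),
    }


gate_event = summarize_event("gate open", GATE_LOGS)
print(gate_event["event"], gate_event["duration_seconds"], gate_event["log_count"])
```

The event answers "did the gate open, and how long did it take?" in one record, while the individual logs remain the place to look for the detail of each stage.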

Now that we’ve introduced the fundamental concepts of the technology, let’s talk about the customers who will use observability data.
