Data Observability for Data Engineering

By: Michele Pinto, Sammy El Khammal

Overview of this book

In the age of information, strategic management of data is critical to organizational success. The constant challenge lies in maintaining data accuracy and preventing data pipelines from breaking. Data Observability for Data Engineering is your definitive guide to implementing data observability successfully in your organization. This book unveils the power of data observability, a fusion of techniques and methods that allow you to monitor and validate the health of your data. You’ll see how it builds on data quality monitoring and understand its significance from the data engineering perspective. Once you're familiar with the techniques and elements of data observability, you'll get hands-on with a practical Python project to reinforce what you've learned. Toward the end of the book, you’ll apply your expertise to explore diverse use cases and experiment with projects to seamlessly implement data observability in your organization. Equipped with the mastery of data observability intricacies, you’ll be able to make your organization future-ready and resilient and never worry about the quality of your data pipelines again.
Table of Contents (17 chapters)

  • Part 1: Introduction to Data Observability
  • Part 2: Implementing Data Observability
  • Part 3: How to adopt Data Observability in your organization
  • Part 4: Appendix

Fundamentals of Data Quality Monitoring

Welcome to the exciting world of Data Observability for Data Engineering!

As you open the pages of this book, you will embark on a journey that will immerse you in data observability. The knowledge within this book is designed to equip you, as a data engineer, data architect, data product owner, or data engineering manager, with the skills and tools necessary to implement best practices in your data pipelines.

In this book, you will learn how data observability can help you build trust in your organization. Observability provides insights directly from within the process, offering a fresh approach to monitoring. It’s a method for determining whether the pipeline is functioning properly, especially in terms of adhering to its data quality standards.

Let’s get real for a moment. In our world, where we’re swimming in data, it’s easy to feel like we’re drowning. Data observability isn’t just some fancy term – it’s your life raft. Without it, you’re flying blind, making decisions based on guesswork. Who wants to be in that hot seat when data disasters strike? Not you.

This book isn’t just another item on your reading list; it’s the missing piece in your data puzzle. It’s about giving you the superpower to spot the small issues in your data before they turn into full-blown catastrophes. Think about the cost, not just in dollars, but in sleepless nights and lost trust, when data incidents occur. Scary, right?

But here’s the kicker: data observability isn’t just about avoiding nightmares; it’s about building a foundation of trust. When your data’s in check, your team can make bold, confident decisions without that nagging doubt. That’s priceless.

Data observability is not just a buzzword – we are deeply convinced it is the backbone of any resilient, efficient, and reliable data pipeline. This book will take you on a comprehensive exploration of the core principles of data observability, the techniques you can use to develop an observability approach, the challenges faced when implementing it, and the best practices being employed by industry leaders. This book will be your compass in the vast universe of data observability by providing you with various examples that allow you to bridge the gap between theory and practice.

The knowledge in this book is organized into four essential parts. In part one, we will lay the foundation by introducing the fundamentals of data quality monitoring and how data observability takes it to the next level. This crucial groundwork will ensure you understand the core concepts and will set the stage for the next topics.

In part two, we will move on to the practical aspects of implementing data observability. You will dive into various techniques and elements of observability and learn how to define rules on indicators. This part will provide you with the skills to apply data observability in your projects.

The third part will focus on adopting data observability at scale in your organization. You will discover the main benefits of data observability by learning how to conduct root cause analysis, how to optimize pipelines, and how to foster a culture change within your team. This part is essential to ensure the successful implementation of a data observability program.

Finally, the fourth part will contain additional resources focused on data engineering, such as a data observability checklist and a technical roadmap to implement it, leaving you with strong takeaways so that you can stand on your own two feet.

Let’s start with a hypothetical scenario. You are a data engineer, coming back from your holidays and ready to start the quarter. You have a lot of new projects for the year. However, the second you reach your desk, Lucy from the marketing team calls out to you: “Last month’s marketing report is totally wrong – please fix it ASAP. I need to update my presentation!”

This is annoying; all the work that’s been scheduled for the day is delayed, and you need to check the numbers. You open your Tableau dashboard and start a Zoom meeting with the marketing team. The first task of the day: understand what she means by “wrong.” Indeed, the turnover seems odd. It’s time for you to have a look at the SQL database feeding the dashboard. Again, you see the same issue. This is strange and will require even more investigation.

After hours of tedious manual checks, contacting three different teams, and sending 12 emails, you finally find the culprit: an ingestion script, feeding the company’s master database, was modified to express the turnover in thousands of dollars instead of units. Because the data team didn’t know that the metric was used by the marketing team, the change was never communicated, and the pipeline was fed with the wrong data.

It’s not the first time this has happened. Hours of productivity are ruined by firefighting data issues. It’s decided – you need to implement a new strategy to avoid this.
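
To make the scenario concrete, here is a minimal sketch of the kind of check that could have flagged the unit change before it reached the dashboard. It is only an illustration, not the approach developed later in this book: the function names, the 10x tolerance, and the turnover figures are all hypothetical, and it assumes the metric is strictly positive.

```python
from statistics import median


def scale_shift_ratio(history, new_batch):
    """Ratio between the typical value of the new batch and of the history."""
    baseline = median(history)
    if baseline == 0:
        raise ValueError("history median is zero; the ratio is undefined")
    return median(new_batch) / baseline


def looks_like_unit_change(history, new_batch, tolerance=10.0):
    """Flag the batch when its typical magnitude moved by `tolerance`x or more.

    The tolerance is an arbitrary, hypothetical threshold: a switch from
    dollars to thousands of dollars shifts values by roughly 1000x, far
    beyond normal month-to-month variation.
    """
    ratio = scale_shift_ratio(history, new_batch)
    return ratio >= tolerance or ratio <= 1 / tolerance


# Made-up monthly turnover values, expressed in dollars.
previous_months = [1_200_000, 1_150_000, 1_310_000, 1_250_000]
# The same metric after the ingestion script switched to thousands of dollars.
latest_batch = [1_280, 1_190, 1_305]

if looks_like_unit_change(previous_months, latest_batch):
    print("ALERT: turnover magnitude changed; check the unit used at ingestion")
```

A check like this compares medians rather than means so that a single outlier does not trigger it. It says nothing about who should be notified, which is where the stakeholder discussion in this chapter comes in.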

Observability is closely tied to the notion of data quality, which is often defined as a way of measuring data indicators. Data quality is one thing, but monitoring it is something else! In this chapter, we will explore the principles of data quality and see how they can guide you on your data observability journey. We will also see why the information bias between stakeholders is key to understanding the need for data quality and observability in the data pipeline.

Data quality comes from the need to ensure correct and sustainable data pipelines. We will look at the different stakeholders of a data pipeline and describe why they need data quality. We will also define data quality through several concepts, which will help you understand how a common base can be created between stakeholders.

By the end of this chapter, you will understand how data quality can be monitored and turned into metrics, preparing the ground for data observability.

In this chapter, we’ll cover the following topics:

  • Learning about the maturity path of data in companies
  • Identifying information bias in data
  • Exploring the seven dimensions of data quality
  • Turning data quality into SLAs
  • Indicators of data quality
  • Alerting on data quality issues