Data Observability for Data Engineering

By: Michele Pinto, Sammy El Khammal

Overview of this book

In the age of information, strategic management of data is critical to organizational success. The constant challenge lies in maintaining data accuracy and preventing data pipelines from breaking. Data Observability for Data Engineering is your definitive guide to implementing data observability successfully in your organization. This book unveils the power of data observability, a fusion of techniques and methods that allow you to monitor and validate the health of your data. You’ll see how it builds on data quality monitoring and understand its significance from the data engineering perspective. Once you're familiar with the techniques and elements of data observability, you'll get hands-on with a practical Python project to reinforce what you've learned. Toward the end of the book, you’ll apply your expertise to explore diverse use cases and experiment with projects to seamlessly implement data observability in your organization. Equipped with the mastery of data observability intricacies, you’ll be able to make your organization future-ready and resilient and never worry about the quality of your data pipelines again.
Table of Contents (17 chapters)
Part 1: Introduction to Data Observability
Part 2: Implementing Data Observability
Part 3: How to adopt Data Observability in your organization
Part 4: Appendix

Key components of data observability

In this section, we will look at examples of data observability metrics that are collected from inside applications, and at the data quality issues those metrics can reveal. We will focus on detecting issues, and to do so we will create visuals of data observability issues in a Jupyter notebook.

If you want to follow the example, you can find it in the Chapter2 section of the GitHub repository. The name of the notebook is Visualise_Observability_Issues.ipynb.

In this part, we will focus on three kinds of issues: timeliness, completeness, and accuracy.
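To make the three issue types concrete, here is a minimal sketch of how each one might be measured. It is not the code from the notebook: the four toy rows are invented for illustration, the helper names (`rows_per_day`, `missing_days`, `null_ratio`, `out_of_range`) are hypothetical, and only the field names `date`, `email`, and `page_visited` mirror the order data described in this section. It uses the Python standard library only.

```python
from collections import Counter
from datetime import date, timedelta

# Toy orders, invented for illustration (not the book's dataset).
orders = [
    {"date": date(2023, 1, 1), "email": "a@example.com", "page_visited": 4},
    {"date": date(2023, 1, 1), "email": None, "page_visited": 7},
    {"date": date(2023, 1, 2), "email": "b@example.com", "page_visited": -1},
    {"date": date(2023, 1, 4), "email": "c@example.com", "page_visited": 2},
]


def rows_per_day(rows):
    """Timeliness signal: number of orders received per calendar day."""
    return Counter(row["date"] for row in rows)


def missing_days(rows):
    """Days between the first and last order date that have no rows at all."""
    counts = rows_per_day(rows)
    first, last = min(counts), max(counts)
    span = (last - first).days + 1
    all_days = (first + timedelta(days=i) for i in range(span))
    return [day for day in all_days if day not in counts]


def null_ratio(rows, field):
    """Completeness signal: share of rows where `field` is missing."""
    return sum(row[field] is None for row in rows) / len(rows)


def out_of_range(rows, field, low):
    """Accuracy signal: rows whose `field` is below a plausible minimum."""
    return [row for row in rows if row[field] < low]


print("Days without data:", missing_days(orders))
print("Share of missing emails:", null_ratio(orders, "email"))
print("Implausible page counts:", out_of_range(orders, "page_visited", 0))
```

A day with no rows hints at a timeliness problem (data arrived late or not at all), a rising share of missing emails hints at a completeness problem, and a negative page count hints at an accuracy problem. In practice these values would be tracked over time and plotted, rather than checked once.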

The dataset that we provide is a basic example of marketing and sales data. The data represents the orders made on a web shop and consists of the following fields:

  • date: The date of the order
  • guid: A unique ID for the order
  • email: The email address linked to the order
  • page_visited: The number of pages the customer visited on the website
  • duration: How long the customer...