
Indicators of data quality

Once objectives have been defined, we need a way to assess their validity and follow up on the quality of data on a day-to-day basis.

To do so, indicators are must-haves: an SLO is useless without an indicator to measure it. These indicators are also called service-level indicators (SLIs). An indicator is a defined measure of data quality; it can be a metric, a state, or a key performance indicator (KPI). At this stage, data quality is activated and becomes data quality monitoring. The goal of an indicator is to assess whether your objectives are met. It is the producer’s responsibility to check that an indicator behaves as expected.

Depending on the objective, many indicators can be put in place. If a data application expects JSON as input, the format of the incoming data source becomes an important indicator. We will see techniques and methods for gathering these indicators in Chapter 3 and learn how a data model can be used to collect those elements in Chapter 4.

Data source metadata

We define metadata as data about data: the file location, the format, or even the owner of the dataset. These can be pertinent indicators for executing data applications. If a program expects a CSV file in order to be triggered but the format has changed upstream to JSON, the objectives can be jeopardized.
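As a minimal sketch, the check below collects a few metadata indicators for a file-based source before a pipeline is triggered; the path, the expected format, and the helper name are illustrative assumptions rather than part of the book’s project.

```python
# A minimal sketch of a metadata indicator, assuming the pipeline expects a
# local CSV file; the path and expected format are hypothetical.
from pathlib import Path

EXPECTED_SUFFIX = ".csv"

def check_source_metadata(path: str) -> dict:
    """Collect simple metadata indicators for a file-based data source."""
    source = Path(path)
    return {
        "exists": source.exists(),
        # Fails if the upstream format switched, for example, from CSV to JSON
        "format_ok": source.suffix.lower() == EXPECTED_SUFFIX,
        "size_bytes": source.stat().st_size if source.exists() else 0,
    }

indicators = check_source_metadata("data/transactions.csv")
if not indicators["format_ok"]:
    print("Metadata indicator failed: unexpected file format", indicators)
```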

Schema

The expected schema of a data source is an important indicator. An application or a pipeline can simply break if the input schema is not correct. The important characteristics of a schema are its field names and field types.

A wrong type or a deleted column can lead to missing values in the output, or to broken applications. Sometimes, the issue is even sneakier: a Python AI model may only require a feature matrix without column names. If two columns are interchanged, the model will apply its coefficients to the wrong values and produce misleading results.
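As a minimal sketch, the following check compares a DataFrame against an expected schema (field names, field types, and column order); the column names and dtypes are assumptions chosen for illustration.

```python
# A minimal schema indicator, assuming the data is loaded with pandas;
# the expected column names and dtypes are illustrative.
import pandas as pd

EXPECTED_SCHEMA = {"customer_id": "int64", "amount": "float64", "country": "object"}

def check_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of schema violations (missing columns, wrong types, wrong order)."""
    issues = []
    for position, (name, dtype) in enumerate(expected.items()):
        if name not in df.columns:
            issues.append(f"missing column: {name}")
        elif str(df[name].dtype) != dtype:
            issues.append(f"wrong type for {name}: {df[name].dtype} (expected {dtype})")
        elif list(df.columns).index(name) != position:
            issues.append(f"unexpected position for {name}")  # catches interchanged columns
    return issues

df = pd.DataFrame({"customer_id": [1, 2], "amount": [9.9, 5.0], "country": ["BE", "IT"]})
print(check_schema(df, EXPECTED_SCHEMA) or "schema indicator OK")
```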

Lineage

Lineage, or data mapping at the application level, allows you to express the data transformations needed in the process. This lineage gives an overview of how each data field is used. Monitoring it helps you detect, and therefore react to, any change in the code base of the application.

Data lineage describes the data flow inside the application. It is the best documentation about what’s happening inside each step of a pipeline. This lineage allows you to create a data usage catalog, where you can see who accessed the data, how it was used, and what was created out of it.

Thanks to lineage, you can manage the risk of modifying the application over time as you can easily understand which other applications, data sources, and users will be impacted.

Lineage is also a good indicator if you wish to know whether what happens inside the application is allowed or not. Imagine a data pipeline in a bank that grants loans to its customers. The project team may choose to rely on the automatic decision of an AI model. For ethical and GDPR reasons, you do not want any of these decisions to rely on the gender of the customer. However, the feature matrix can be a mix of scores computed from other data sources. Thanks to lineage, you can validate how the feature matrix was created and avoid any misuse of the gender data you have in the data lake.
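A minimal sketch of such a lineage-based check is shown below; the hand-written lineage mapping and the field names are hypothetical, as in practice the lineage would be collected by your observability tooling.

```python
# A minimal sketch of a lineage-based indicator. The lineage graph below is a
# hypothetical mapping of output fields to the source fields they were derived
# from; real lineage would be captured automatically by the observability layer.
LINEAGE = {
    "feature_matrix.score_income": ["transactions.amount", "customers.salary"],
    "feature_matrix.score_history": ["loans.repayment_delay"],
}

FORBIDDEN_SOURCES = {"customers.gender"}

def forbidden_usage(lineage: dict, forbidden: set) -> dict:
    """Return output fields that depend directly on forbidden source fields."""
    return {
        output: sorted(set(sources) & forbidden)
        for output, sources in lineage.items()
        if set(sources) & forbidden
    }

violations = forbidden_usage(LINEAGE, FORBIDDEN_SOURCES)
print(violations or "lineage indicator OK: no forbidden field used")
```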

Application

Information about the application itself is an often-neglected but important indicator. The code version (or tool version), as well as the timestamp of execution, can be valuable for detecting data quality issues. An application is a tool, a notebook, a script, or a piece of code that modifies the data. It is fueled by inputs and produces outputs.

An application running the wrong code version may use outdated data, leading to a timeliness issue. This is a perfect example of an indicator that serves a technical objective without a direct link to the agreement. The code version is often out of scope for the business team, even though it can have a significant impact on the outcome.
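As a minimal sketch, the snippet below captures two application-level indicators, the current code version and the execution timestamp, assuming the application runs inside a Git repository.

```python
# A minimal sketch capturing application-level indicators: the code version
# (here, the current Git commit) and the execution timestamp.
import subprocess
from datetime import datetime, timezone

def application_indicators() -> dict:
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"  # not running inside a Git repository
    return {
        "code_version": commit,
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }

print(application_indicators())
```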

Statistics and KPIs

Broadly speaking, there are many metrics – generic or not – that can become indicators in the context of data quality. Here, we draw a distinction between statistics and KPIs.

Statistics are a list of predefined metrics that you can compute on a dataset. KPIs are custom metrics that are often related to the accuracy of the dataset and can also relate to a combination of datasets. Let’s dive into some of the main statistics used as indicators (a short computation sketch follows the list):

  • Distribution: There are many ways to compute the distribution of a dataset feature. For numerical data, the minimum, maximum, mean, median, and other quantiles can be very good indicators. If the machine learning model is very sensitive to the distribution, skewness and kurtosis can also be considered to offer a better view of the shape of the data. For categorical data, the mode and frequency are valuable indicators.
  • Freshness: The freshness of a dataset is captured by several time-based metrics that tell you whether the data is up to date. It combines the frequency and the timeliness of the dataset:
    • Frequency: This metric tells us when the dataset was updated or used and allows us to check whether the data was refreshed on time. If the dataset is expected to be updated every day at noon and is late, data processes may produce wrong results if they’re launched before the new data rows are available.
    • Timeliness: This is a measure of the obsolescence of data. It can be computed by checking the timestamp of the data and the timestamp of the process using it. A reporting process could show no errors while the underlying data is outdated. If you want to be sure that the data you are using is last week’s data, you can use a timeliness indicator.
  • Completeness: Two indicators can help you check whether data is complete or not. The first one is the volume of data, while the second one is the missing values indicator:
    • Volume: This is an indicator of the number of rows being processed in the dataset. A sharp drop or increase in the volume of data can signal an upstream issue or a change in the data collection process. If you expect to target 10 million customers but the query returns only 5,000, you may be facing a volume issue.
    • Missing values: This is an indicator of the number or the percentage of null rows or null values in the dataset. A small percentage of missing values can be tolerated by the business team. However, for technical reasons, some machine learning models do not allow any missing values and will then return errors when running.
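Here is a minimal sketch that computes several of these statistics with pandas on an illustrative transactions DataFrame; the column names and sample values are assumptions.

```python
# A minimal sketch of distribution, freshness, and completeness indicators,
# computed with pandas on an illustrative transactions DataFrame.
import pandas as pd

df = pd.DataFrame({
    "amount": [10.5, 99.0, None, 42.0],
    "country": ["BE", "IT", "BE", "FR"],
    "event_time": pd.to_datetime(["2023-09-01", "2023-09-02", "2023-09-02", "2023-09-03"]),
})

indicators = {
    # Distribution (numerical and categorical)
    "amount_min": df["amount"].min(),
    "amount_max": df["amount"].max(),
    "amount_mean": df["amount"].mean(),
    "amount_median": df["amount"].median(),
    "country_mode": df["country"].mode()[0],
    # Freshness: age of the most recent data point
    "staleness_days": (pd.Timestamp.now() - df["event_time"].max()).days,
    # Completeness: volume and missing values
    "row_count": len(df),
    "missing_amount_pct": df["amount"].isna().mean() * 100,
}
print(indicators)
```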

KPIs are custom metrics that give an overview of the specific needs of the consumers. They can be business-centric or technically driven. These KPIs are tailored to the expectations of the consumers and help ensure the robustness of the pipeline.

If the objective is to provide accurate data for the quantity sold, you must set an indicator on that quantity; for instance, you could create a Boolean value that equals 1 if all the values are positive. This business-driven KPI will guarantee the accuracy of the data item.

A technical KPI can be, for instance, the difference between the number of rows in the inputs and the outputs. To ensure the data is complete, the result of a full join operation should contain at least as many rows as each of the input data sources. If the difference is negative, you can suspect an issue with the completeness of the data.
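A minimal sketch of these two KPIs, assuming pandas and illustrative DataFrames, could look as follows:

```python
# A minimal sketch of a business KPI (all quantities sold are positive) and a
# technical KPI (the joined output has at least as many rows as each input).
import pandas as pd

sales = pd.DataFrame({"quantity_sold": [3, 1, 7, 2]})
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
output = customers.merge(sales.assign(customer_id=[1, 2, 3, 3]), on="customer_id", how="outer")

# Business KPI: 1 if every quantity is positive, else 0
quantity_positive_kpi = int((sales["quantity_sold"] > 0).all())

# Technical KPI: a negative difference suggests a completeness issue
row_delta_kpi = len(output) - max(len(sales), len(customers))

print({"quantity_positive": quantity_positive_kpi, "row_delta": row_delta_kpi})
```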

A KPI can also be used to assess other dimensions of data quality. For instance, consistency often requires comparing data items from several sources across the enterprise filesystem. If the amount column has three decimal places in one Parquet partition but only one in another, a KPI can be computed that returns a score of 0 when the format of the data is not consistent.
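As a sketch, such a consistency KPI could be computed as follows, assuming the two partitions are already loaded as pandas DataFrames (a real check would read the Parquet files, for example with pd.read_parquet):

```python
# A minimal sketch of a consistency KPI comparing the decimal precision of the
# amount column across two partitions; the sample values are illustrative.
import pandas as pd

def decimal_places(series: pd.Series) -> int:
    """Maximum number of decimal places observed in a numeric column."""
    return int(series.dropna().astype(str).str.split(".").str[-1].str.len().max())

partition_a = pd.DataFrame({"amount": [10.123, 5.507]})
partition_b = pd.DataFrame({"amount": [7.1, 3.2]})

# Consistency KPI: 1 if both partitions use the same precision, else 0
consistency_kpi = int(
    decimal_places(partition_a["amount"]) == decimal_places(partition_b["amount"])
)
print({"amount_precision_consistent": consistency_kpi})
```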

Examples of SLAs, SLOs, and SLIs

Going back to our previous example, we defined the SLOs as follows:

  • The data needs to contain transactions from the last 2 weeks
  • The data needs to contain transactions of customers who are 18 and over

To define indicators for those SLOs, we can start by analyzing the dimensions we need to cover. In this example, the SLOs relate to the completeness, timeliness, and relevance of the data. Therefore, this is how you can set up SLIs (a combined sketch follows the list):

  1. Firstly, there is a need to send data promptly. A timeliness indicator can be introduced to make sure that, at the time of execution, you use data from the two latest weeks. You have to measure the difference between the latest data point and the current date; if the difference is within 14 days, you meet the SLI.
  2. Secondly, to make sure the data is complete, you can compute the percentage of missing values of the dataset you create. A threshold can be added for the acceptable percentage of missing values, if any.
  3. Thirdly, a KPI can be added to ensure the data only contains consumers who are 18 and over. For instance, it can be a volume indicator counting the rows where the age column is lower than 18. If this indicator remains at 0, you meet the SLI.
  4. Indicators can also be put on the schema to assure the consumer that the transaction amount is well defined and always a float number.
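The combined sketch below implements the four SLIs on an illustrative transactions DataFrame; the column names, thresholds, and reference date are assumptions.

```python
# A minimal sketch of the four SLIs: timeliness, completeness, age condition,
# and schema type of the transaction amount. Sample data is illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer_age": [25, 41, 19, 33],
    "amount": [12.5, 80.0, None, 45.9],
    "transaction_time": pd.to_datetime(["2023-09-20", "2023-09-25", "2023-09-27", "2023-09-28"]),
})
now = pd.Timestamp("2023-09-29")  # assumed execution date

slis = {
    # 1. Timeliness: latest data point within the last 14 days
    "timeliness_ok": (now - df["transaction_time"].max()).days <= 14,
    # 2. Completeness: share of missing values under an accepted threshold (here 5%)
    "completeness_ok": df["amount"].isna().mean() <= 0.05,
    # 3. Relevance: no customer younger than 18 in the dataset
    "age_ok": int((df["customer_age"] < 18).sum()) == 0,
    # 4. Schema: the transaction amount is always a float
    "schema_ok": str(df["amount"].dtype) == "float64",
}
print(slis)
```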

Now that you have defined proper indicators, it is high time you learned how to activate them. This is the essence of monitoring: being alerted about discovered issues. Configuring alerts to notify the relevant stakeholders whenever an SLI fails will help you detect and address issues promptly. Let’s see how this can be done with SLIs.