Driving Data Quality with Data Contracts

By: Andrew Jones

Overview of this book

Despite the passage of time and the evolution of technology and architecture, the challenges we face in building data platforms persist. Our data often remains unreliable, lacks trust, and fails to deliver the promised value. With Driving Data Quality with Data Contracts, you’ll discover the potential of data contracts to transform how you build your data platforms, finally overcoming these enduring problems. You’ll learn how establishing contracts as the interface allows you to explicitly assign responsibility and accountability of the data to those who know it best—the data generators—and give them the autonomy to generate and manage data as required. The book will show you how data contracts ensure that consumers get quality data with clearly defined expectations, enabling them to build on that data with confidence to deliver valuable analytics, performant ML models, and trusted data-driven products. By the end of this book, you’ll have gained a comprehensive understanding of how data contracts can revolutionize your organization’s data culture and provide a competitive advantage by unlocking the real value within your data.
Table of Contents (16 chapters)

Part 1: Why Data Contracts?
Part 2: Driving Data Culture Change with Data Contracts
Part 3: Designing and Implementing a Data Architecture Based on Data Contracts

The modern data stack

Amazon Redshift was the first cloud-native data warehouse and provided a real step-change in capabilities. It had the ability to store almost limitless data at a low cost in a SQL-compatible database, and the massively parallel processing (MPP) capabilities meant you could process that data effectively and efficiently at scale.

This sounds like what we had with Hadoop, but the key differences were the SQL compatibility and the more strongly defined structure of the data. This made it much more accessible than the unstructured files on an HDFS cluster. It also presented an opportunity to build services on top of Redshift and later SQL-compatible warehouses such as Google BigQuery and Snowflake, which led to an explosion of tools that make up today’s modern data stack. This includes ELT tools such as Fivetran and Stitch, data transformation tools such as dbt, and reverse ETL tools such as Hightouch.

These data warehouses evolved further to become what we now call a data lakehouse, which brings together the benefits of a modern data warehouse (SQL compatibility and high performance with MPP) with the benefits of a data lake (low cost, limitless storage, and support for different data types).

Into this data lakehouse went all the source data we ingested from our systems and third-party services, becoming our operational data store (ODS). From here, we could join and transform the data and make it available to our EDW, from where it was available for consumption. But the data warehouse was no longer a separate database – it was just a logically separate area of our data lakehouse, using the same technology. This reduced the effort and cost of the transforms and further increased the accessibility of the data.
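
As a minimal sketch of what this looks like in practice, the raw ingested data and the curated warehouse models can sit side by side in the same lakehouse, separated only by schema. The schema and table names here (raw, warehouse, raw.users) are hypothetical and will vary by platform:

-- The ODS and the EDW are just logically separate schemas in the same lakehouse.
CREATE SCHEMA IF NOT EXISTS raw;        -- landing area for ingested source data (the ODS)
CREATE SCHEMA IF NOT EXISTS warehouse;  -- curated, consumption-ready models (the EDW)

-- A warehouse table is built by transforming the raw data in place,
-- with no separate database and no data movement between systems.
CREATE TABLE warehouse.daily_signups AS
SELECT
  CAST(created_at AS DATE) AS signup_date,
  COUNT(*) AS signups
FROM raw.users
GROUP BY CAST(created_at AS DATE);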

The following diagram shows the reference architecture of the modern data stack, with the data lakehouse in the center:

Figure 1.3 – The modern data stack architecture

This architecture gives us more options for ingesting the source data. One of those is change data capture (CDC) tooling, for which we have open source implementations such as Debezium and commercial offerings such as Striim and Google Cloud Datastream, as well as in-depth write-ups on closed source solutions at organizations including Airbnb (https://medium.com/airbnb-engineering/capturing-data-evolution-in-a-service-oriented-architecture-72f7c643ee6f) and Netflix (https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b). CDC tools connect to the transactional databases of your upstream services and capture all the changes that happen to each of the tables (i.e., the INSERT, UPDATE, and DELETE statements run against the database). These changes are sent to the data lakehouse, and from there, you can recreate the database in the lakehouse with the same structure and the same data.
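
As an illustration, the following SQL sketch shows how the current state of one source table might be rebuilt from its change events once they have landed in the lakehouse. The table and column names (raw.customers_changes, op, changed_at) are hypothetical, and the exact shape of the change events depends on the CDC tool:

-- Rebuild the current state of the customers table from its CDC change log.
-- Each change event is assumed to carry the operation type and a change timestamp.
WITH latest_change AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id       -- the source table's primary key
      ORDER BY changed_at DESC       -- most recent change first
    ) AS change_rank
  FROM raw.customers_changes
)
SELECT
  customer_id,
  email,
  country
FROM latest_change
WHERE change_rank = 1        -- keep only the latest change per key
  AND op <> 'DELETE';        -- drop rows whose latest change was a delete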

However, this creates a tight coupling between the internal models of the upstream service and database and the data consumers. As that service naturally evolves over time, breaking changes will be made to those models. When these happen – often without any notice – they impact the CDC service and/or downstream data uses, leading to instability and unreliability. This makes it impossible to build on this data with any confidence.

The data is also not well structured for analytical queries and uses. It has been designed to meet the needs of the service and to be optimal for a transactional database, not a data lakehouse. It can take a lot of transformation and joining to turn this data into something that meets the requirements of your downstream users, which is time-consuming and expensive.
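
To make that concrete, here is a simplified sketch of the kind of join-heavy transformation often needed to turn normalized, service-shaped tables into a single analytics-friendly table. The tables and columns (raw.orders, raw.order_items, raw.customers) are hypothetical:

-- Denormalize several transactional tables into one wide table for analysis.
CREATE TABLE warehouse.order_facts AS
SELECT
  o.order_id,
  o.created_at                   AS order_created_at,
  c.country                      AS customer_country,
  COUNT(i.order_item_id)         AS item_count,
  SUM(i.quantity * i.unit_price) AS order_value
FROM raw.orders o
JOIN raw.customers c
  ON c.customer_id = o.customer_id
JOIN raw.order_items i
  ON i.order_id = o.order_id
WHERE o.status <> 'CANCELLED'    -- a business rule consumers have to discover and apply themselves
GROUP BY o.order_id, o.created_at, c.country;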

There is often little or no documentation for this data, and so to make use of it you need to have in-depth knowledge of those source systems and the way they model the data, including the history of how that has evolved over time. This typically comes from asking teams who work on that service or relying on institutional knowledge from colleagues who have worked with that data before. This makes it difficult to discover new or useful datasets, or for a new consumer to get started.

The root cause of all these problems is that this data was not built for consumption.

Many of these same problems apply to data ingested from a third-party service through an ELT tool such as Fivetran or Stitch. This is particularly true if you’re ingesting from a complex service such as Salesforce, which is highly customizable with custom objects and fields. The data is in a raw form that mimics the API of the third-party service, lacks documentation, and requires in-depth knowledge of the service to use. Like with CDC, it can still change without notice and requires a lot of transformation to produce something that meets your requirements.

One purported benefit of the modern data stack is that we now have more data available to us than ever before. However, a 2022 report from Seagate (https://www.seagate.com/gb/en/our-story/rethink-data/) found that 68% of the data available to organizations goes unused. We still have our dark data problem from the big data era.

The introduction of dbt and similar tools that run on a data lakehouse has made it easier than ever to process this data using just SQL – one of the most well-known and popular languages around. This should increase the accessibility of the data in the data lakehouse.
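
For example, a dbt model is simply a SELECT statement saved as a file, which dbt materializes as a table or view in the lakehouse, with the ref() function wiring models together. The model names referenced here (stg_customers, stg_orders) are hypothetical:

-- models/marts/customer_orders.sql
-- A minimal dbt-style model: plain SQL that dbt materializes in the lakehouse,
-- with ref() resolving dependencies on upstream staging models.
SELECT
  c.customer_id,
  c.country,
  COUNT(o.order_id) AS order_count,
  MIN(o.created_at) AS first_order_at
FROM {{ ref('stg_customers') }} AS c
LEFT JOIN {{ ref('stg_orders') }} AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.country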

However, due to the complexity of the transforms required to make use of this data and the domain knowledge you must build up, we still often end up with a central team of data engineers to build and maintain the hundreds, thousands, or even tens of thousands of models required to produce data that is ready for consumption by other data practitioners and users.

Note

We’ll return to our example for the final time to illustrate how different roles work together with this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that the data from that database is extracted in a raw form through a CDC service. Certainly, she doesn’t know why.

Ben is a data platform engineer who works on the CDC pipeline. He aims to extract as much of the data as possible into the data lakehouse. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.

Leah is an analytics engineer building dbt pipelines. She takes requirements from data analysts and builds datasets to meet those requirements. She struggles to find the data she wants and needs to learn a lot about the upstream services and their data models in order to produce what she hopes is the right data. These dbt pipelines now number in the thousands and no one has all the context required to debug them all. The pipelines break regularly, and those breakages often have a wide impact.

The BI analyst, Bukayo, takes this data and creates reports to support the business. These reports often break due to issues upstream. There are no expectations defined at any of these steps, and therefore no guarantees on the reliability or correctness of the data can be provided to those consuming Bukayo’s data.

The data generator, Vivianne, is far away from the data consumer, Bukayo, and there is no communication. Vivianne has no understanding or visibility of how the changes she makes affect key business processes.

While Bukayo and his peers can usually get the data they need prioritized by Leah and Ben, those who are not BI analysts and want data for other needs have access to the data in a structured form but lack the domain knowledge to use it effectively. They also lack the autonomy to get the data they need to meet their requirements.

So, despite the improvements in technology and architecture over three generations of data platforms, we still have that bottleneck of a central team with a long backlog of datasets to make available to the organization before we can start using that data to drive business value.

The following diagram shows the three generations side by side, with the same bottleneck highlighted in each:

Figure 1.4 – Comparing the three generations of data platform architectures

It’s that bottleneck that has led us to the state of today’s data platforms and the trouble many of us face when trying to generate business value from our data. In the next section, we’re going to discuss the problems we have when we build data platforms on this architecture.