Driving Data Quality with Data Contracts

By: Andrew Jones

Overview of this book

Despite the passage of time and the evolution of technology and architecture, the challenges we face in building data platforms persist. Our data often remains unreliable, is not trusted, and fails to deliver the promised value. With Driving Data Quality with Data Contracts, you’ll discover the potential of data contracts to transform how you build your data platforms, finally overcoming these enduring problems. You’ll learn how establishing contracts as the interface allows you to explicitly assign responsibility and accountability for the data to those who know it best, the data generators, and give them the autonomy to generate and manage data as required. The book will show you how data contracts ensure that consumers get quality data with clearly defined expectations, enabling them to build on that data with confidence to deliver valuable analytics, performant ML models, and trusted data-driven products. By the end of this book, you’ll have gained a comprehensive understanding of how data contracts can revolutionize your organization’s data culture and provide a competitive advantage by unlocking the real value within your data.

The enterprise data warehouse

We’ll start by looking at the data architecture that was prevalent in the late 1990s and early 2000s, which was centered around an enterprise data warehouse (EDW). As we discuss the architecture and its limitations, you’ll start to notice how many of those limitations continue to affect us today, despite over 20 years of advancement in tools and capabilities.

EDW is the collective term for a reporting and analytics solution. You’d typically engage with one or two big vendors who would provide these capabilities for you. It was expensive, and only larger companies could justify the investment.

The architecture was built around a large database in the center. This was likely an Oracle or MS SQL Server database, hosted on-premises (this was before the advent of cloud services). The extract, transform, and load (ETL) process was performed on data from source systems, or more accurately, the underlying database of those systems. That data could then be used to drive reporting and analytics.
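To make the shape of that process concrete, here is a minimal sketch of an ETL job. It is an illustration only, not how any particular EDW product worked: in-memory SQLite databases stand in for the source system and the warehouse, and the `orders` and `fact_orders` tables, their columns, and the function names are all hypothetical.

```python
import sqlite3


def extract(source):
    """Read rows straight from the source system's underlying table."""
    return source.execute("SELECT id, amount_pence FROM orders").fetchall()


def transform(rows):
    """Apply a light transform: convert pence to pounds."""
    return [(order_id, amount_pence / 100) for order_id, amount_pence in rows]


def load(warehouse, rows):
    """Write the transformed rows into the central warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?)", rows)
    warehouse.commit()


def run_etl(source, warehouse):
    """Run extract, transform, and load in sequence; return the row count."""
    rows = transform(extract(source))
    load(warehouse, rows)
    return len(rows)


# Hypothetical source system: a service database with an orders table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_pence INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1250), (2, 400)])

# Hypothetical warehouse: the single central database used for reporting.
warehouse = sqlite3.connect(":memory:")
loaded = run_etl(source, warehouse)
```

Note that `extract` queries the service's own table directly, which is why the ETL job and the upstream service compete for the same database.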

The following diagram shows the EDW architecture:

Figure 1.1 – The EDW architecture

Because this ETL ran against the database of the source system, reliability was a problem. It created a load on the database that could negatively impact the performance of the upstream service. That, and the limitations of the technology we were using at the time, meant we could do few transforms on the data.

We also had to update the ETL process as the database schema and the data evolved over time, relying on the data generators to let us know when that happened. Otherwise, the pipeline would fail.
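The coupling described above can be sketched as a check that compares the columns an ETL job depends on against what the source database currently exposes. This is a hypothetical illustration rather than anything from the EDW tooling of the time: SQLite stands in for the source database, and the `orders` table, the `EXPECTED_COLUMNS` mapping, and the function names are assumptions made for the example.

```python
import sqlite3

# Hypothetical: the columns (and declared types) the ETL job was written against.
EXPECTED_COLUMNS = {"id": "INTEGER", "amount_pence": "INTEGER"}


def current_columns(conn, table):
    """Return {column name: declared type} for a table in the source database."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2].upper() for row in conn.execute(f"PRAGMA table_info({table})")}


def schema_drift(expected, actual):
    """List the ways the source schema no longer matches what the ETL expects."""
    problems = []
    for name, col_type in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != col_type:
            problems.append(f"type changed: {name} {col_type} -> {actual[name]}")
    return problems


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_pence INTEGER)")
before = schema_drift(EXPECTED_COLUMNS, current_columns(source, "orders"))

# The data generator reshapes the table without telling the ETL developers.
source.execute("DROP TABLE orders")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
after = schema_drift(EXPECTED_COLUMNS, current_columns(source, "orders"))
```

Here `before` is empty and `after` reports the missing `amount_pence` column. Without a check like this, or advance notice from the data generator, the first sign of the change would be the extraction query failing.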

Those who owned the databases were somewhat aware of the ETL work and the business value it drove. There were few barriers between the data generators and consumers, and communication between them was good.

However, the major limitation of this architecture was the database used for the data warehouse. It was very expensive and, as it was deployed on-premises, was of a fixed size and hard to scale. That created a limit on how much data could be stored there and made available for analytics.

It became the responsibility of the ETL developers to decide what data should be available, depending on the business needs, and to build and maintain that ETL process by getting access to the source systems and their underlying databases.

And so, this is where the bottleneck was. The ETL developers had to control what data went in, and they were the only ones who could make data available in the warehouse. Data would only be made available if it met a strong business need, and that typically meant the only data in the warehouse was data that drove the company KPIs. If you wanted some data to do some analysis and it wasn’t already in there, you had to put a ticket in their backlog and hope for the best. If it did ever get prioritized, it was probably too late for what you wanted it for.

Note

Let’s use an example to illustrate how different roles worked together under this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She’s aware that some of the data from that database is extracted by a data analyst, Bukayo, and that it is used to drive top-level business KPIs.

Bukayo can’t do much transformation on the data, due to the limitations of the technology and the cost of infrastructure, so the reporting he produces is largely on the raw data.

There are no defined expectations between Vivianne and Bukayo, and Bukayo relies on Vivianne telling him in advance about any changes to the data or the schema.

The extraction is not reliable. The ETL process can affect the performance of the database, so it may be switched off when there is an incident. Schema and data changes are not always known in advance. The downstream database also has limited performance and cannot be easily scaled to deal with an increase in the data or usage.

Both Vivianne and Bukayo lack autonomy. Vivianne can’t change her database schema without getting approval from Bukayo. Bukayo can only get a subset of data, with little say over the format. Furthermore, any potential users downstream of Bukayo can only access the data he has extracted, severely limiting the accessibility of the organization’s data.

This won’t be the last time we see a bottleneck that prevents access to, and the use of, quality data. Let’s look now at the next generation of data architecture and the introduction of big data, which was made possible by the release of Apache Hadoop in 2006.
