
Driving Data Quality with Data Contracts

By: Andrew Jones

Overview of this book

Despite the passage of time and the evolution of technology and architecture, the challenges we face in building data platforms persist. Our data often remains unreliable, lacks trust, and fails to deliver the promised value. With Driving Data Quality with Data Contracts, you’ll discover the potential of data contracts to transform how you build your data platforms, finally overcoming these enduring problems. You’ll learn how establishing contracts as the interface allows you to explicitly assign responsibility and accountability of the data to those who know it best—the data generators—and give them the autonomy to generate and manage data as required. The book will show you how data contracts ensure that consumers get quality data with clearly defined expectations, enabling them to build on that data with confidence to deliver valuable analytics, performant ML models, and trusted data-driven products. By the end of this book, you’ll have gained a comprehensive understanding of how data contracts can revolutionize your organization’s data culture and provide a competitive advantage by unlocking the real value within your data.

The state of today’s data platforms

The limitations of today’s data architectures, and the data culture they reinforce, result in several problems that are felt almost universally by organizations trying to get value from their data. Let’s explore the following problems in turn and the impact they have:

  • The lack of expectations
  • The lack of reliability
  • The lack of autonomy

The lack of expectations

Users working with source data that has been ingested through an ELT or CDC tool can have very few expectations about what the data is, how it should be used, and how reliable it will be. They also don’t know exactly where this data comes from, who generated it, and how it might change in the future.

In the absence of explicitly defined expectations, users tend to make assumptions that are more optimistic than reality, particularly when it comes to the reliability and availability of the data. This only increases the impact when there is a breaking change in the upstream data, or when that data proves to be unreliable.

It also leads to the data not being used correctly. For example, there could be different tables and columns that relate to the various dimensions around how a customer is billed for their use of the company’s products, and this will evolve over time. The data consumer will need to know that in detail if they are to use this data to produce revenue numbers for the organization. They therefore need to gain in-depth knowledge of the service and the logic it uses so they can reimplement that in their ETL.

Successfully building applications and services on top of the data in our lakehouse would require the active transfer of this knowledge from the upstream data generators to the downstream data consumers, including the following:

  • The domain models the dataset describes
  • The change history of the dataset
  • The schemas and metadata

However, due to the distance between these groups, there is no feasible way to establish this exchange.
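To make the three items listed above concrete, here is a minimal sketch of what that knowledge might look like if it were written down explicitly. The class name `DatasetExpectations`, its fields, and the example values are all hypothetical illustrations, not a format defined in this book.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetExpectations:
    """Hypothetical record of what a consumer would need to know about a dataset."""

    name: str
    domain_model: str                      # what the dataset describes
    schema: dict[str, str]                 # column name -> type
    metadata: dict[str, str] = field(default_factory=dict)
    change_history: list[str] = field(default_factory=list)

    def record_change(self, note: str) -> None:
        """Append a human-readable note describing an upstream change."""
        self.change_history.append(note)


# Example values are invented for illustration only.
billing = DatasetExpectations(
    name="customer_billing",
    domain_model="How a customer is billed for their use of the company's products",
    schema={"customer_id": "int", "amount": "float", "currency": "str"},
)
billing.record_change("amount now stored in cents instead of dollars")
```

Without some shared record along these lines, each consumer has to rediscover this information independently.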

This lack of expectations, and no requirement to fulfill them, is also a problem for the data generators. Often, they don’t even know they are data generators, as they are just writing data to their internal models in their service’s database or managing a third-party service as best they can to meet their direct users’ requirements. They are completely unaware of the ELT/CDC processes running to extract their data and its importance to the rest of the organization. This makes it difficult to hold them responsible for the changes they make and their downstream impact, as it is completely invisible to them and often completely unexpected. So, the responsibility falls entirely on the data teams attempting to make use of this data.

This lack of responsibility is shown in the following diagram, which is the same one we saw in The modern data stack section earlier, but annotated with responsibility.

Figure 1.5 – Responsibility in the modern data stack


This diagram also illustrates another of the big problems with today’s data platforms, which is the complete lack of collaboration between the data generators and the data consumers. The data generators are far removed from the consumption points and have little to no idea of who is consuming their data, why they need the data, and the important business processes and outcomes that are driven by that data. On the other side, the data consumers don’t even know who is generating the data they depend on so much and have no say in what that data should look like in order to meet their requirements. They simply get the data they are given.

The lack of reliability

Many organizations have suffered from unreliable data pipelines for years. The cost can be significant, with a Gartner survey (https://www.gartner.com/smarterwithgartner/how-to-stop-data-quality-undermining-your-business) suggesting that these issues cost companies millions of dollars a year.

There are many reasons for this unreliability. It could be the lack of quality of the data when ingested, or how the quality of that data has degraded over time as it becomes stale. Or the data could be late or incomplete.

The root cause of so many of these reliability problems is that we are building on data that was not made for consumption.

As mentioned earlier, data being ingested through ELT and CDC tools can change at any time, without warning. These could be schema changes, which typically cause the downstream pipelines to fail loudly with no new data being ingested or populated until the issue has been resolved. It could also be a change to the data itself, or the logic required to use that data correctly. These are often silent failures and may not be automatically detected. The first time we might hear about the issue is when a user brings up some data, maybe as part of a presentation or a meeting, and notices it doesn’t look quite right or looks different to how it did yesterday.
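The difference between a loud schema failure and a silent data change can be sketched in a few lines of Python. The column names, the `load_row` and `total_revenue` helpers, and the dollars-to-cents scenario are all assumptions made up for this illustration.

```python
# Hypothetical columns the pipeline expects from the upstream table.
EXPECTED_COLUMNS = {"customer_id", "amount", "currency"}


def load_row(row: dict) -> dict:
    """Fail loudly if the upstream schema no longer matches what we expect."""
    missing = EXPECTED_COLUMNS - set(row)
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    return {name: row[name] for name in EXPECTED_COLUMNS}


def total_revenue(rows: list) -> float:
    """Sum the amount column, assuming it holds whole currency units."""
    return sum(load_row(r)["amount"] for r in rows)


# Upstream originally writes amounts in dollars.
before = [{"customer_id": 1, "amount": 10.0, "currency": "USD"}]
# Upstream later switches to cents; the schema is unchanged.
after = [{"customer_id": 1, "amount": 1000, "currency": "USD"}]
# Upstream renames the column; the schema is now different.
renamed = [{"customer_id": 1, "amount_cents": 1000, "currency": "USD"}]
```

In this sketch, `total_revenue(renamed)` raises a `KeyError` and stops the pipeline, whereas `total_revenue(after)` runs without error and returns a figure one hundred times too large. Nothing in the code can detect the second case, because the assumption about units was never written down anywhere the pipeline could check.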

Often, these changes can’t be fixed in the source system. They were made for a good reason and have already been deployed to production. That leaves the data pipeline authors to implement a fix within the pipeline, which in the best case is just pointing to another column but more likely ends up being yet another CASE statement with logic to handle the change, or another IFNULL statement, or IF DATE < x THEN do this ELSE do that. This builds and builds over time, creating ever more complex and brittle data pipelines, and further increasing their unreliability.
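As a rough illustration of how these fixes pile up, the following Python sketch mirrors the three kinds of patch mentioned above. The function `normalise_charge`, the cut-over date, and the plan names are hypothetical and exist only for this example.

```python
from datetime import date
from typing import Optional, Tuple

# Hypothetical date on which upstream began storing amounts in cents.
CENTS_CUTOVER = date(2023, 6, 1)
# Hypothetical rename of a plan in the source system.
PLAN_RENAMES = {"legacy_pro": "pro"}


def normalise_charge(
    amount: Optional[float], plan: str, created: date
) -> Tuple[str, float]:
    """Apply the accumulated workarounds for past upstream changes."""
    # Patch 1 (IFNULL-style): upstream started sending nulls.
    if amount is None:
        amount = 0.0
    # Patch 2 (IF DATE < x THEN ... ELSE ...): units changed at the cut-over.
    if created < CENTS_CUTOVER:
        amount = float(amount)      # already in dollars
    else:
        amount = amount / 100       # convert cents to dollars
    # Patch 3 (CASE-style): map renamed values back to one name.
    plan = PLAN_RENAMES.get(plan, plan)
    return plan, amount
```

Each branch encodes a piece of upstream history that the pipeline author had to discover after the fact, and every further upstream change adds another branch.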

All the while, we’re increasing the number of applications built on this data and adding more and more complexity to these pipelines, which again further increases the unreliability.

The cost of these reliability issues is that users lose trust in the data, and once that trust is lost it’s very hard to win back.

The lack of autonomy

For decades, we’ve been creating our data platforms with a bottleneck in the middle. That bottleneck is a team, typically a central data engineering or BI engineering team, whose members are the only ones with the ability and the time to attempt to make use of the raw source data, with everyone else consuming the data they produce.

Anyone wanting to have data made available to them will be waiting for that central team to prioritize that ask, with their ticket sitting in a backlog. These central teams will never have the capacity to keep up with these requests and instead can only focus on those deemed the highest priority, which are typically those data sources that drive the company KPIs and other top-level metrics.

That’s not to say the rest of the data does not have value! As we’ll discuss in the following section, it does, and there will be plenty of ways that data could be used to drive decisions or improve data-driven products across the organization. But this data is simply not accessible enough to the people who could make use of it, and therefore sits unused.

To empower a truly data-driven organization, we need to move away from the dependence on a central and limited data engineering team to an architecture that promotes autonomy. That would open this unused, dark data up to uses that will never be important enough for a central team to prioritize, but that, when added up, provide a lot of business value to the organization and support new applications that could be critical for its success.

This isn’t a technical limitation. Modern data lakehouses can be queried by anyone who knows SQL, and any data available in the lakehouse can be made available to any reporting tool for use by less technical users. It’s a limitation of the way we have chosen to ingest data through ELT, the lack of quality of that data, and the data culture that this embodies.

As we’ll discuss in the next section, organizations are looking to gain a competitive advantage with the ever-increasing use of data in more and more business-critical applications. These limitations in our data architecture are no longer acceptable.