Practical Data Quality

By: Robert Hawker

Overview of this book

Poor data quality can lead to increased costs, hinder revenue growth, compromise decision-making, and introduce risk into organizations. This leads to employees, customers, and suppliers finding every interaction with the organization frustrating. Practical Data Quality provides a comprehensive view of managing data quality within your organization, covering everything from business cases through to embedding improvements that you make to the organization permanently. Each chapter explains a key element of data quality management, from linking strategy and data together to profiling and designing business rules which reveal bad data. The book outlines a suite of tried-and-tested reports that highlight bad data and allow you to develop a plan to make corrections. Throughout the book, you’ll work with real-world examples and utilize re-usable templates to accelerate your initiatives. By the end of this book, you’ll have gained a clear understanding of every stage of a data quality initiative and be able to drive tangible results for your organization at pace.
Table of Contents (16 chapters)
Part 1 – Getting Started
Part 2 – Understanding and Monitoring the Data That Matters
Part 3 – Improving Data Quality for the Long Term

The value of this book

I realize it is never easy to find the time to read a book like this one. There are countless business books that promise to improve your performance and that of your organization, and most people have started several without ever finishing them.

So, why invest your valuable time in this one? I hope to help you understand which of your data is bad, which of that bad data actually matters, and how to get that data quickly from bad to good enough – and keep it there. This is the meaning of Practical Data Quality.

The approach outlined in this book helped take an organization from a point where its data was so poor that it was literally struggling to keep the lights on in its premises – it could not get payments to its utility providers and very nearly had its power supply suspended – to a point where data quality was considered a strength.

Progress was rapid. In just weeks, data quality improvements were made for the highest-priority issues. Within 6 months, an automated data quality tool was in place to identify data that did not meet business needs, and processes were in place to correct that data. After two years, data quality was fully embedded in organizational processes, with new employees trained on the topic and data quality scores close to 100% of their targets. If you follow the approach in this book and have the right support from your organization, you should be able to achieve similar results.

Importance of executive support

I firmly believe that the approach in this book is the right one. However, even the right approach can fail without the right support from executives.

In the example organization, the support that was required was relatively easy to obtain. The situation had been so bad that the leadership team could see that data quality was a major issue that was affecting revenue, costs, and compliance matters – the three topic areas that typically capture the interest of executive boards.

The data quality team was asked to report on data quality to the board monthly, and every time a concern or blocker was raised, actions were immediately defined to remove it.

In most organizations, data quality issues are not so severe that their impact is plain to see right up to the executive level. The issues are well known to those on the front line of the business, but people work hard to smooth the rough edges of the data before it reaches the executives. Processes and compliance activities are impacted, but not severely enough to cause a complete breakdown that executives will become aware of. Business and IT executives often have different priorities and speak different languages when talking about data, and data teams must often bridge these divides.

The following chapters will outline an approach that will help you surface these issues in a way that will influence executives to support data quality initiatives.

The remainder of this chapter will cover the following main topics:

  • What is bad data?
  • The impacts of bad data quality
  • The typical causes of bad data quality

What is bad data?

The first topic is about defining what is meant by bad data. It rarely makes sense to aim for what people might consider perfect data (every record is complete, accurate, and up to date). The investment required is usually prohibitive, and the gains from the last 1% of data quality improvement effort are far too marginal.

Detailed definition of bad data

What do I mean by bad data?

In summary, this is the point where the data no longer supports the objectives of the business. To drill into this in more detail, it is where the following occurs:

  • Data issues prevent business processes from being completed in the following ways:
    • On-time (for example, within service-level agreements (SLAs))
    • Within budget (for example, the headcount budget has to be exceeded to keep up with agreed time constraints)
    • With appropriate outcomes (for example, products delivered on time)
  • Data issues mean key information is not available to support business decisions at the time it is required. This can be because of the following challenges:
    • Missing or delayed information (for example, selecting products to discontinue based on profit margins, but no margins are available for key products in reporting)
    • Incorrect information (for example, competitor margin is presumed to be X% but is 5% lower than this presumption in reality, due to an error in data aggregation)
  • Data issues cause a compliance risk – this can be where the following occurs:
    • Data that must be provided to a regulator is unavailable, incomplete, incorrect, or delayed beyond a regulatory deadline
    • Data is not retained in accordance with privacy laws – such as the General Data Protection Regulation (GDPR) in the EU
  • Data does not allow the business to differentiate itself from its competitors where data is sold as a product (for example, a database of customer data) or as part of a differentiated customer experience.

Data that contributes to any of these types of issues to the point that business objectives cannot be met would be considered bad by this definition.

The level of data quality is rarely consistent across business units and locations within a company. There are usually pockets of excellence and areas where data has become a major problem. Often, the overall progress of a business toward its objectives can be seriously impacted by significant failures within just one business unit or location.

One organization I worked with had a strongly differentiated product that was achieved through great R&D and thoughtful acquisition activity. The R&D team carefully managed their data and kept the quality high enough to achieve their business objectives. The Operations team was less mature in their management of data, but their data quality issues were not severe enough to prevent them from meeting their main objectives. They still managed to produce enough of their differentiated product for the organization to predict extremely high sales growth. However, the Commercial team had inherited low-quality customer master data (heavily duplicated, incorrect, or missing shipping details primarily) from an acquisition, and some of the possible sales growth was not achieved. As part of a customer experience review, a major customer commented, “you can have the best products in the marketplace, but if it becomes hard to do business with you, it doesn’t matter.”

Bad data versus perfect data

We have already mentioned that the investment required to reach perfect data rarely makes economic sense. Having bad data does not make economic sense either. So, how should organizations decide what standard of data is fit for purpose?

The answer is complex and will be covered in more depth in Chapter 6, in the Key features of data quality rules section, but in summary, you must define a threshold at which you deem the data to be fit for purpose. This is the point where the data allows you to achieve your business objectives.

The trick is to make sure that the thresholds you define are highly specific. For example, most people would consider a tax ID for a supplier to be a mandatory element of data. It is tempting to target a data quality score of 100% (in other words, every row of data is perfect) for data like this, but in reality, thinking must be much more nuanced.

In many countries, small organizations will not have a tax ID. In the UK, for example, it is optional to register for VAT until company revenue reaches £85,000 (as of 2022). This means that the field in a system that contains this data cannot be made mandatory when collecting the data. A data quality threshold has to be set at which data will be considered fit for purpose.


To manage this truly effectively, you would segregate the vendors into large enterprises and smaller organizations. You would set a high threshold (for example, 95%) for large enterprises, and a much lower threshold (for example, 60%) for smaller organizations.

To get this rule perfect, you might even try to capture (or import from a source such as the Dun and Bradstreet database) a supplier's average annual revenue for the past 3 years when adding them to your system. You would then apply the high threshold to those with revenue over the tax registration level. This would be a time-consuming rule to create and manage because you would need to capture a lot of additional data, and the thresholds would change over time. This is where judgment comes in when defining data quality rules – is the benefit you gain from making the rule more specific worth the effort of obtaining and maintaining the information you need?
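To make the segmented-threshold idea concrete, here is a minimal sketch in Python. The vendor records, names, and revenue figures are invented for illustration; only the 95%/60% thresholds and the £85,000 registration level come from the example above.

```python
# Sketch of a segmented data quality rule for supplier tax IDs.
# Vendor records below are illustrative; thresholds (95% large / 60% small)
# and the GBP 85,000 registration level follow the example in the text.

VAT_REGISTRATION_THRESHOLD = 85_000  # UK VAT registration threshold (GBP, as of 2022)

SEGMENT_THRESHOLDS = {"large": 0.95, "small": 0.60}

vendors = [
    {"name": "Acme Ltd",    "annual_revenue": 2_500_000, "tax_id": "GB123456789"},
    {"name": "Globex",      "annual_revenue":   400_000, "tax_id": None},
    {"name": "Tiny Crafts", "annual_revenue":    40_000, "tax_id": None},
    {"name": "Initech",     "annual_revenue":    60_000, "tax_id": "GB987654321"},
    {"name": "Hooli Arts",  "annual_revenue":    70_000, "tax_id": "GB555666777"},
]

def segment(vendor):
    """Classify a vendor by revenue relative to the registration threshold."""
    if vendor["annual_revenue"] >= VAT_REGISTRATION_THRESHOLD:
        return "large"
    return "small"

def tax_id_quality(vendors, thresholds=SEGMENT_THRESHOLDS):
    """Per-segment tax ID completeness, compared against segment thresholds."""
    results = {}
    for seg, threshold in thresholds.items():
        group = [v for v in vendors if segment(v) == seg]
        if not group:
            continue
        complete = sum(1 for v in group if v["tax_id"])
        score = complete / len(group)
        results[seg] = {"score": score, "fit_for_purpose": score >= threshold}
    return results

print(tax_id_quality(vendors))
```

With this sample data, the large-enterprise segment fails its 95% target while the small-organization segment clears its 60% target, so only the segment where a missing tax ID is genuinely a problem gets flagged for correction.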

If you are not specific enough with your targets, data may be flagged as bad inappropriately. When tasked with correcting it, your colleagues will notice these false positives and lose faith in the data quality reporting you are providing them with. In this example, a supplier is chased for a tax ID, only for the team to find it does not have one. These false positives are damaging because the people involved in your data quality initiative start to feel they can ignore the reported failures – it is the classic “boy who cried wolf” tale in a data quality context.

Now that we have introduced the basics of bad data, let's understand how this bad data can impact an organization.