Cleaning Data for Effective Data Science

By David Mertz

Overview of this book

Data cleaning is the all-important first step to successful data science, data analysis, and machine learning. If you work with any kind of data, this book is your go-to resource, arming you with the insights and heuristics experienced data scientists had to learn the hard way. In a light-hearted and engaging exploration of different tools, techniques, and datasets, both real and fictitious, Python veteran David Mertz teaches you the ins and outs of data preparation and the essential questions you should be asking of every piece of data you work with. Using a mixture of Python, R, and common command-line tools, Cleaning Data for Effective Data Science follows the data cleaning pipeline from start to end, focusing on helping you understand the principles underlying each step of the process. You'll look at data ingestion of a vast range of tabular, hierarchical, and other data formats, impute missing values, detect unreliable data and statistical anomalies, and generate synthetic features. The long-form exercises at the end of each chapter let you get hands-on with the skills you've acquired along the way, while also providing a valuable resource for academic courses.
Table of Contents (8 chapters)

  • Preface
  • PART II: The Vicissitudes of Error
  • PART IV: Ancillary Matters
  • Why subscribe?
  • Other Books You May Enjoy
  • Index

Preface

In order for something to become clean, something else must become dirty.

–Imbesi’s Law of the Conservation of Filth

Doing the Other 80% of the Work

It is something of a truism in data science, data analysis, or machine learning that most of the effort needed to achieve your actual purpose lies in cleaning your data. The subtitle of this work alludes to a commonly assigned percentage. A keynote speaker I listened to at a data science conference a few years ago made a joke—perhaps one already widely repeated by the time he told it—about talking with a colleague of his. The colleague complained of data cleaning taking up half of her time, in response to which the speaker expressed astonishment that it could be so little as 50%.

Without worrying too much about assigning a precise percentage, in my experience working as a technologist and data scientist, I have found that the bulk of what I do is preparing my data for the statistical analyses, machine learning models, or nuanced visualizations that I would like to utilize it for. Although hopeful executives, or technical managers a bit removed from the daily work, tend to have an eternal optimism that the next set of data the organization acquires will be clean and easy to work with, I have yet to find that to be true in my concrete experience.

Certainly, some data is better and some is worse. But all data is dirty, at least within a very small margin of error in the tally. Even datasets that have been published, carefully studied, and widely distributed as canonical examples for statistics textbooks or software libraries generally have a moderate number of data integrity problems. Even after our best pre-processing, the more attainable goal is to make our data less dirty; making it clean remains an unduly utopian aspiration.

By all means we should distinguish data quality from data utility. These descriptions are roughly orthogonal to each other. Data can be dirty (up to a point) but still be enormously useful. Data can be (relatively) clean but have little purpose, or at least not be fit for purpose. Concerns about the choice of measurements to collect, about possible selection bias, or about other methodological or scientific questions are mostly outside the scope of this book. A fair number of the techniques I present can aid in evaluating the utility of data, but there is often no mechanical method of remedying systemic issues. For example, statistics and other analyses may reveal—or at least strongly suggest—the unreliability of a certain data field. But the techniques in this book cannot generally fix that unreliable data automatically or collect better data.

The code shown throughout this book is freely available. However, the purpose of this book is not learning to use the particular tools used for illustration, but to understand the underlying purpose of data quality. The concepts presented should be applicable in any programming language used for data processing and machine learning. I hope it will be easy to adapt the techniques I show to your own favorite collection of tools and programming languages.

Types of Grime

There are roughly two families of problems we find in datasets. Not every problem neatly divides into these families, or at least it is not always evident which side something falls on without knowing the root cause. But in a general way, we can think of structural problems in the formatting of data versus content problems in the actual values recorded. On the structural branch, a format used to encode a dataset might simply “put values in the wrong place” in one way or another. On the content side, the data format itself is correct, but implausible or wrong values have snuck in via flawed instruments, transcription errors, numeric overflows, or other pitfalls of the recording process.

The several early chapters that discuss “data ingestion” are much more focused on structural problems in data sources, and less on numeric or content problems. It is not always cleanly possible to separate these issues, but as a question of emphasis it makes sense for the ingestion chapters to look at structural matters, and for later chapters on anomalies, data quality, feature engineering, value imputation, and model-based cleaning to direct attention to content issues.

In the case of structural problems, we almost always need manual remediation of the data. Exactly where the bytes that make up the data go wrong can vary enormously, and usually does not follow a pattern that lends itself to a single high-level description. Often we have a somewhat easier time with the content problems, but at the same time they are more likely to be irremediable even with manual work.

Consider this small comma-separated value (CSV) data source, describing a 6th grade class:

Student#,Last Name,First Name,Favorite Color,Age
1,Johnson,Mia,periwinkle,12
2,Lopez,Liam,blue,green,13
3,Lee,Isabella,,11
4,Fisher,Mason,gray,-1
5,Gupta,Olivia,9,102
6,,Robinson,,Sophia,,blue,,12

In a friendly way, we have a header line that indicates reasonable field names and provides a hint as to the meaning of each column. Programmatically, we may not wish to work with the punctuation marks and spaces inside some field names, but that is a matter of tool convenience that we can address with the APIs (application programming interfaces; the functions and methods of a library) that data processing tools give us (perhaps by renaming them).

Let us think about each record in turn. Mia Johnson, student 1, seems to have a problem-free record. Her row has five values separated by four commas, and each data value meets our intuitive expectations about the data type and value domain. The problems start hereafter.

Liam Lopez has too many fields in his row. However, both columns 4 and 5 seem clearly to be in the lexicon of color names. Perhaps a duplicate entry occurred or the compound color “blue-green” was intended. Structurally the row has issues, but several plausible remediations suggest themselves.

Isabella Lee is perhaps no problem at all. One of her fields is empty, meaning no favorite color is available. But structurally, this row is perfectly fine for CSV format. We will need to use some domain or problem knowledge to decide how to handle the missing value.

Mason Fisher is perhaps similar to Isabella. The recorded age of -1 makes no sense in the nature of “age” as a data field, at least as we usually understand it (but maybe the encoding intends something different). On the other hand, -1 is one of several placeholder values used very commonly to represent missing data. We need to know our specific problem to know whether we can process the data with a missing age, but many times we can handle that. However, we still need to be careful not to treat the -1 as a plain value; for example, the mean, minimum, or standard deviation of ages might be thrown off by that.
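
As a minimal illustration of that pitfall, here is a sketch in Pandas taking the ages from the example rows above; treating the -1 sentinel as an ordinary number distorts the summary statistics, while converting it to a proper missing value first does not:

import numpy as np
import pandas as pd

# Ages from the example rows above, with -1 as a missing-data placeholder
ages = pd.Series([12, 13, 11, -1, 102, 12], name="Age")
print(ages.mean())   # 24.83..., dragged down by the -1 sentinel
print(ages.min())    # -1, which is not a real age

# Convert the sentinel to a true missing value before aggregating
ages = ages.replace(-1, np.nan)
print(ages.mean())   # 30.0, computed over the five remaining values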

Olivia Gupta starts to present a trickier problem. Structurally her row looks perfect. But “9” is probably not a string in our lexicon of color names. And under our understanding of the data concerning a 6th grade class, we don’t expect 102 year old students to be in it. To solve this row, we really need to know more about the collection procedure and the intention of the data. Perhaps a separate mapping of numbers to colors exists somewhere. Perhaps an age of 12 was mistranscribed as 102; but also perhaps a 102 year old serves as a teaching assistant in this class and not only students are recorded.

Sophia Robinson returns us to what looks like an obvious structural error. The row, upon visual inspection, contains perfectly good and plausible values, but they are separated by duplicate commas. Somehow, presumably, a mechanical error resulted in the line being formatted wrongly. However, most high-level tools are likely to choke on the row in an uninformative way, and we will probably need to remediate the issue more manually.

We have a pretty good idea what to do with these six rows of data, and even re-entering them from scratch would not be difficult. If we had a million rows instead, the difficulty would grow greatly, and would require considerable effort before we arrived at usable data.
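
A first, purely structural pass can at least locate the rows that need attention. The following minimal sketch (using only the Python standard library, and assuming the data lives in data/students.csv as in later examples) flags rows whose field counts do not match the header, without attempting any content-level judgment:

import csv

with open("data/students.csv", newline="") as fh:
    reader = csv.reader(fh)
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            # The Lopez and Robinson lines are flagged here; content problems
            # such as the age -1 or the color "9" need separate, domain-driven rules
            print(f"Line {lineno}: expected {len(header)} fields, got {len(row)}")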

Nomenclature

In this book I will use the terms feature, field, measurement, column, and occasionally variable more-or-less interchangeably. Likewise, the terms row, record, observation, and sample are also near synonyms. Tuple is used for the same concept when discussing databases (especially in academic contexts). Different academic and business fields give prominence to different ones of these terms, and likewise different software tools choose among them.

Conceptually, most data can be thought of as a number of occasions on which we measure various attributes of a common underlying thing. In most tools, it is usually convenient to put each of these observations/samples in its own row, and correspondingly to store each of the measurements/features/fields in its own column, with that column holding the corresponding data for all the comparable things.
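
As a tiny illustration of this arrangement, here is a sketch in Pandas with invented measurements; each row below is one observation/sample and each column one feature/measurement:

import pandas as pd

# Three observations (rows), each measured on two features (columns)
observations = pd.DataFrame({
    "height_cm": [151.2, 148.9, 155.0],
    "age":       [12, 11, 13],
})
print(observations)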

Inasmuch as I vary my use of these roughly equivalent terms, it is simply to fit the domain under discussion and to familiarize readers with all of the terms, since they are likely to encounter each of them in various places used with a similar intent. The choice among near synonyms is also guided by the predominant usage within the particular tool, library, or programming community currently being discussed.

In many cases, a general concept has a strong overlap with the particular name a tool or library uses to implement or express that concept. Where relevant, I attempt to use the small typographic distinctions in the names to indicate focus. For example, I discuss data frames as a general paradigm for manipulating data, but refer to DataFrame when discussing Pandas or other libraries that use that spelling for the specific class used. Likewise, R’s data.frame object is a specific implementation of the paradigm, and capitalization and punctuation will be adjusted for context.

Typography

As with most programming books, code literals will be set in a fixed width font, whether as excerpts inline or as blocks of code between paragraphs. For example, a code snippet, often a name, will appear as sklearn.pipeline.Pipeline. As a block, it would appear as:

scaler = sklearn.preprocessing.RobustScaler()
scaler.fit(X_train)
X_scaled = scaler.transform(X_train)

Input and output within a shell will be displayed like this:

sqlite> CREATE TABLE mytable(a SMALLINT, b VARCHAR(10), c REAL);
sqlite> INSERT INTO mytable(a, b, c) VALUES('123', 456, 789);

Names of software libraries, tools, and terms that are used in a special or distinctive sense within data science are shown with a dotted underline if they’re defined in the Glossary. If not, these terms will be shown in boldface on first, or early, mention, but generally in the default typeface as common nouns elsewhere. Italics are used in places in the main text simply for emphasis of words or clauses in prose. In electronic versions of this book, underline will show that there is an embedded link to an external resource.

The names of software tools and libraries present a bit of a challenge for orthography (i.e. spelling). Capitalization, or lack thereof, is often used in a stylized way, and moreover these bits of software are sometimes rendered differently in different contexts. For example, Python is a good proper name for the programming language, but the actual executable that launches a Python script is python, in lowercase. Tools or libraries that will usually be typed in literal form, at a command line or as a name in code, will be set in fixed width.

Still other tools have both an informal and a literal name. For example, scikit-learn is stylized in lowercase, but that is not the actual imported name of the library, which is sklearn. Moreover, the informal name would look out of place when referring to subpackages such as sklearn.preprocessing.

In general, the names of software libraries are actually pretty intuitive, but the Glossary lists the name variants used in slightly different contexts in this book.

Aside

Sometimes, additional information or commentary is presented in asides that look like this, with superscripts in the text to mark their intended contexts.

Other times, tips, rules of thumb, and other things to remember look like this.

Taxonomy

Throughout this book, but especially in the first few chapters, I mention a large number of software tools and libraries that you might encounter in your work as a data scientist, developer, data analyst, or under another job title. The code examples in this book use only a relatively small fraction of those tools: mostly Python and R, plus a few libraries for those languages.

There are a much larger number of tools which you are fairly likely to encounter, and to need to use during your work. While this book does not specifically attempt to document the tools themselves, not even those tools that occur in many examples, I think it is valuable for readers to understand the general role of tools they may require in their specific tasks. When mentioning tools, I try to provide a general conceptual framework for what kind of thing that tool is, and point in the direction of the section or chapter that discusses purposes and tools most similar to it. You most certainly do not need to be familiar with any large number of the tools mentioned—potentially with none of them at all, not even the main programming languages used in examples.

The main lesson is “Don’t Panic!”, as Douglas Adams famously admonishes. You do not need to learn any specific tool discussed here, but neither is any of them something you cannot learn when you need or wish to. The Glossary also provides brief comments on and definitions of the terms and names used throughout this book.

Included Code

In this book, I will primarily use Python and associated tools, such as Pandas, sklearn.preprocessing, and scipy.stats, to solve the data cleaning problems presented. R, and its Tidyverse tools, will often be shown as code alternatives. Some code samples will simply use Bash and the many text/data command-line processing tools available. Examples from other programming languages are occasionally mentioned, where relevant.

Quite a few additional libraries and tools are mentioned throughout this text, either only to introduce them briefly or even only to indicate they exist. Depending on your specific workplace, codebase, and colleagues, you may need to use some or all of these, even if they are not the main tools shown in this book. The Glossary describes (almost) all libraries mentioned, with brief descriptions of their purpose.

All of the code in this book is released to the Public Domain, or as Creative Commons CC0 if your jurisdiction lacks a clear mechanism for placing content in the Public Domain. The URL https://github.com/PacktPublishing/Cleaning-Data-for-Effective-Data-Science contains the code directly printed in this book, and small modules or libraries supporting the techniques demonstrated, under the same terms. All of the datasets utilized are provided at the author’s website at https://www.gnosis.cx/cleaning/. Some datasets may have different license terms, but only ones with reasonably open terms for use and modification are utilized. Because datasets are often large, this book reproduces directly only very small ones; for larger data, I will often show a few representative sections in the text.

Running the Book

This book is itself written using Jupyter notebooks. This manner of creation allows for (almost) all the code within the book to be actively run before publication. The repository given above provides instructions and configuration files for creating a similar working environment. Code samples shown will usually be accompanied by the actual output of running them. For example, Python code:

from src.intro_students import data, cleaned
print(data)
Student#,Last Name,First Name,Favorite Color,Age
1,Johnson,Mia,periwinkle,12
2,Lopez,Liam,blue,green,13
3,Lee,Isabella,,11
4,Fisher,Mason,gray,-1
5,Gupta,Olivia,9,102
6,,Robinson,,Sophia,,blue,,12
cleaned
Student_No   Last_Name   First_Name   Favorite_Color    Age
         1     Johnson          Mia       periwinkle   12.0
         2       Lopez         Liam       blue-green   13.0
         3         Lee     Isabella        <missing>   11.0
         4      Fisher        Mason             gray    NaN
         5       Gupta       Olivia            sepia    NaN
         6    Robinson       Sophia             blue   12.0

Likewise in this configuration, I can run R code equally well. At times the code samples will show data being transferred between the R and Python kernels.

%load_ext rpy2.ipython
%%R -i cleaned
library('tibble')
# Select and rename columns
tibble(First=cleaned$First_Name, 
       Last=cleaned$Last_Name, 
       Age=cleaned$Age)
# A tibble: 6 x 3
  First    Last       Age
  <chr>    <chr>    <dbl>
1 Mia      Johnson     12
2 Liam     Lopez       13
3 Isabella Lee         11
4 Mason    Fisher     NaN
5 Olivia   Gupta      NaN
6 Sophia   Robinson    12

Command-line tools will also be shown within code cells, for example:

%%bash
sed s/,,/,/g data/students.csv |
    cut -f2,3 -d, |
    tail -n +2 |
    tr , ' ' |
    sort
Fisher Mason
Gupta Olivia
Johnson Mia
Lee Isabella
Lopez Liam
Robinson Sophia

The code in this book was run using the versions of the main programming languages, Python and R, shown below. Other tools, such as Bash, standard shell utilities, and (in one section) Scala, are also used; the first two are very stable across versions and should not vary in behavior. The large majority of the code shown will work at least a few versions back for the main languages, and most likely will continue to work for several versions forward (but the future is unwritten). The specific libraries used, which are numerous, may possibly change behavior between versions.

import sys
sys.version
'3.9.0 | packaged by conda-forge | (default, Oct 14 2020, 22:59:50) \n[GCC 7.5.0]'
%%R
R.version.string
[1] "R version 4.0.3 (2020-10-10)"

Using this Book

Slovenliness is no part of data science...cleanliness is indeed next to godliness.

–cf. John Wesley

This book is intended to be suitable for use either by a self-directed reader or in more structured academic, training, or certification courses. Each chapter is accompanied by exercises at its end that ask readers or students to complete tasks related to the material they have just learned. The book repository contains additional discussion of some exercises, but avoids presenting explicit solutions for mere copy-paste.

Instructors are encouraged to contact the author if they wish to plan course material around this book. Under a consulting arrangement, I am happy to provide solution code, suggestions on use of the exercises and other content, and so on.

The datasets and supporting materials for this book are available at the repository described above, and will be needed to engage fully with some of the more open-ended problems presented. These extra materials allow a more interactive use of the book than reading alone would. However, sufficient explanation to understand the content from the written material alone is also provided in the text.

Throughout this book I am strongly opinionated about a number of technical questions. I do not believe it will be difficult to distinguish my opinions from the mere facts I also present. I have worked in this area for a number of years, and I hope to share with readers the conclusions I have reached. Of course, even book authors are fallible beings, and if you decide to disagree with claims I make, I hope and wish that you will gain great benefit both from what you learn anew and what you are able to reformulate in strengthening your own opinions and conclusions.

This book does not use heavy mathematics or statistics, but there are references to concepts therein from time to time. Some concepts are described briefly in the Glossary. Readers who want to brush up on these concepts might consider these books:

  • Think Stats: Exploratory Data Analysis in Python, Allen B. Downey, 2014 (O’Reilly Media; available both in free PDF and HTML versions, and as a printed book).
  • All of Statistics: A Concise Course in Statistical Inference, Larry Wasserman, 2004 (Springer).

This book is also not focused on the ethics of data visualization, but I have tried to be conscientious about the plots I use throughout the text; a number of good texts consider these issues in greater depth.

Data Hygiene

Throughout this book, I show you a variety of ways to modify datasets from the original versions you receive. Sometimes these transformations are between data formats or in-memory representations. At other times we impute, massage, sample, aggregate, or collate data.

Every time some transformation is made on data, we bring in certain assumptions or goals of our own; these may well be—and ideally should be—well motivated by task purpose or numeric and statistical analysis. However, they remain assumptions that could be wrong.

It is crucial to good practice of data science to version datasets as we work with them. When we draw some conclusion, or even simply when we prepare for the next transformation step, it is important to indicate which version of the data this action is based on. There are several different ways in which datasets may be versioned.

If a dataset is of moderate size, and if the transformations made are not themselves greatly time consuming, versioning within program flow is a good choice. For example, in Python-like pseudo-code:

data1 = read_format(raw_data)
data2 = transformation_1(data1)
data3 = transformation_2(data2)
# ... etc ...

When you use any version, anywhere else in a large program, it is clear from the variable name (or lookup key, etc.) which version is involved, and problems can be more easily diagnosed.

If a dataset is somewhat larger in size—to the point where keeping a number of near-copies in memory is a resource constraint—it is possible instead to track changes simply as metadata on the working dataset. This does not allow simultaneous access to multiple versions in code, but is still very useful for debugging and analysis. Again, in pseudo-code:

data = Annotated(read_format(raw_data))
inplace_transform_1(data)
data.version = "Transformed by step 1"
# ... actions on data ...
inplace_transform_2(data)
data.version = "Transformed by step 2"
# ... etc ...

At any part of an overall program, you can at least verify the version (or other metadata) associated with the dataset.
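
The Annotated wrapper in the pseudo-code above is not a real library class; a minimal sketch of one way it might be implemented (the class name and attributes are purely illustrative) could look like:

import numpy as np
import pandas as pd

class Annotated:
    """Carry version metadata alongside a working dataset."""
    def __init__(self, data, version="raw"):
        self.data = data          # the working dataset (e.g., a DataFrame)
        self.version = version    # human-readable description of its state

data = Annotated(pd.DataFrame({"Age": [12, 13, 11, -1]}),
                 version="Loaded from source, no transformations yet")
data.data["Age"] = data.data["Age"].replace(-1, np.nan)   # in-place transformation
data.version = "Transformed by step 1: -1 sentinel converted to missing"
print(data.version)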

For transformations that you wish to persist longer than the run of a single program, use of version control systems (VCSs) is highly desirable. Most VCSs allow a concept of a branch where different versions of files can be maintained in parallel. If available, use of this capability is often desirable. Even if your dataset versions are strictly linear, it is possible to revert to a particular earlier version if necessary. Using accurate and descriptive commit messages is a great benefit to data versioning.

Most VCSs are intelligent about storing as few bytes as possible to describe changes. It is often possible for them to calculate a “minimal change set” to describe a transformation rather than simply storing an entirely new near-copy for each version. Whether or not your VCS does this with the formats you work with, data integrity and data provenance should be a more prominent concern than the potential need to allocate more disk space. Of late, Git is the most popular VCS; but the advice here can equally be followed using Apache Subversion, Mercurial, Perforce, Microsoft Visual SourceSafe, IBM Rational ClearCase, or any other modern VCS. Indeed, an older system like Concurrent Versions System (CVS) suffices for this purpose.
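
For example, a typical session with Git might look roughly like the following (the file name, branch name, and commit message are only illustrative):

git checkout -b impute-missing-ages        # parallel branch for this data version
git add data/students-cleaned.csv
git commit -m "Treat -1 and out-of-range ages as missing values"
git checkout main                          # the prior version remains available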

Exercises

None of the exercises throughout this book depend on using any specific programming language. In the discussion, Python is used most frequently, followed by R, with occasional use of other programming languages. But all exercises simply present one or more datasets and ask you to perform some task with them. Achieving those goals using the programming language of your choice is wonderful (subject to any constraints your instructor may provide if this book is used in formal pedagogy).

The toy tabular data on students given as an example is available at:

https://www.gnosis.cx/cleaning/students.csv

For this exercise, create a cleaned-up version of the data following the assumptions illustrated in the code samples shown. Use your favorite programming language and tools; whatever you choose, the cleaned-up data should have these properties:

  • Consistent doubled commas should be read as a single delimiter.
  • Missing data in the Favorite Color field should be substituted with the string <missing>.
  • Student ages should be between 9 and 14, and all other values are considered missing data.
  • Some colors are numerically coded, but should be unaliased. The mapping is:

Number  Color       Number  Color
1       beige       6       alabaster
2       eggshell    7       sandcastle
3       seafoam     8       chartreuse
4       mint        9       sepia
5       cream       10      lemon

Using the small test dataset is a good way to test your code. But try also manually adding more rows with similar, or different, problems in them, and see how well your code produces a reasonable result. We have not discussed tools to accomplish this exercise yet, although you likely have used a programming language capable of solving it. Try to solve it now, but you can come back to this after later chapters if you prefer.

***

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Cleaning-Data-for-Effective-Data-Science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801071291_ColorImages.pdf.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected], and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.