An introductory example


To get started, let's sharpen our chef's knife with a small example that integrates the six-step framework and illustrates how to tackle a few simple cleaning issues. This example uses the publicly available Enron e-mail dataset. This is a very famous dataset consisting of e-mail messages sent to, from, and between employees working at the now-defunct Enron Corporation. As part of the U.S. Government investigation into accounting fraud at Enron, the e-mails became part of the public record and are now downloadable by anyone. Researchers in a variety of domains have found the e-mails helpful for studying workplace communication, social networks, and more.

Note

You can read more about Enron and the financial scandal that led to its demise on its Wikipedia page at http://en.wikipedia.org/wiki/Enron, and you can read about the Enron e-mail corpus itself on its separate page at http://en.wikipedia.org/wiki/Enron_Corpus.

In this example, we will implement the six-step framework on a simple data science question. Suppose we want to reveal trends and patterns in e-mail usage over time within Enron Corporation. Let's start by counting messages that were sent to/from Enron employees by date. We will then show the counts visually on a graph over time.

First, we need to download the MySQL version of the Enron corpus using the instructions at http://www.ahschulz.de/enron-email-data/. Another (backup) source for this file is https://www.cs.purdue.edu/homes/jpfeiff/enron.html. Following these instructions, we import the data into a new database schema called Enron on a MySQL server. The data is then ready to be queried using either the MySQL command-line interface or a web-based tool such as phpMyAdmin.
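If you are working from the MySQL command-line client, the import boils down to creating the schema and sourcing the downloaded dump. The following is only a minimal sketch; the dump filename (enron-mysqldump.sql here) is an assumption and may differ depending on which mirror and version you download:

-- Minimal import sketch (assumptions: you are in the mysql client and the
-- downloaded dump file is named enron-mysqldump.sql; your filename may differ)
CREATE DATABASE Enron;
USE Enron;
SOURCE enron-mysqldump.sql;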

Our first count query is shown as follows:

SELECT date(date) AS dateSent, count(mid) AS numMsg
FROM message
GROUP BY dateSent
ORDER BY dateSent;

Right away, we notice that numerous e-mails have incorrect dates. There are a number of dates that predate or postdate the existence of the corporation (for example, 1979), as well as dates from years that are simply illogical (for example, 0001 or 2044). E-mail is old, but not that old!
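A quick way to see just how far out of range the dates go is to check the minimum and maximum date values directly. This is a small sanity check of my own, not part of the original query sequence:

-- Sanity check (not from the original steps): find the earliest and latest
-- message dates to see how far outside Enron's lifetime they fall
SELECT MIN(date(date)) AS earliest, MAX(date(date)) AS latest
FROM message;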

The following table shows an excerpt of a few of the strange rows (the complete result set is about 1,300 rows long). All of these dates are formatted correctly; however, some of them are definitely wrong:

dateSent      numMsg
0002-03-05    1
0002-03-07    3
0002-03-08    2
0002-03-12    1
1979-12-31    6
1997-01-01    1
1998-01-04    1
1998-01-05    1
1998-10-30    3

These bad dates are most likely due to misconfigured e-mail clients. At this point, we have three choices for what to do:

  • Do nothing: Maybe we can just ignore the bad data and build the line graph anyway. But, as the lowest bad date is from the year 0001 and the highest is from the year 2044, we can imagine a line graph whose time axis stretches across two millennia, with roughly 1,300 tick marks each showing a count of 1 or 2. This graph does not sound very appealing or informative, so doing nothing will not work.

  • Fix the data: We could try to figure out what the correct date for each bad message was and produce a corrected dataset that we can then use to build our graph.

  • Throw out the affected e-mails: We can just make an informed decision to discard any e-mail that has a date that falls outside a predetermined window.

In order to decide between options 2 and 3, we will need to count how many messages would be affected if we kept only a 1998-2002 window. We can use the following SQL:

SELECT count(*) FROM message
WHERE year(date) < 1998 or year(date) > 2002;
Result: 325

325 messages with bad dates may initially seem like a lot, but then again, they are only about 1 percent of the entire dataset. Depending on our goals, we might decide to fix these dates manually, but let's assume here that we do not mind losing 1 percent of the messages. We can proceed cautiously toward option 3, throwing out the affected e-mails. Here is the amended query:

SELECT date(date) AS dateSent, count(mid) AS numMsg
FROM message
WHERE year(date) BETWEEN 1998 AND 2002
GROUP BY dateSent
ORDER BY dateSent;

The cleaned data now consists of 1,211 rows, each with a count. Here is an excerpt of the new dataset:

dateSent      numMsg
1998-01-04    1
1998-01-05    1
1998-10-30    3
1998-11-02    1
1998-11-03    1
1998-11-04    4
1998-11-05    1
1998-11-13    2

In this excerpt, it looks like there are two questionable dates in January 1998 and no other messages until October, at which point the messages start coming in more regularly. This seems odd, and it also raises another question: is it important that we have every date on the x axis, even if no e-mails were sent that day?

If we answer yes (it is important to show every date, even those with zero counts), this may mean going through another round of cleaning in order to produce rows that pair each missing date with a zero.
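One possible way to do that round of cleaning entirely in SQL is to generate a calendar of every date in the window and left join the message counts onto it. This is only a sketch of one approach, not the book's procedure, and it assumes MySQL 8.0 or later for the recursive common table expression:

-- Sketch of a zero-filling query (assumes MySQL 8.0+ for WITH RECURSIVE).
-- The default recursion limit is 1000 rows, so raise it to cover ~1,800 days.
SET SESSION cte_max_recursion_depth = 2000;

WITH RECURSIVE calendar AS (
  SELECT DATE('1998-01-01') AS d
  UNION ALL
  SELECT d + INTERVAL 1 DAY FROM calendar WHERE d < '2002-12-31'
)
SELECT c.d AS dateSent, COUNT(m.mid) AS numMsg
FROM calendar c
LEFT JOIN message m ON date(m.date) = c.d
GROUP BY c.d
ORDER BY c.d;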

But then again, maybe we can be more strategic about this. Whether we need zero values in our raw data actually depends on which tool we are using to create the graph and on what type of graph it is. For example, Google Spreadsheets will build a line or bar graph that automatically detects missing dates on the x axis and fills in zero values, even if they are not given in the initial dataset. In our data, these zero values would cover the mysterious missing dates from most of 1998.

The next three figures show each of these tools and how they handle zero values on a date axis. Note the long zero tails at the beginning and end of the Google Spreadsheets representation of the data shown here:

Google Spreadsheets automatically fills in any missing days with a zero.

The D3 JavaScript visualization library will do the same, filling in zero values for missing dates in a range by default, as shown in the next graph.

Tip

For a simple D3 line graph example, take a look at this tutorial: http://bl.ocks.org/mbostock/3883245.

D3 automatically fills in any missing days with a zero.

Excel also has identical date-filling behavior in its default line graph, as shown here:

Excel automatically fills in any missing days with a zero.

Next, we need to consider whether, by allowing zero values for dates, we are also making our x axis substantially longer (my count query yielded 1,211 rows, but there are a total of 1,822 days in the specified 1998-2002 range). Showing zero-count days might not be worthwhile; if the graph is that crowded, we cannot see the gaps anyway.
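If you want to check that day count against your own copy of the data, a quick way is to measure the span between the earliest and latest dates that survive the cleaning window. Again, this is a small sketch of my own rather than a query from the text:

-- Sketch (not from the original steps): how many days does the cleaned range span?
SELECT DATEDIFF(MAX(date), MIN(date)) + 1 AS daysInRange
FROM message
WHERE year(date) BETWEEN 1998 AND 2002;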

To compare, we can quickly load the same data into Google Spreadsheets (you can do this in Excel or D3 too), but this time we will select only our count column to build the graph, thereby forcing Google Spreadsheets not to show dates on the x axis. The result is the true shape of the data that came from the database count query, with no zero-count days filled in. The long tails are gone, but the overall shape of the important part of the graph (the middle) remains the same:

The graph now shows only dates with one or more messages.

Lucky for us, the shape of the data is similar, save for a shorter head and tail on the graph. Based on this comparison, and based on what we plan to do with the data (remember that all we wanted to do was create a simple line graph), we can feel good about our decision to move forward without specifically creating a dataset showing zero-count days.

When all is said and done, the line graphs reveal that Enron had several significant peaks in e-mail traffic. The largest peaks and heaviest traffic occurred in October and November of 2001, when the scandal broke. Two smaller peaks occurred around June 26-27, 2001 and December 12-13, 2000, when other newsworthy events involving Enron transpired (one involving the California energy crisis and the other a leadership change at the company).
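If you want to confirm the busiest days directly from the database rather than reading them off the graph, you can sort the cleaned counts in descending order. This is just a convenience query of my own, not part of the original walkthrough:

-- Sketch (not from the original steps): list the five busiest e-mail days
-- within the cleaned 1998-2002 window
SELECT date(date) AS dateSent, count(mid) AS numMsg
FROM message
WHERE year(date) BETWEEN 1998 AND 2002
GROUP BY dateSent
ORDER BY numMsg DESC
LIMIT 5;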

If you get excited by data analysis, you probably have all sorts of cool ideas for what to do next with this data. And now that you have clean data, those analysis tasks will hopefully be that much easier!