Clean Data

By: Megan Squire

Overview of this book

Is much of your time spent doing tedious tasks such as cleaning dirty data, accounting for lost data, and preparing data to be used by others? If so, then having the right tools makes a critical difference, and will be a great investment as you grow your data science expertise.

The book starts by highlighting the importance of data cleaning in data science, and will show you how to reap rewards from reforming your cleaning process. Next, you will cement your knowledge of the basic concepts that the rest of the book relies on: file formats, data types, and character encodings. You will also learn how to extract and clean data stored in RDBMS, web files, and PDF documents, through practical examples.

At the end of the book, you will be given a chance to tackle a couple of real-world projects.

A fresh perspective


We recently read that The New York Times called data cleaning "janitor work" and said that 80 percent of a data scientist's time will be spent doing this kind of cleaning. As we can see in the following figure, despite its importance, data cleaning has not really captured the public imagination in the same way as big data, data mining, or machine learning:

Who can blame us for not wanting to gather in droves to talk about how fun and super-cool janitor work is? Well, unfortunately—and this is true for actual housekeeping chores as well—we would all be a lot better off if we just got the job done rather than ignoring it, complaining about it, and giving it various demeaning names.

Not convinced yet? Consider a different metaphor instead: you are not a data janitor; you are a data chef. Imagine you have been handed a market basket overflowing with the most gorgeous heirloom vegetables you have ever seen, each one handpicked at the peak of freshness and sustainably produced on an organic farm. The tomatoes are perfectly succulent, the lettuce is crisp, and the peppers are bright and firm. You are excited to begin cooking, but you look around and the kitchen is filthy. The pots and pans have baked-on, caked-on who-knows-what, and, as for tools, you have nothing but a rusty knife and a soggy towel. The sink is broken, and you just saw a beetle crawl out from under that formerly beautiful lettuce.

Even a beginner chef knows you should not cook in a place like this. At the very least, you will ruin that perfectly good basket of delicious goodies you have been given. At worst, you will make people sick. Plus, cooking like this is not even fun, and it would take all day to chop the veggies with an old rusty knife.

Just as you would in a kitchen, it's definitely worth spending time up front cleaning and preparing your data science workspace, your tools, and your raw materials. The old computer programming adage from the 1960s, "garbage in, garbage out", is also true of data science.