Big data, data mining, machine learning, and visualization—it seems like data is at the center of everything great happening in computing lately. From statisticians to software developers to graphic designers, everyone is suddenly interested in data science. The confluence of cheap hardware, better processing and visualization tools, and massive amounts of freely available data means that we can now discover trends and make predictions more accurately and more easily than ever before.
What you might not have heard, though, is that all of these data science hopes and dreams have to contend with the fact that data is messy. Usually, data has to be moved, compressed, cleaned, chopped, sliced, diced, and subjected to any number of other transformations before it is ready to be used in the algorithms or visualizations that we think of as the heart of data science.
In this chapter, we will cover:
A simple six-step process you can follow for data science, including cleaning
Helpful guidelines to communicate how you cleaned your data
Some tools that you might find helpful for data cleaning
An introductory example that shows how data cleaning fits into the overall data science process