After all that work, it looks like The New York Times was right. As this simple exercise shows, data cleaning does account for about 80 percent of the effort of answering even a tiny data-oriented question: in this case, talking through the rationale and choices for data cleaning took 700 words of the 900-word case study. Data cleaning really is a pivotal part of the data science process, one that requires both an understanding of technical issues and some value judgments. As part of data cleaning, we even had to take into account the desired outcomes of the analysis and visualization steps, even though we had not yet completed them.
Having considered the role of data cleaning as presented in this chapter, we can see even more clearly how improvements in our cleaning effectiveness could quickly add up to substantial time savings.
The next chapter describes a few of the fundamentals required of any "data chef" who wants to move into a bigger, better "kitchen": file formats, data types, and character encodings.