Cleaning is both the most important and the least glamorous phase of data analysis. With Haskell and the power of regular expressions, we can quickly identify, even in large quantities of data, the areas that need our attention. We left our cleaning problem incomplete in this chapter, and there is still plenty of data left to clean. The Gender and State columns need some serious work. They are left as an exercise for you, so that you can practice crafting regular expressions that quickly identify the fields requiring your attention.
We also discussed the unclear border between the terms structured data and unstructured data. I applied two criteria for structured data: the data is in a machine-readable format, and the data adheres to a metadata document standard. Our example dataset is still a long way from being structured. We assume that the person who aggregated this data had a metadata document in mind, but that did not spare us from performing a lot of cleaning.
Our next...