As mentioned in the last two chapters, analyzing data in the real world often requires know-how beyond the typical introductory data analysis curriculum. For example, we rarely get a neatly formatted, tidy dataset with no errors, junk, or missing values. More often, we get messy, unwieldy datasets.
What makes a dataset messy? People in different roles have different ideas about what constitutes messiness. Some regard any data that violates the assumptions of a parametric model as messy. Others see messiness in datasets where the number of observations is severely imbalanced across the categories of a categorical variable. Some examples of things that I would consider messy are:
- Many missing values (NAs)
- Misspelled names in categorical variables
- Inconsistent data coding
- Numbers in the same column recorded in different units
- Mis-recorded data and data entry mistakes
- Extreme outliers
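
As a rough sketch of how a few of the problems in this list might be spotted programmatically, the Python snippet below runs three simple checks on a tiny dataset. Everything in it is an assumption made for illustration: the toy records, the column names (`species`, `sex`, `mass_g`), the helper functions, and the cutoff of 3 median absolute deviations for flagging outliers. It is one possible starting point, not a complete audit.

```python
from collections import Counter
from statistics import median

# Hypothetical toy records; column names and values are invented for illustration.
rows = [
    {"species": "Adelie", "sex": "M", "mass_g": 3750.0},
    {"species": "adelie", "sex": "male", "mass_g": 3800.0},   # inconsistent coding
    {"species": "Adelie", "sex": "F", "mass_g": None},        # missing value
    {"species": "Gentoo", "sex": "F", "mass_g": 5.2},         # possibly kg, not g
    {"species": "Gentooo", "sex": None, "mass_g": 4900.0},    # misspelled category
    {"species": "Gentoo", "sex": "M", "mass_g": 52000.0},     # possible entry mistake
]


def count_missing(rows):
    """Count missing (None) values per column; columns with none are omitted."""
    counts = Counter()
    for row in rows:
        for col, val in row.items():
            if val is None:
                counts[col] += 1
    return dict(counts)


def distinct_values(rows, col):
    """Tally the distinct non-missing values in a column.

    Near-duplicate spellings and mixed coding schemes show up as
    separate keys in the tally.
    """
    return Counter(row[col] for row in rows if row[col] is not None)


def flag_outliers(rows, col, k=3.0):
    """Return values more than k median absolute deviations from the median.

    The default k is an arbitrary choice for this sketch.
    """
    values = [row[col] for row in rows if row[col] is not None]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [v for v in values if mad > 0 and abs(v - med) > k * mad]


print(count_missing(rows))
print(distinct_values(rows, "species"))
print(distinct_values(rows, "sex"))
print(flag_outliers(rows, "mass_g"))
```

Checks like these only surface candidates for a closer look. Whether a flagged value is a unit mix-up, a typo, or a genuine extreme observation still has to be decided by someone who knows how the data were collected.
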
Since there are an infinite number of ways that data can be messy...