Working with Missing Data
In all the examples so far, you have been dealing with datasets that are clean and easy to decipher. However, datasets in the real world are more complicated than these. One of the many problems you may have to deal with when working with datasets is missing values.
You will learn the specifics of preparing data in Chapter 3, SQL for Data Preparation. In this section, however, you will learn several strategies that you can use to handle missing data. These strategies include the following:
- Deleting rows: If only a very small number of rows (that is, less than 5% of your dataset) are missing data, then the simplest solution may be to delete those rows from your dataset. With so few rows removed, your results are unlikely to change much.
- Mean/median/mode imputation: If 5% to 25% of your data for a variable is missing, another option is to take the mean, median, or mode of that column and fill in the blanks with that value. It may provide a...
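
As a rough illustration of the two strategies described so far, the following sketch runs a few SQL statements against an in-memory SQLite database through Python's standard sqlite3 module. The customers table, its age column, and the sample values are hypothetical and exist only for this example. With just four rows, the percentages here are not meaningful in themselves; the point is the shape of the queries.

```python
import sqlite3

# Hypothetical table, column, and values, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, age REAL)")
conn.executemany(
    "INSERT INTO customers (id, age) VALUES (?, ?)",
    [(1, 30.0), (2, None), (3, 40.0), (4, 50.0)],
)

# Measure how much data is missing before choosing a strategy.
# In SQLite, "age IS NULL" evaluates to 1 or 0, so SUM counts the NULLs.
missing_pct = conn.execute(
    "SELECT 100.0 * SUM(age IS NULL) / COUNT(*) FROM customers"
).fetchone()[0]

# Strategy 1: delete rows with missing values.
# The filtered rows go into a new table so the original stays intact.
conn.execute(
    "CREATE TABLE customers_dropped AS "
    "SELECT * FROM customers WHERE age IS NOT NULL"
)
dropped_count = conn.execute(
    "SELECT COUNT(*) FROM customers_dropped"
).fetchone()[0]

# Strategy 2: mean imputation.
# AVG skips NULLs, and COALESCE substitutes the mean where age is NULL.
imputed = conn.execute(
    "SELECT id, COALESCE(age, (SELECT AVG(age) FROM customers)) AS age "
    "FROM customers ORDER BY id"
).fetchall()

print(missing_pct, dropped_count, imputed)
```

Both strategies here read from the original table rather than modifying it, which makes it easy to compare the results of each approach before committing to one.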