When working with data, you will usually find that it is not perfectly clean: it may contain missing values, outliers, and similar anomalies. Handling and cleaning imperfect, or so-called dirty, data is part of every data scientist's daily life, and it can take up to 80 percent of the time we actually spend working with the data!
Dataset errors are often due to inadequate data acquisition methods, but instead of repeating and tweaking the data collection process, it is usually better (in terms of saving money, time, and other resources), or simply unavoidable, to polish the data with a few simple functions and algorithms. In this chapter, we will cover:
Different use cases of the na.rm argument of various functions
The na.action and related functions to get rid of missing data
Several packages that offer a user-friendly way of data imputation
The outliers package with several statistical tests for extreme values
How to implement Lund's outlier test on our own as...