Missing data is a menace! It pops up out of nowhere and blocks analysis until it is properly taken care of. The expectation-maximization algorithm, or simply the EM algorithm, is a standard statistical technique for this problem, but it needs a lot of information on the probability distributions, structural relationships, and in-depth details of the statistical model. For that reason, an approach using the EM algorithm is ruled out here. Random forests can be used instead to overcome the missing data problem.
We will use the missForest R package to fix the missing data problem whenever we come across it in the rest of the book. The algorithm behind the missForest function and other details can be found at https://academic.oup.com/bioinformatics/article/28/1/112/219101. For any variable/column with missing data, the technique is to build a random forest for that variable from the observed values, use it to predict the missing entries, and take the OOB error of the forest as the imputation error estimate. Note that the function can handle continuous as well as categorical missing...
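As a rough illustration of the idea, the following sketch deletes a few values from the built-in iris data and imputes them with missForest. The dataset, the seed, and the number of deleted cells are assumptions made for this example only, and the missForest package must already be installed.

```r
# Minimal sketch: impute a continuous and a categorical column with missForest.
# The iris data, the seed, and the 15 deleted cells per column are illustrative
# assumptions, not values taken from the text.
library(missForest)

set.seed(123)
iris_mis <- iris
# Knock out some values in one numeric and one factor column
iris_mis[sample(nrow(iris_mis), 15), "Sepal.Length"] <- NA
iris_mis[sample(nrow(iris_mis), 15), "Species"] <- NA

# Fit random forests variable by variable and fill in the missing entries
imp <- missForest(iris_mis)

# Completed data frame with no missing values
head(imp$ximp)

# OOB-based estimate of the imputation error
# (NRMSE for continuous columns, PFC for categorical ones)
print(imp$OOBerror)
```

The imputed data frame is returned in the ximp component, and the OOBerror component reports the error estimate separately for the continuous and the categorical variables.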