Missing data appears often in real-life data, sometimes at random, more often because of some bias in how the data was recorded and treated. All linear models work on complete numeric matrices and cannot deal directly with missing values; consequently, it is up to you to feed the algorithm suitable data to process.
Even if your initial dataset does not present any missing data, it is still possible to encounter missing values in the production phase. In such a case, the best strategy is to deal with them passively, as presented at the beginning of the chapter, by standardizing all the numeric variables: once a variable is centered on zero, a missing value can be replaced by zero, which corresponds to the mean and therefore adds nothing to the prediction.
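As a minimal sketch of this passive treatment, the following snippet standardizes a production batch using statistics learned on the training data and then sets the missing entries to zero. The arrays X_train and X_new are hypothetical toy data, and plain NumPy is used here only for illustration; it is not necessarily the code used elsewhere in the chapter.

```python
import numpy as np

# Hypothetical complete training matrix and a production batch with gaps.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])
X_new = np.array([[2.0, np.nan],
                  [np.nan, 30.0]])

# Learn the mean and standard deviation on the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Standardize the new data; NaN entries stay NaN through the arithmetic.
Z_new = (X_new - mean) / std

# Replace missing entries with 0, the training mean on the standardized scale.
Z_new = np.where(np.isnan(Z_new), 0.0, Z_new)
```

Because a zero on the standardized scale multiplies its coefficient to zero, the imputed entries leave the linear prediction unchanged apart from the intercept.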
Tip
As for indicator variables, a possible strategy to passively intercept missing values is instead to encode the presence of the label as 1 and its absence as -1, leaving the zero value for missing values.
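The encoding described in the tip can be sketched as follows. The helper encode_indicator is a hypothetical name introduced for illustration, and representing a missing value as None is an assumption about how the raw data arrives.

```python
def encode_indicator(value):
    """Map a binary indicator to 1 (present), -1 (absent), or 0 (missing)."""
    if value is None:  # assumption: None marks a missing value
        return 0
    return 1 if value else -1

# Hypothetical raw indicator column with one missing entry.
raw = [True, False, None, True]
encoded = [encode_indicator(v) for v in raw]
```

With this scheme, a missing indicator contributes nothing to the linear combination, whereas both presence and absence of the label shift the prediction.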
When missing values are present from the beginning of the project, it is certainly better to deal with them explicitly...