When working with real-world applications, we often must contend with missing data. H2O includes a function, h2o.impute(), that imputes variables using the mean, median, or mode, optionally within groups defined by other variables.
To examine how to impute missing data this way, we will use the small Iris dataset on flowers. In particular, we will set the petal width and length values to missing for the species "setosa" and then impute their values:
## setup iris data with some missing
d <- as.data.table(iris)
d[Species == "setosa", c("Petal.Width", "Petal.Length") := .(NA, NA)]
h2o.dmiss <- as.h2o(d, destination_frame="iris_missing")
h2o.dmeanimp <- as.h2o(d, destination_frame="iris_missing_imp")
First, we will do a simple mean imputation. This has to be done one column at a time:
## mean imputation
missing.cols <- colnames(h2o.dmiss)[apply(d, 2, anyNA)]
for (v in missing.cols) {
  h2o.dmeanimp <- h2o.impute(h2o.dmeanimp, column = v)
}
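To make the effect of column-wise mean imputation concrete, the following is a minimal sketch of the same idea in base R, without an H2O cluster. It is an illustration only and does not show how H2O implements imputation; the names d2 and col.mean are hypothetical and introduced just for this example.

```r
## base R sketch of global mean imputation (illustrative; not H2O's implementation)
d2 <- iris  # d2 is a hypothetical copy used only in this sketch
d2[d2$Species == "setosa", c("Petal.Width", "Petal.Length")] <- NA

## columns that contain at least one missing value
missing.cols <- names(d2)[vapply(d2, anyNA, logical(1))]

for (v in missing.cols) {
  ## mean of the observed (non-missing) values in this column
  col.mean <- mean(d2[[v]], na.rm = TRUE)
  ## replace every missing entry with that single value
  d2[[v]][is.na(d2[[v]])] <- col.mean
}

## count of remaining missing values per column
colSums(is.na(d2))
## distinct imputed values across the setosa rows
unique(d2[d2$Species == "setosa", missing.cols])
```

Because the fill value is computed over the whole column, every setosa row receives the same number, and that number comes entirely from the other two species.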
One problem with imputing...