When you discard cases that have any missing values, you lose the information carried by the non-missing values in those cases. Instead, you may sometimes want to impute reasonable values for the missing ones, that is, values that will not skew the results of the analysis very much.

# Replacing missing values with the mean

# Getting ready

Download the `missing-data.csv` file and store it in your R environment's working directory.

# How to do it...

Read data and replace missing values:

> dat <- read.csv("missing-data.csv", na.strings = "")

> dat$Income.imp.mean <- ifelse(is.na(dat$Income), mean(dat$Income, na.rm=TRUE), dat$Income)

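After this, the new `Income.imp.mean` column holds the mean of the non-missing `Income` values wherever `Income` was `NA`, and the original value everywhere else. The `Income` column itself is left unchanged.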

# How it works...

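The preceding `ifelse()` function takes a test as its first argument, here `is.na(dat$Income)`. Wherever the test is `TRUE`, it returns the mean of the non-missing values (its second argument); otherwise, it returns the original value from `dat$Income` (its third argument).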
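As a minimal sketch, the same steps can be tried on inline data instead of the real file. The three-row CSV text and the `toy` data frame below are hypothetical stand-ins with made-up values; only `read.csv()`, `ifelse()`, `is.na()`, and `mean()` are used, exactly as in the recipe.

```r
# Hypothetical three-row stand-in for missing-data.csv (values are made up)
csv.text <- "Income,Phone_type
100,Android
,iPhone
300,
"
toy <- read.csv(text = csv.text, na.strings = "")

# The empty fields come back as NA in both columns
is.na(toy$Income)      # FALSE  TRUE FALSE
is.na(toy$Phone_type)  # FALSE FALSE  TRUE

# Same ifelse() pattern as in the recipe: test, value if TRUE, value if FALSE
toy$Income.imp.mean <- ifelse(is.na(toy$Income),
                              mean(toy$Income, na.rm = TRUE),
                              toy$Income)
toy$Income.imp.mean    # 100 200 300
```

The mean of the two observed incomes (100 and 300) is 200, so only the second row changes.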

# There's more...

You cannot impute the mean when a categorical variable has missing values, so you need a different approach. Even for numeric variables, you might sometimes not want to impute the mean for missing values. One often-used approach is discussed here.
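To see why the mean is not an option for a categorical variable, the short sketch below calls `mean()` on a hypothetical vector of phone types (the values are made up for illustration). For character or factor input, `mean()` issues a warning and returns `NA`, so there is nothing to impute.

```r
# Hypothetical categorical values with one missing entry
phone <- c("Android", NA, "iPhone", "Android")

# mean() is undefined for non-numeric data: it warns and returns NA
m.chr <- suppressWarnings(mean(phone, na.rm = TRUE))
is.na(m.chr)  # TRUE

# The same holds when the variable is stored as a factor
m.fct <- suppressWarnings(mean(factor(phone), na.rm = TRUE))
is.na(m.fct)  # TRUE
```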

# Imputing random values sampled from non-missing values

If you want to impute random values sampled from the non-missing values of the variable, you can use the following two functions:

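    # Replace each NA in a with a value drawn at random from the observed values
    rand.impute <- function(a) {
      missing <- is.na(a)
      n.missing <- sum(missing)
      a.obs <- a[!missing]
      imputed <- a
      # Sample positions rather than values: sample(a.obs, ...) would draw
      # from 1:a.obs if a.obs were a single number
      imputed[missing] <- a.obs[sample.int(length(a.obs), n.missing, replace = TRUE)]
      return(imputed)
    }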

    # Add a "<name>.imputed" column for each column index in cols
    random.impute.data.frame <- function(dat, cols) {
      nms <- names(dat)
      for (col in cols) {
        name <- paste(nms[col], ".imputed", sep = "")
        dat[name] <- rand.impute(dat[, col])
      }
      dat  # return the augmented data frame
    }

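With these two functions in place, you can use the following to impute random values for both `Income` and `Phone_type`, passed here as column positions 1 and 2. The function returns a copy of the data frame with the new `.imputed` columns added, so assign the result to a variable if you want to keep it: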

> dat <- read.csv("missing-data.csv", na.strings="")

> random.impute.data.frame(dat, c(1,2))
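Because the imputed values are drawn at random, the output differs from run to run. The sketch below shows one way to make the draws repeatable with `set.seed()` and to check the result. It uses a hypothetical `toy` data frame with made-up values, and it repeats a condensed version of `rand.impute()` only so that the snippet runs by itself.

```r
# Condensed restatement of the recipe's rand.impute(), repeated here so the
# snippet is self-contained
rand.impute <- function(a) {
  missing <- is.na(a)
  obs <- a[!missing]
  a[missing] <- obs[sample.int(length(obs), sum(missing), replace = TRUE)]
  a
}

# Hypothetical data: one numeric and one categorical column, both with NAs
toy <- data.frame(Income = c(100, NA, 300, NA, 500),
                  Phone_type = c("Android", "iPhone", NA, "Android", NA),
                  stringsAsFactors = FALSE)

set.seed(42)  # fix the random draws so the result is repeatable
toy$Income.imputed <- rand.impute(toy$Income)
toy$Phone_type.imputed <- rand.impute(toy$Phone_type)

# No NA values remain, and every filled-in value already occurs in the column
anyNA(toy$Income.imputed)                                # FALSE
all(toy$Income.imputed %in% c(100, 300, 500))            # TRUE
all(toy$Phone_type.imputed %in% c("Android", "iPhone"))  # TRUE
```

The observed values are left as they were; only the positions that held `NA` are filled, which is why the approach works for categorical as well as numeric variables.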