Once the files are ready, read the data and handle the missing values as follows (the Hmisc package is needed only for the imputation steps near the end):
- Load the CSV data from the file:
> housing.dat <- read.csv("housing-with-missing-value.csv", header = TRUE, stringsAsFactors = FALSE)
- Check summary of the dataset:
> summary(housing.dat)
The output lists summary statistics for each column; columns that contain missing values also report an NA's count.
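As a complement to summary(), the missing values can also be counted directly. The following is a minimal sketch on a small, made-up data frame (toy and its values are assumptions for illustration, not the housing data):

```r
# Hypothetical toy data frame standing in for housing.dat
toy <- data.frame(
  rad     = c(1, NA, 5, 24, NA),
  ptratio = c(15.3, 17.8, NA, 20.2, 21.0),
  medv    = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# Number of missing values in each column
na_count <- colSums(is.na(toy))

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

print(na_count)
print(na_share)
```

Applied to housing.dat, the same two calls would show which columns need attention before any deletion or imputation.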
- Delete the observations with missing values, using list-wise deletion to drop every row that contains at least one NA:
> housing.dat.1 <- na.omit(housing.dat)
- Remove rows with NAs in any column except those listed in drop_na (NAs in rad are tolerated here):
> drop_na <- c("rad")
> housing.dat.2 <- housing.dat[complete.cases(housing.dat[ , !(names(housing.dat) %in% drop_na)]), ]
- Verify both results with summary statistics; rad keeps its 35 NAs in housing.dat.2 because it was excluded from the completeness check:
> summary(housing.dat.1$rad)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 4.000 5.000 9.599 24.000 24.000
> summary(housing.dat.1$ptratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.60 17.40 19.10 18.47 20.20 22.00
> summary(housing.dat.2$rad)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 4.000 5.000 9.599 24.000 24.000 35
> summary(housing.dat.2$ptratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.60 17.40 19.10 18.47 20.20 22.00
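The difference between the two deletion strategies is easier to see on a tiny example. This is a sketch on made-up data (toy, ignore, and check_cols are hypothetical names, not part of the recipe):

```r
# Hypothetical toy data frame standing in for housing.dat
toy <- data.frame(
  rad     = c(1, NA, 5, 24, NA),
  ptratio = c(15.3, 17.8, NA, 20.2, 21.0),
  medv    = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# List-wise deletion: drop every row with at least one NA
toy.1 <- na.omit(toy)

# Tolerate NAs in `rad`: only the remaining columns must be complete
ignore <- c("rad")
check_cols <- setdiff(names(toy), ignore)
toy.2 <- toy[complete.cases(toy[, check_cols, drop = FALSE]), ]

print(nrow(toy.1))            # rows left after list-wise deletion
print(nrow(toy.2))            # rows left when `rad` is ignored
print(sum(is.na(toy.2$rad)))  # NAs that remain in `rad`
```

Using drop = FALSE keeps the subset a data frame even if only one column is checked, which complete.cases() would otherwise receive as a plain vector.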
- Delete the variables that have the most missing observations:
# Deleting a single column containing many NAs (work on a copy so that housing.dat keeps rad)
> housing.dat.3 <- housing.dat
> housing.dat.3$rad <- NULL
# Deleting multiple columns containing NAs
> drops <- c("ptratio", "rad")
> housing.dat.4 <- housing.dat[ , !(names(housing.dat) %in% drops)]
- Verify the dataset with summary statistics:
> summary(housing.dat.4)
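Rather than naming the columns by hand, the columns to drop can be chosen by their share of missing values. The sketch below assumes a cutoff of 50 percent (max_na_share is a hypothetical parameter chosen for illustration) and uses made-up data:

```r
# Hypothetical toy data frame; `rad` is missing in 3 of 5 rows
toy <- data.frame(
  rad     = c(1, NA, NA, 24, NA),
  ptratio = c(15.3, 17.8, NA, 20.2, 21.0),
  medv    = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# Assumed cutoff: drop columns with more than 50% missing values
max_na_share <- 0.5

# Share of NAs per column, then keep only columns at or below the cutoff
na_share <- colMeans(is.na(toy))
keep <- na_share <= max_na_share
toy.reduced <- toy[, keep, drop = FALSE]

print(names(toy.reduced))
```

The right cutoff depends on the analysis; a column with many NAs may still be worth imputing if it is important to the model.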
- Load the library:
> library(Hmisc)
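- Replace the missing values with the mean, the median, or a constant such as the mode. The three options below are alternatives: apply only one of them, because once the NAs are filled, a later impute() call has nothing left to replace:
# Option 1: replace with the mean
> housing.dat$ptratio <- impute(housing.dat$ptratio, mean)
> housing.dat$rad <- impute(housing.dat$rad, mean)
# Option 2: replace with the median
> housing.dat$ptratio <- impute(housing.dat$ptratio, median)
> housing.dat$rad <- impute(housing.dat$rad, median)
# Option 3: replace with a constant value, such as the mode
> housing.dat$ptratio <- impute(housing.dat$ptratio, 18)
> housing.dat$rad <- impute(housing.dat$rad, 6)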
- Finally, verify the dataset with summary statistics:
> summary(housing.dat)
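For comparison, the same three replacements can be sketched in base R without Hmisc. The helpers impute_with() and stat_mode() are hypothetical names written for this illustration, and the vector x is made-up data; base R has no built-in function for the statistical mode, so one is defined here:

```r
# Hypothetical numeric vector with two missing values
x <- c(4, NA, 5, 5, 24, NA)

# Replace every NA in x with a single value
impute_with <- function(x, value) {
  x[is.na(x)] <- value
  x
}

# Most frequent non-missing value (first one wins on ties)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  values <- unique(x)
  values[which.max(tabulate(match(x, values)))]
}

# Each result starts from the original x, so the options stay independent
x_mean   <- impute_with(x, mean(x, na.rm = TRUE))
x_median <- impute_with(x, median(x, na.rm = TRUE))
x_mode   <- impute_with(x, stat_mode(x))

print(x_mean)
print(x_median)
print(x_mode)
```

Keeping each result in its own variable avoids the pitfall of overwriting the column, where only the first imputation would have any effect.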