Book Image

R Data Analysis Cookbook - Second Edition

By : Kuntal Ganguly, Shanthi Viswanathan, Viswa Viswanathan
Book Image

R Data Analysis Cookbook - Second Edition

By: Kuntal Ganguly, Shanthi Viswanathan, Viswa Viswanathan

Overview of this book

Data analytics with R has emerged as a very important focus for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data. This book will show you how you can put your data analysis skills in R to practical use, with recipes catering to the basic as well as advanced data analysis tasks. Right from acquiring your data and preparing it for analysis to the more complex data analysis techniques, the book will show you how you can implement each technique in the best possible manner. You will also visualize your data using the popular R packages like ggplot2 and gain hidden insights from it. Starting with implementing the basic data analysis concepts like handling your data to creating basic plots, you will master the more advanced data analysis techniques like performing cluster analysis, and generating effective analysis reports and visualizations. Throughout the book, you will get to know the common problems and obstacles you might encounter while implementing each of the data analysis techniques in R, with ways to overcoming them in the easiest possible way. By the end of this book, you will have all the knowledge you need to become an expert in data analysis with R, and put your skills to test in real-world scenarios.
Table of Contents (14 chapters)

Imputing data

Missing values are considered to be the first obstacle in data analysis and predictive modeling. In most statistical analysis methods, list-wise deletion is the default method used to impute missing values, as shown in the earlier recipe. However, these methods are not quite good enough, since deletion could lead to information loss and replacement with simple mean or median, which doesn't take into account the uncertainty in missing values.

Hence, this recipe will show you the multivariate imputation techniques to handle missing values using prediction.

Getting ready

Make sure that the housing-with-missing-value.csv file from the code files of this chapter is in your R working directory.

You should also install the mice package using the following command:

> install.packages("mice")
> library(mice)
> housingData <- read.csv("housing-with-missing-value.csv",header = TRUE, stringsAsFactors = FALSE)

How to do it...

Follow these steps to impute data:

  1. Perform multivariate imputation:
#imputing only two columns having missing values
> columns=c("ptratio","rad")

> imputed_Data <- mice(housingData[,names(housingData) %in% columns], m=5, maxit = 50, method = 'pmm', seed = 500)

  1. Generate complete data:
> completeData <- complete(imputed_Data)
  1. Replace the imputed column values with the housing.csv dataset:
> housingData$ptratio <- completeData$ptratio
> housingData$rad <- completeData$rad
  1. Check for missing values:
> anyNA(housingData)

How it works...

As we already know from our earlier recipe, the housing.csv dataset contains two columns, ptratio and rad, with missing values.

The mice library in R uses a predictive approach and assumes that the missing data is Missing at Random (MAR), and creates multivariate imputations via chained equations to take care of uncertainty in the missing values. It implements the imputation in just two steps: using mice() to build the model and complete() to generate the completed data.

The mice() function takes the following parameters:

  • m: It refers to the number of imputed datasets it creates internally. Default is five.
  • maxit: It refers to the number of iterations taken to impute the missing values.
  • method: It refers to the method used in imputation. The default imputation method (when no argument is specified) depends on the measurement level of the target column and is specified by the defaultMethod argument, where defaultMethod = c("pmm", "logreg", "polyreg", "polr").
  • logreg: Logistic regression (factor column, two levels).
  • polyreg: Polytomous logistic regression (factor column, greater than or equal to two levels).
  • polr: Proportional odds model (ordered column, greater than or equal to two levels).

We have used predictive mean matching (pmm) for this recipe to impute the missing values in the dataset.

The anyNA() function returns a Boolean value to indicate the presence or absence of missing values (NA) in the dataset.

There's more...

Previously, we used the impute() function from the Hmisc library to simply impute the missing value using defined statistical methods (mean, median, and mode). However, Hmisc also has the aregImpute() function that allows mean imputation using additive regression, bootstrapping, and predictive mean matching:

> impute_arg <- aregImpute(~ ptratio + rad , data = housingData, n.impute = 5)

> impute_arg

argImpute() automatically identifies the variable type and treats it accordingly, and the n.impute parameter indicates the number of multiple imputations, where five is recommended.

The output of impute_arg shows R² values for predicted missing values. The higher the value, the better the values predicted.

Check imputed variable values using the following command:

> impute_arg$imputed$rad