Book Image

R Data Analysis Cookbook - Second Edition

By : Kuntal Ganguly, Shanthi Viswanathan, Viswa Viswanathan
Book Image

R Data Analysis Cookbook - Second Edition

By: Kuntal Ganguly, Shanthi Viswanathan, Viswa Viswanathan

Overview of this book

Data analytics with R has emerged as a very important focus for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data. This book will show you how you can put your data analysis skills in R to practical use, with recipes catering to the basic as well as advanced data analysis tasks. Right from acquiring your data and preparing it for analysis to the more complex data analysis techniques, the book will show you how you can implement each technique in the best possible manner. You will also visualize your data using the popular R packages like ggplot2 and gain hidden insights from it. Starting with implementing the basic data analysis concepts like handling your data to creating basic plots, you will master the more advanced data analysis techniques like performing cluster analysis, and generating effective analysis reports and visualizations. Throughout the book, you will get to know the common problems and obstacles you might encounter while implementing each of the data analysis techniques in R, with ways to overcoming them in the easiest possible way. By the end of this book, you will have all the knowledge you need to become an expert in data analysis with R, and put your skills to test in real-world scenarios.
Table of Contents (14 chapters)

Replacing missing values with the mean

When you disregard cases with any missing variables, you lose useful information that the non-missing values in that case convey. You may sometimes want to impute reasonable values (those that will not skew the results of analysis very much) for the missing values.

Getting ready

Download the missing-data.csv file and store it in your R environment's working directory.

How to do it...

Read data and replace missing values:

> dat <- read.csv("missing-data.csv", na.strings = "") 
> dat$Income.imp.mean <- ifelse(is.na(dat$Income), mean(dat$Income, na.rm=TRUE), dat$Income)

After this, all the NA values for Income will be the mean value prior to imputation.

How it works...

The preceding ifelse() function returns the imputed mean value if its first argument is NA. Otherwise, it returns the first argument.

There's more...

You cannot impute the mean when a categorical variable has missing values, so you need a different approach. Even for numeric variables, we might sometimes not want to impute the mean for missing values. We discuss an often-used approach here.

Imputing random values sampled from non-missing values

If you want to impute random values sampled from the non-missing values of the variable, you can use the following two functions:

rand.impute <- function(a) { 
missing <- is.na(a)
n.missing <- sum(missing)
a.obs <- a[!missing]
imputed <- a
imputed[missing] <- sample (a.obs, n.missing, replace=TRUE)
return (imputed)
}

random.impute.data.frame <- function(dat, cols) {
nms <- names(dat)
for(col in cols) {
name <- paste(nms[col],".imputed", sep = "")
dat[name] <- rand.impute(dat[,col])
}
dat
}

With these two functions in place, you can use the following to impute random values for both Income and Phone_type:

> dat <- read.csv("missing-data.csv", na.strings="") 
> random.impute.data.frame(dat, c(1,2))