R Data Analysis Cookbook - Second Edition

By: Kuntal Ganguly, Shanthi Viswanathan, Viswa Viswanathan

Overview of this book

Data analytics with R has emerged as a very important focus for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data. This book will show you how you can put your data analysis skills in R to practical use, with recipes catering to basic as well as advanced data analysis tasks. Right from acquiring your data and preparing it for analysis to the more complex data analysis techniques, the book will show you how you can implement each technique in the best possible manner. You will also visualize your data using popular R packages like ggplot2 and gain hidden insights from it. Starting with basic data analysis concepts, from handling your data to creating basic plots, you will master the more advanced data analysis techniques like performing cluster analysis, and generating effective analysis reports and visualizations. Throughout the book, you will get to know the common problems and obstacles you might encounter while implementing each of the data analysis techniques in R, with ways to overcome them in the easiest possible way. By the end of this book, you will have all the knowledge you need to become an expert in data analysis with R, and put your skills to the test in real-world scenarios.

Removing cases with missing values

Datasets come with varying amounts of missing data. When we have abundant data, we sometimes (not always) want to eliminate the cases that have missing values for one or more variables. This recipe applies when we want to eliminate cases that have any missing values, as well as when we want to selectively eliminate cases that have missing values for a specific variable alone.

Getting ready

Download the missing-data.csv file from the code files for this chapter to your R working directory. Read the data from the missing-data.csv file, while taking care to identify the string used in the input file for missing values. In our file, missing values are shown with empty strings:

> dat <- read.csv("missing-data.csv", na.strings="") 
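To see what na.strings changes, here is a minimal sketch that reads a few made-up CSV lines through the text argument of read.csv() instead of the book's missing-data.csv. The column names and values are invented purely for illustration:

```r
# Hypothetical CSV content, not the book's missing-data.csv
csv_text <- "Name,Income,Phone
A,89800,Yes
B,,No
C,51000,"

# Default na.strings: the empty field in the character column stays as ""
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# na.strings = "": empty fields in every column are read as NA
sample_dat <- read.csv(text = csv_text, na.strings = "",
                       stringsAsFactors = FALSE)

is.na(raw$Phone)         # the empty Phone value is not flagged as missing
is.na(sample_dat$Phone)  # the empty Phone value is now NA
```

Blank fields in numeric columns are read as NA either way, so the setting matters mostly for character columns.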

How to do it...

To get a data frame that has only the cases with no missing values for any variable, use the na.omit() function:

> dat.cleaned <- na.omit(dat) 

Now dat.cleaned contains only those cases from dat that have no missing values in any of the variables.
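As a rough, self-contained illustration, the following applies na.omit() to a small hypothetical data frame (toy, which stands in for dat and is not part of the chapter's files) and inspects what was dropped:

```r
# Hypothetical toy data frame standing in for dat
toy <- data.frame(
  Age    = c(25, NA, 41, 33),
  Income = c(52000, 61000, NA, 48000)
)

toy.cleaned <- na.omit(toy)

nrow(toy.cleaned)               # number of complete cases kept
rownames(toy.cleaned)           # original row labels are preserved
attr(toy.cleaned, "na.action")  # records which rows were dropped
```

The na.action attribute can be handy when you later need to report how many cases were excluded.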

How it works...

The na.omit() function internally uses the is.na() function, which tests whether its argument is NA. When applied to a single value, it returns a single Boolean value. When applied to a collection, it returns a Boolean vector with one element per value:

> is.na(dat[4,2]) 
[1] TRUE

> is.na(dat$Income)
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[10] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
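Because is.na() also accepts a whole data frame, it can be combined with colSums() and which() to summarize where the missing values are. The sketch below uses a hypothetical toy data frame rather than dat:

```r
# Hypothetical toy data frame standing in for dat
toy <- data.frame(
  Age    = c(25, NA, 41, 33),
  Income = c(52000, 61000, NA, NA)
)

na.matrix <- is.na(toy)          # logical matrix with the same shape as toy
na.counts <- colSums(na.matrix)  # number of missing values per variable
na.rows   <- which(is.na(toy$Income))  # row positions with missing Income

na.counts
na.rows
```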

There's more...

You will sometimes need to do more than just eliminate the cases with any missing values. We discuss some options in this section.

Eliminating cases with NA for selected variables

We might sometimes want to selectively eliminate cases that have NA only for a specific variable. The example data frame has two missing values for Income. To get a data frame with only these two cases removed, use:

> dat.income.cleaned <- dat[!is.na(dat$Income),] 
> nrow(dat.income.cleaned)
[1] 25
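The same idea extends to more than one selected variable by combining is.na() tests. This is a sketch on a hypothetical toy data frame; the Age and Phone columns are assumptions made for the example:

```r
# Hypothetical toy data frame standing in for dat
toy <- data.frame(
  Age    = c(25, NA, 41, 33),
  Income = c(52000, 61000, NA, 48000),
  Phone  = c("Yes", NA, "No", NA),
  stringsAsFactors = FALSE
)

# Keep rows where both Age and Income are present; Phone is ignored
keep    <- !is.na(toy$Age) & !is.na(toy$Income)
toy.sel <- toy[keep, ]

toy.sel  # row 4 is kept even though its Phone value is NA
```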

Finding cases that have no missing values

The complete.cases() function takes a data frame or table as its argument and returns a Boolean vector with TRUE for rows that have no missing values, and FALSE otherwise:

> complete.cases(dat) 
[1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
[10] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
[19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Rows 4, 6, 13, and 17 have at least one missing value. Instead of using the na.omit() function, we can do the following as well:

> dat.cleaned <- dat[complete.cases(dat),] 
> nrow(dat.cleaned)
[1] 23
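The sketch below, again on a hypothetical toy data frame, shows how to list the incomplete rows directly and checks that indexing with complete.cases() keeps the same rows as na.omit(). One difference is that the indexed result carries no na.action attribute:

```r
# Hypothetical toy data frame standing in for dat
toy <- data.frame(
  Age    = c(25, NA, 41, 33),
  Income = c(52000, 61000, NA, 48000)
)

incomplete <- which(!complete.cases(toy))  # rows with at least one NA

by.index <- toy[complete.cases(toy), ]
by.omit  <- na.omit(toy)

incomplete
identical(rownames(by.index), rownames(by.omit))  # same rows are kept
is.null(attr(by.index, "na.action"))              # no record of dropped rows
```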

Converting specific values to NA

Sometimes, we might know that a specific value in a data frame actually means that the data was not available. For example, in the dat data frame, a value of 0 for Income probably means that the data is missing. We can convert these to NA by a simple assignment:

> dat$Income[dat$Income==0] <- NA 
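When more than one placeholder value stands for missing data, %in% offers a compact variant. In this sketch the vector and the -999 code are hypothetical and not taken from missing-data.csv:

```r
# Hypothetical income values; 0 and -999 are assumed to mean "not available"
income <- c(52000, 0, NA, 48000, -999)

# %in% returns FALSE (not NA) for existing NA entries,
# so the logical index itself contains no NA values
income[income %in% c(0, -999)] <- NA

income
```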

Excluding NA values from computations

Many R functions return NA when any part of the data they work on is NA. For example, computing the mean or sd on a vector with at least one NA value returns NA as the result. To exclude NA values from the computation, set the na.rm argument to TRUE:

> mean(dat$Income) 
[1] NA

> mean(dat$Income, na.rm = TRUE)
[1] 65763.64
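The following sketch shows the same argument with sd() and counts how many values actually enter the computation. The income vector is hypothetical:

```r
# Hypothetical income values with one missing entry
income <- c(52000, NA, 61000, 48000)

mean.all  <- mean(income)                # NA, because one value is missing
mean.used <- mean(income, na.rm = TRUE)  # mean of the three available values
sd.used   <- sd(income, na.rm = TRUE)    # sd of the three available values
n.used    <- sum(!is.na(income))         # number of values actually used

mean.used
sd.used
n.used
```

Reporting n.used alongside the statistic makes it clear how much data the result rests on.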