R Data Analysis Cookbook - Second Edition

By: Kuntal Ganguly, Shanthi Viswanathan, Viswa Viswanathan

Overview of this book

Data analytics with R has emerged as a very important focus for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data. This book will show you how to put your data analysis skills in R to practical use, with recipes catering to basic as well as advanced data analysis tasks. From acquiring your data and preparing it for analysis to more complex data analysis techniques, the book will show you how to implement each technique effectively. You will also visualize your data using popular R packages such as ggplot2 and gain hidden insights from it. Starting with basic data analysis concepts, from handling your data to creating basic plots, you will move on to more advanced techniques such as performing cluster analysis and generating effective analysis reports and visualizations. Throughout the book, you will get to know the common problems and obstacles you might encounter while implementing each of the data analysis techniques in R, along with ways of overcoming them. By the end of this book, you will have the knowledge you need to become an expert in data analysis with R and put your skills to the test in real-world scenarios.

Handling missing data

In most real-world problems, data is likely to be incomplete because of incorrect data entry, faulty equipment, or improperly coded data. In R, missing values are represented by the symbol NA (not available) and are often the first obstacle in predictive modeling. So, it's always a good idea to check for missing data in a dataset before proceeding with further predictive analysis. This recipe shows you how to handle missing data.
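As a minimal sketch of such a check, the following base R snippet counts the NA values in each column. The toy data frame and its values are hypothetical stand-ins, not the housing data used in this recipe.

```r
# Toy data frame (hypothetical values, not the housing dataset)
toy <- data.frame(
  rad     = c(1, NA, 5, 24, NA),
  ptratio = c(15.3, 17.8, NA, 20.2, 18.7),
  medv    = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# is.na() returns a logical matrix; summing each column counts the NAs
na.counts <- colSums(is.na(toy))
print(na.counts)

# TRUE if the data frame contains at least one missing value
has.missing <- any(is.na(toy))
print(has.missing)
```

A non-zero count for a column is the signal to decide between deleting and imputing, as the rest of this recipe discusses.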

Getting ready

R provides three simple ways to handle missing values:

Deleting the observations.
Deleting the variables.
Replacing the values with mean, median, or mode.

Install the Hmisc package in your R environment as follows:

> install.packages("Hmisc")

If you have not already downloaded the files for this chapter, do it now and ensure that the housing-with-missing-value.csv file is in your R working directory.

How to do it...

Once the file is ready, proceed as follows:

Load the CSV data from the files:

> housing.dat <- read.csv("housing-with-missing-value.csv",header = TRUE, stringsAsFactors = FALSE)

Check summary of the dataset:

> summary(housing.dat)

The output reports the number of NA values for each column that has any; here, the ptratio and rad columns contain missing values.

Delete the observations with missing values from the dataset; list-wise deletion removes every row that contains an NA:

> housing.dat.1 <- na.omit(housing.dat)

Remove rows that have NAs in any column other than those listed in drop_na (here, NAs in rad are tolerated):

> drop_na <- c("rad")
> housing.dat.2 <- housing.dat[complete.cases(housing.dat[ , !(names(housing.dat) %in% drop_na)]), ]

Finally, verify the dataset with summary statistics:

> summary(housing.dat.1$rad)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   4.000   5.000   9.599  24.000  24.000

> summary(housing.dat.1$ptratio)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  12.60   17.40   19.10   18.47   20.20   22.00

> summary(housing.dat.2$rad)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  1.000   4.000   5.000   9.599  24.000  24.000      35

> summary(housing.dat.2$ptratio)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  12.60   17.40   19.10   18.47   20.20   22.00

Delete the variables that have the most missing observations:

# Deleting a single column containing many NAs
# (work on a copy so that housing.dat keeps its rad column for the later steps)
> housing.dat.3 <- housing.dat
> housing.dat.3$rad <- NULL

# Deleting multiple columns containing NAs:
> drops <- c("ptratio","rad")
> housing.dat.4 <- housing.dat[ , !(names(housing.dat) %in% drops)]

Finally, verify the dataset with summary statistics:

> summary(housing.dat.4)

Load the library:

> library(Hmisc)

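Replace the missing values with the mean, median, or mode. These are alternatives: once a column has been imputed it no longer contains NAs, so apply only one of the following to a given column: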

#replace with mean
> housing.dat$ptratio <- impute(housing.dat$ptratio, mean)
> housing.dat$rad <- impute(housing.dat$rad, mean)

#replace with median
> housing.dat$ptratio <- impute(housing.dat$ptratio, median)
> housing.dat$rad <- impute(housing.dat$rad, median) 

#replace with mode/constant value
> housing.dat$ptratio <- impute(housing.dat$ptratio, 18)
> housing.dat$rad <- impute(housing.dat$rad, 6)

Finally, verify the dataset with summary statistics:

> summary(housing.dat)

How it works...

When you have a large number of observations in your dataset and all the classes to be predicted are sufficiently represented by the data points, deleting the observations with missing values is unlikely to introduce bias or distort the proportions of the output classes.

In the housing.dat dataset, we saw from the summary statistics that the dataset has two columns, ptratio and rad, with missing values.

The na.omit() function removes every row that has a missing value in any column of your dataset. The complete.cases() function returns a logical vector that flags the rows without missing values; applying it to a subset of the columns lets you remove rows based on missing values in those particular columns only.
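The difference between the two functions can be sketched on a small data frame. The toy data below is hypothetical and only illustrates which rows each call keeps.

```r
# Toy data frame (hypothetical values): rows 2 and 5 miss rad, row 3 misses ptratio
toy <- data.frame(
  rad     = c(1, NA, 5, 24, NA),
  ptratio = c(15.3, 17.8, NA, 20.2, 18.7),
  medv    = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# List-wise deletion: drop every row with an NA in any column
toy.all <- na.omit(toy)

# Column-specific deletion: drop rows only when ptratio is missing;
# rows with an NA in rad are kept
toy.some <- toy[complete.cases(toy[, "ptratio", drop = FALSE]), ]

print(nrow(toy.all))             # rows left after list-wise deletion
print(nrow(toy.some))            # rows left after checking ptratio only
print(sum(is.na(toy.some$rad)))  # NAs in rad that survive the second approach
```

Checking only selected columns keeps more rows, at the cost of leaving NAs in the columns that were not checked.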

Sometimes, one or more variables have many more missing values than the rest of the variables in the dataset. In that case, it is often better to remove the variable, unless it is an important predictor that makes a lot of business sense. Assigning NULL to a column of the data frame is an easy way of removing it from the dataset.
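One possible way to decide which variables to drop is to compute the share of missing values per column and compare it with a cut-off. The sketch below uses base R on hypothetical toy data; the 0.5 threshold is an arbitrary assumption for illustration, not a recommendation from this recipe.

```r
# Toy data frame (hypothetical values): rad is missing in 3 of 5 rows
toy <- data.frame(
  rad     = c(1, NA, 5, NA, NA),
  ptratio = c(15.3, 17.8, NA, 20.2, 18.7),
  medv    = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# Share of missing values in each column (mean of a logical vector)
na.share <- colMeans(is.na(toy))
print(na.share)

# Hypothetical cut-off: drop columns with more than 50% missing values
max.na.share <- 0.5
keep <- names(na.share)[na.share <= max.na.share]
toy.reduced <- toy[, keep, drop = FALSE]

# Equivalent removal of a single known column by assigning NULL to a copy
toy.copy <- toy
toy.copy$rad <- NULL

print(names(toy.reduced))
print(names(toy.copy))
```

Working on a copy leaves the original data frame untouched, which matters if later steps still need the removed column.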

Both deletion approaches discard data: removing observations reduces the number of rows in the dataset, and removing a variable drops a column altogether. The alternative is to replace the missing values with the mean, median, or mode, which is a crude way of treating missing values. Depending on the context, such as when the variation is low or when the variable has low leverage over the response/target, such a naive approximation is acceptable and could give satisfactory results. The impute() function in the Hmisc package provides an easy way to replace missing values with the mean, the median, or a constant such as the mode.
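To make the idea of mean, median, and mode replacement concrete, here is a base R sketch that does not use Hmisc. The helper names impute.with and stat.mode are hypothetical, introduced only for this illustration, and the input vector is made up.

```r
# Hypothetical numeric vector with two missing values
x <- c(18, NA, 20, 18, NA, 22)

# Replace every NA in v with the supplied value (hypothetical helper)
impute.with <- function(v, value) {
  v[is.na(v)] <- value
  v
}

# Most frequent non-missing value of a numeric vector (hypothetical helper);
# table() ignores NA by default, and the first value wins on ties
stat.mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

# Each alternative starts from the original vector x
x.mean   <- impute.with(x, mean(x, na.rm = TRUE))
x.median <- impute.with(x, median(x, na.rm = TRUE))
x.mode   <- impute.with(x, stat.mode(x))

print(x.mean)
print(x.median)
print(x.mode)
```

Each variant is computed from the original vector, because a vector that has already been imputed has no NAs left to replace.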

There's more...

Sometimes it is better to understand the pattern of missing values in the dataset through visualization before deciding whether to eliminate or impute them.

Understanding missing data pattern

Let us use the md.pattern() function from the mice package to get a better understanding of the pattern of missing data.

> library(mice)
> md.pattern(housing.dat)

The output shows that 466 samples are complete and 40 samples are missing only the ptratio value.
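The kind of count that md.pattern() reports can be approximated in base R by labelling each row with the columns it is missing and tabulating the labels. This is only a rough sketch on hypothetical toy data, not a reproduction of the mice output format.

```r
# Toy data frame (hypothetical values, not the housing dataset)
toy <- data.frame(
  rad     = c(1, NA, 5, 24, NA),
  ptratio = c(15.3, 17.8, NA, 20.2, 18.7),
  medv    = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# Logical matrix: TRUE where a value is missing
miss <- is.na(toy)

# Label each row with the names of its missing columns, or "complete"
row.pattern <- apply(miss, 1, function(r) {
  if (any(r)) paste(colnames(miss)[r], collapse = "+") else "complete"
})

# Number of rows that share each missingness pattern
pattern.counts <- table(row.pattern)
print(pattern.counts)
```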

Next, we will visualize the missing information in the housing data using the aggr() function from the VIM package:

> library(VIM)
> aggr_plot <- aggr(housing.dat, col=c('blue','red'), numbers=TRUE, sortVars=TRUE, labels=names(housing.dat), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

The plot shows that about 92.1% of the samples are complete and only 7.9% are missing the ptratio value.
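These percentages follow from the counts quoted earlier (466 complete samples and 40 samples missing ptratio). The sketch below redoes that arithmetic and then shows, on hypothetical toy data, how the share of complete rows could be computed directly with complete.cases().

```r
# Counts quoted in the text
n.complete        <- 466
n.missing.ptratio <- 40
n.total           <- n.complete + n.missing.ptratio

# Percentages rounded to one decimal place
pct.complete <- round(100 * n.complete / n.total, 1)
pct.missing  <- round(100 * n.missing.ptratio / n.total, 1)
print(c(complete = pct.complete, missing = pct.missing))

# The same idea on a data frame (hypothetical toy values)
toy <- data.frame(
  rad     = c(1, NA, 5, 24, NA),
  ptratio = c(15.3, 17.8, NA, 20.2, 18.7)
)

# complete.cases() is TRUE for rows without NAs; its mean is the complete share
share.complete <- mean(complete.cases(toy))
print(share.complete)
```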
