R Data Science Essentials

Overview of this book

With organizations increasingly embedding data science across their enterprise and with management becoming more data-driven, analysts and managers urgently need to understand the key concepts of data science. The data science concepts discussed in this book will help you make key decisions and solve the complex problems you will inevitably face in this new world. R Data Science Essentials will introduce you to various important concepts in the field of data science using R. We start by reading data from multiple sources, then move on to processing the data, extracting hidden patterns, building predictive and forecasting models, building a recommendation engine, and communicating to the user through visualizations and dashboards. By the end of this book, you will have an understanding of some very important techniques in data science, be able to implement them using R, understand and interpret the outcomes, and know how they help businesses make decisions.

Data preprocessing techniques

The first step after loading the data into R is to check for possible issues such as missing data and outliers. The preprocessing operations to apply then depend on the analysis. In almost any dataset, missing values have to be dealt with, either by excluding them from the analysis or by replacing them with a suitable value.

To make this clearer, let's use a sample dataset to perform the various operations. We will use a dataset about India named IndiaData. You can find the dataset at https://github.com/rsharankumar/R_Data_Science_Essentials. We will perform preprocessing on the dataset:

data <- read.csv("IndiaData.csv", header = TRUE)
# Count the NA values in the dataset
sum(is.na(data))
[1] 6831

After reading the dataset, we use the is.na function to flag every NA in the dataset, and then, using sum, we get the total number of NAs present. In our case, we can see that a large number of cells contain NA (the count is per cell, not per row). We can either replace each NA with the mean value of its column or remove the rows that contain NAs.
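The total alone does not show where the missing values sit. As a minimal sketch, the following breaks the count down per column and per row. It uses a small made-up data frame named toy in place of IndiaData, so the column names and values are assumptions for illustration only:

```r
# Toy data frame standing in for IndiaData (values are made up for illustration)
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  X2011  = c(10, NA, 30, 40),
  X2012  = c(NA, NA, 5, 7)
)

# Total number of NA cells, as in the text
total_na <- sum(is.na(toy))

# Number of NA cells in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that contain at least one NA
rows_with_na <- sum(!complete.cases(toy))

print(total_na)
print(na_per_column)
print(rows_with_na)
```

A per-column view like this helps decide whether a column is worth imputing or is mostly empty.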

The following loop replaces each NA with its column mean for all the numeric columns. The numeric columns are identified by sapply(data, is.numeric), and which turns that result into column positions. For each of these columns, we locate the cells that hold NA and assign them the column mean, computed using the mean function with the na.rm=TRUE parameter so that the NA values are excluded from the calculation:

for (i in which(sapply(data, is.numeric))) {
  data[is.na(data[, i]), i] <- mean(data[, i],  na.rm = TRUE)
}
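To see the effect of this imputation on a small scale, here is a sketch on a made-up data frame (toy and its values are assumptions, not part of IndiaData). It follows the same idea as the loop above but iterates over column names rather than positions:

```r
# Toy data frame with one character column and two numeric columns
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  X2011  = c(10, NA, 30, 40),
  X2012  = c(NA, NA, 5, 7),
  stringsAsFactors = FALSE
)

# Names of the numeric columns only; the character column is left alone
numeric_cols <- names(toy)[sapply(toy, is.numeric)]

for (col in numeric_cols) {
  # Mean of the observed (non-NA) values in this column
  col_mean <- mean(toy[[col]], na.rm = TRUE)
  # Overwrite only the NA cells with that mean
  toy[[col]][is.na(toy[[col]])] <- col_mean
}

print(toy)
```

Mean imputation keeps the column mean unchanged but reduces the column's variance, which is worth keeping in mind when many values are missing.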

Alternatively, we can also remove all the NA rows from the dataset using the following code:

newdata <- na.omit(data)
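Because na.omit drops a row if any of its columns is NA, it can discard more data than expected. The sketch below, again on a made-up data frame (toy is an assumption for illustration), checks how many rows are lost and shows one way to restrict the check to a single column:

```r
# Toy data frame standing in for IndiaData
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  X2011  = c(10, NA, 30, 40),
  X2012  = c(NA, NA, 5, 7)
)

# Drop every row that has an NA in any column
cleaned <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(cleaned)

# na.omit records which rows it removed in the "na.action" attribute
dropped_index <- as.integer(attr(cleaned, "na.action"))

# Alternative: drop a row only when one chosen column is NA
partial <- toy[!is.na(toy$X2011), ]

print(rows_dropped)
print(dropped_index)
print(nrow(partial))
```

Comparing nrow before and after is a quick sanity check before committing to row removal.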

The next major preprocessing activity is to identify outliers and deal with them. In R, we can do this with the outlier function from the outliers package. The function works only on numeric data, so let's consider the preceding dataset, where the NAs were replaced by the mean values. With logical=TRUE, outlier returns a logical vector that marks the value lying farthest from the mean, so it always flags a candidate and is a screening aid rather than a statistical test. We then get the positions of the flagged values using the which function and, finally, remove the rows that contain them:

install.packages("outliers")
library(outliers)

We identify the outliers in the X2012 column, which can be selected using data$X2012:

# Flag the value in X2012 that lies farthest from the column mean
outlier_tf <- outlier(data$X2012, logical = TRUE)
sum(outlier_tf)
[1] 1

# Row positions of the flagged values
find_outlier <- which(outlier_tf)
# Remove the flagged rows
newdata <- data[-find_outlier, ]
nrow(newdata)

The X2012 column considered in this example had only one flagged value, so removing it drops a single row from the dataset.
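As a point of comparison, a common alternative that needs no extra package is the interquartile range (IQR) rule, which flags values more than 1.5 times the IQR beyond the quartiles. This is a different technique from the one used above, shown here only as a sketch on made-up values (toy and its contents are assumptions):

```r
# Toy column with one clearly extreme value
toy <- data.frame(
  region = paste0("R", 1:8),
  X2012  = c(12, 14, 15, 13, 16, 14, 15, 95),
  stringsAsFactors = FALSE
)

# First and third quartiles of the column
q <- unname(quantile(toy$X2012, probs = c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# TRUE for values outside the fences
is_outlier <- toy$X2012 < lower | toy$X2012 > upper

# Logical indexing keeps all rows when nothing is flagged
filtered <- toy[!is_outlier, ]

print(sum(is_outlier))
print(nrow(filtered))
```

Filtering with a logical vector rather than negative positions is the safer choice here: indexing with an empty set of negative positions returns zero rows instead of the full data frame.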
