R Data Science Essentials

Overview of this book

With organizations increasingly embedding data science across their enterprise and with management becoming more data-driven, analysts and managers urgently need to understand the key concepts of data science. The data science concepts discussed in this book will help you make key decisions and solve the complex problems you will inevitably face in this new world. R Data Science Essentials will introduce you to various important concepts in the field of data science using R. We start by reading data from multiple sources, then move on to processing the data, extracting hidden patterns, building predictive and forecasting models, building a recommendation engine, and communicating to the user through stunning visualizations and dashboards. By the end of this book, you will have an understanding of some very important techniques in data science, be able to implement them using R, understand and interpret the outcomes, and know how they help businesses make decisions.

Bringing data to a usable format

We covered reading data in R, understanding the data types, and performing various operations on the data. Now, we will look at a few concepts that are used just before performing an analysis or building a model. While performing an analysis, we might not need to study the entire dataset and can focus on a subset of it; on the other hand, we might have to combine data from multiple sources. These are the concepts covered in this section.

The most commonly used operation is selecting the desired columns from the dataset. While building a model, we will not use all the columns in the dataset but only those that are relevant. To select columns, we can either specify their names or positions, or simply exclude the columns that are not required.

# selecting columns by position
newdata <- data[c(1, 5:10)]
head(newdata)
# excluding columns by negative position
newdata <- data[c(-2, -3, -4, -11)]
head(newdata)
                   mpg drat    wt  qsec vs am gear
Mazda RX4         21.0 3.90 2.620 16.46  0  1    4
Mazda RX4 Wag     21.0 3.90 2.875 17.02  0  1    4
Datsun 710        22.8 3.85 2.320 18.61  1  1    4
Hornet 4 Drive    21.4 3.08 3.215 19.44  1  0    3
Hornet Sportabout 18.7 3.15 3.440 17.02  0  0    3
Valiant           18.1 2.76 3.460 20.22  1  0    3

In the preceding code, we select columns by their position. The first assignment selects the first column and the 5th to 10th columns of the dataset, whereas the second assignment uses negative indices to remove four columns (the 2nd, 3rd, 4th, and 11th). Both commands yield the same result.
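The text mentions that columns can also be chosen by name, which the positional code does not show. The following is a minimal sketch of that approach. It assumes that `data` holds R's built-in `mtcars` dataset (this matches the output printed above, but the assignment itself is not shown in this section); the variable names `wanted`, `unwanted`, `by_name`, and `by_exclusion` are illustrative only.

```r
# Assumption: `data` is the built-in mtcars dataset.
data <- mtcars

# Select columns by name instead of by position.
wanted <- c("mpg", "drat", "wt", "qsec", "vs", "am", "gear")
by_name <- data[wanted]

# Exclude columns by name: keep every column whose name is not in `unwanted`.
unwanted <- c("cyl", "disp", "hp", "carb")
by_exclusion <- data[!(names(data) %in% unwanted)]

# Compare with the positional selection used in the text.
identical(by_name, data[c(1, 5:10)])
identical(by_name, by_exclusion)
```

Selecting by name tends to be more robust than selecting by position, because the code keeps working if the column order of the dataset changes.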

We may also need to filter the data based on a condition. A single model often does not fit the whole population well, so we may need to build separate models for groups that behave differently. This can be achieved by subsetting the dataset. The following code keeps only the cars that have an mpg greater than 25:

newdata <- data[which(data$mpg > 25), ]
newdata
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
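As a sketch of two common alternatives to `which`, the same filter can be written with plain logical indexing or with the base R `subset` function, which also makes it easy to combine conditions. This again assumes that `data` is the built-in `mtcars` dataset; the names `high_mpg` and `high_mpg_4gear` are illustrative.

```r
# Assumption: `data` is the built-in mtcars dataset.
data <- mtcars

# Logical indexing: rows where the condition is TRUE are kept.
# Unlike which(), this would also return NA rows if mpg contained missing values.
high_mpg <- data[data$mpg > 25, ]

# subset() lets us refer to columns directly and combine conditions with & or |.
high_mpg_4gear <- subset(data, mpg > 25 & gear == 4)

nrow(high_mpg)
nrow(high_mpg_4gear)
```

Because `mtcars` has no missing values, the logical-indexing version returns the same rows as the `which` version used in the text.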

We might also need to work with a sample of the dataset. For example, while building a linear or logistic regression model, we need two datasets: one for training and the other for testing. In these cases, we need to choose a random sample. This can be done using the following code:

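# draw 10 random rows without replacement
# (the result is not named `sample`, to avoid confusion with the sample function)
sampledata <- data[sample(1:nrow(data), 10, replace=FALSE),]
sampledata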
                  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Honda Civic      30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Porsche 914-2    26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Merc 450SLC      15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
Duster 360       14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Fiat X1-9        27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Fiat 128         32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Lotus Europa     30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Toyota Corona    21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Mazda RX4 Wag    21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
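The training and testing split mentioned earlier can be sketched in the same way. This is one possible approach rather than a prescribed one: the 70/30 ratio and the seed value are arbitrary choices made for illustration, and `data` is again assumed to be the built-in `mtcars` dataset.

```r
# Assumption: `data` is the built-in mtcars dataset.
data <- mtcars

# Fix the random seed so that the split is reproducible (42 is arbitrary).
set.seed(42)

# Draw row indices for the training set: 70% of the rows, without replacement.
n <- nrow(data)
train_idx <- sample(seq_len(n), size = floor(0.7 * n), replace = FALSE)

# Rows that were drawn form the training set; the remaining rows form the test set.
train <- data[train_idx, ]
test  <- data[-train_idx, ]

nrow(train)
nrow(test)
```

Sampling the indices once and using negative indexing for the test set guarantees that no row appears in both sets.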

The call to sample drew 10 random rows from the dataset without replacement. In addition, we might have to combine two different datasets. Data can be combined both row-wise and column-wise. Row-wise combination works as follows:

sample1 <- data[sample(1:nrow(data), 10, replace=FALSE),]
sample2 <- data[sample(1:nrow(data), 5, replace=FALSE),]
newdata <- rbind(sample1, sample2)

The preceding code uses the rbind function to combine two datasets that share the same columns. Alternatively, if the two datasets have the same number of rows but different columns, we can combine them using the cbind or merge functions:

newdata1 <- data[c(1,5:7)]
newdata2 <- data[c(8:11)]
newdata <- cbind(newdata1, newdata2)

When two datasets share a common column, we can use the merge function to combine them. By default, merge joins the datasets on the columns that they have in common.
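A minimal sketch of merge follows. The section does not show a merge call, so this example constructs its own inputs: it assumes `data` is the built-in `mtcars` dataset, and the key column `car` and the data frames `cars`, `left`, and `right` are hypothetical names created only for this illustration.

```r
# Assumption: `data` is the built-in mtcars dataset.
data <- mtcars

# Move the row names into an ordinary column so that it can serve as the join key.
cars <- data
cars$car <- rownames(data)
rownames(cars) <- NULL

# Two datasets that share only the key column `car`.
left  <- cars[c("car", "mpg", "wt")]
right <- cars[c("car", "hp", "gear")]

# Reverse the row order of `right` to show that merge() matches rows
# by key value rather than by position (unlike cbind).
right <- right[rev(seq_len(nrow(right))), ]

# Join on the common column; by = "car" makes the key explicit.
merged <- merge(left, right, by = "car")
head(merged)
```

By default, merge keeps only the rows whose key appears in both datasets; setting the `all`, `all.x`, or `all.y` argument to TRUE keeps the unmatched rows as well and fills the missing values with NA.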

These are the essential concepts needed to prepare a dataset for the analyses that will be discussed in the next few chapters.
