R Data Science Essentials

Overview of this book

With organizations increasingly embedding data science across their enterprise and with management becoming more data-driven, analysts and managers urgently need to understand the key concepts of data science. The data science concepts discussed in this book will help you make key decisions and solve the complex problems you will inevitably face in this new world. R Data Science Essentials will introduce you to various important concepts in the field of data science using R. We start by reading data from multiple sources, then move on to processing the data, extracting hidden patterns, building predictive and forecasting models, building a recommendation engine, and communicating to the user through visualizations and dashboards. By the end of this book, you will have an understanding of some very important techniques in data science, be able to implement them using R, understand and interpret the outcomes, and know how they help businesses make decisions.

Performing data operations

The following are the different data operations available in R:

Arithmetic operations
String operations
Aggregation operations

Arithmetic operations on the data

In this section, we will see the arithmetic operations that can be performed on the data. We can perform various operations such as addition, subtraction, multiplication, division, exponentiation, and modulus. Let's see how these operations are performed in R. Let's first declare two numeric vectors and apply the first four operations to them:

a1 <- c(1,2,3,4,5)
b1 <- c(6,7,8,9,10)
c1 <- a1+b1
c1
[1]  7  9 11 13 15
c1 <- b1-a1
c1
[1] 5 5 5 5 5
c1 <- b1*a1
c1
[1]  6 14 24 36 50
c1 <- b1/a1
c1
[1] 6.000000 3.500000 2.666667 2.250000 2.000000

Apart from those seen above, the other arithmetic operations are exponentiation and modulus, which can be performed as follows, respectively:

c1 <- b1^a1
c1 <- b1 %% a1

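Note that these arithmetic operations work element by element and are normally applied to numeric vectors of the same length. If the lengths differ, R recycles the shorter vector and issues a warning when the longer length is not a multiple of the shorter one.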
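As a quick check of the operators above, the following minimal sketch reuses a1 and b1 and shows the values that exponentiation and modulus produce, along with R's recycling of a shorter vector. The names pow1, mod1, and recycled are illustrative only and do not appear in the original text.

```r
a1 <- c(1, 2, 3, 4, 5)
b1 <- c(6, 7, 8, 9, 10)

# Exponentiation: each element of b1 raised to the matching element of a1
pow1 <- b1^a1
pow1      # 6 49 512 6561 100000

# Modulus: remainder of b1 divided by a1, element by element
mod1 <- b1 %% a1
mod1      # 0 1 2 1 0

# Recycling: the shorter vector is repeated to match the longer one
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled  # 11 22 13 24
```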

We can also perform logical operations. In the following code, we assign the values 1 to 10 to a vector and then use a condition to select a subset of it. The condition returns a logical vector: it checks every value and returns TRUE where the condition is satisfied and FALSE otherwise. Indexing x with that logical vector keeps only the elements marked TRUE, here the values that are at least 8 or at most 5:

x <- c(1:10)
x[(x>=8) | (x<=5)]
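To make the two steps visible, the sketch below stores the logical vector before using it as an index. The variable name keep is an assumption made for illustration.

```r
x <- c(1:10)

# Step 1: the condition alone yields one TRUE/FALSE per element
keep <- (x >= 8) | (x <= 5)
keep     # TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE

# Step 2: indexing with the logical vector keeps the TRUE positions
x[keep]  # 1 2 3 4 5 8 9 10
```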

Having seen the various operations on vectors, we will also look at arithmetic operations on matrix data. In the following code, we define two identical matrices and then multiply them with the * operator, which multiplies element by element (matrix multiplication in the linear-algebra sense uses %*% instead). The resultant matrix is stored in newmat:

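# rnames and cnames are illustrative labels; the original text does not define them
rnames <- paste0("R", 1:5)
cnames <- paste0("C", 1:5)
matdata1 <- matrix(1:25, nrow=5, ncol=5, dimnames=list(rnames, cnames))
matdata2 <- matrix(1:25, nrow=5, ncol=5, dimnames=list(rnames, cnames))
newmat <- matdata1 * matdata2
newmat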
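The difference between element-wise multiplication and the matrix product is easier to see on a small case. The following is a minimal sketch on a 2 x 2 matrix; the names m, elementwise, and matprod are hypothetical.

```r
# matrix() fills column by column, so m is
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
m <- matrix(1:4, nrow = 2, ncol = 2)

# Element-wise product: each cell multiplied by the matching cell
elementwise <- m * m
elementwise  # rows: (1, 9) and (4, 16)

# Matrix product: rows of the first matrix times columns of the second
matprod <- m %*% m
matprod      # rows: (7, 15) and (10, 22)
```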

String operations on the data

R supports a number of string operations. Many of them are useful in data manipulation, such as subsetting a string, replacing part of a string, changing the case, and splitting a string into characters. Let's try each of them in R.

The following code is used to get a part of the original string using the substr function; we need to pass the original string along with its starting location and the end location for the substring:

x <- "The Shawshank Redemption"
substr(x, 5, 13)
[1] "Shawshank"

The following code searches for a pattern in a character vector using the grep function, which returns the indices of the elements that match. The first argument is the pattern to find, the second is the vector to search (here, a character vector), and the fixed argument states whether the pattern is a literal string or a regular expression. When fixed=TRUE, the pattern is matched as a literal string, whereas it is treated as a regular expression when fixed=FALSE (the default):

grep("Shawshank", c("The","Shawshank","Redemption"), fixed=TRUE)
 [1] 2
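For comparison, here is a minimal sketch of the regular-expression mode, together with the value argument and the related grepl function from base R. The vector name titles and the result names are assumptions for illustration.

```r
titles <- c("The", "Shawshank", "Redemption")

# Regular expression (fixed = FALSE is the default): elements starting with "S"
idx <- grep("^S", titles)
idx        # 2

# value = TRUE returns the matching elements instead of their indices
ending <- grep("tion$", titles, value = TRUE)
ending     # "Redemption"

# grepl returns one TRUE/FALSE per element
has_e <- grepl("e", titles)
has_e      # TRUE FALSE TRUE
```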

Now, we will see how to replace a character with another. To substitute a character with a new one, we use the sub function, which replaces only the first match. In the following code, we replace the space with a comma. We pass three parameters to the function: the first specifies the pattern (a string or regular expression) to be replaced, the second gives the replacement, and the third is the string to operate on:

sub("\\s",",","Hello There")
 [1] "Hello,There"
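Because sub stops after the first match, a string with several spaces needs gsub, its base R counterpart that replaces every match. A minimal sketch, with the input string and variable names chosen for illustration:

```r
s <- "Hello There General"

# sub replaces only the first whitespace character
first_only <- sub("\\s", ",", s)
first_only  # "Hello,There General"

# gsub replaces every whitespace character
all_matches <- gsub("\\s", ",", s)
all_matches # "Hello,There,General"
```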

We can also split the string into characters. To perform this operation, we use the strsplit function with an empty string as the separator. The function returns a list with one element per input string, so the output is shown under [[1]]:

strsplit("Redemption", "")
[[1]]
 [1] "R" "e" "d" "e" "m" "p" "t" "i" "o" "n"
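The same function can split on a delimiter rather than into single characters. The sketch below splits on a space and uses unlist to turn the list result into a plain character vector; the names parts and chars are hypothetical.

```r
# Split on a space; the result is a list of length one
parts <- strsplit("The Shawshank Redemption", " ")
parts[[1]]     # "The" "Shawshank" "Redemption"

# unlist flattens the list into a character vector
chars <- unlist(strsplit("Redemption", ""))
length(chars)  # 10
```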

R's paste function joins multiple strings or character variables into one. It is very useful for building a string dynamically. This can be achieved using the following code:

paste("Today is", date())
[1] "Today is Fri Jun 26 01:39:26 2015"

In the preceding output, paste inserted a space between the two strings because its default separator is a single space. We can avoid this by using the similar paste0 function, which performs the same operation but joins the strings without any separator, like plain concatenation.
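A minimal sketch of the separator behavior, using base R's sep and collapse arguments; the example strings and variable names are illustrative.

```r
with_space <- paste("data", "science")
with_space   # "data science"

no_space <- paste0("data", "science")
no_space     # "datascience"

# sep sets the separator placed between the arguments
custom_sep <- paste("data", "science", sep = "_")
custom_sep   # "data_science"

# collapse joins the elements of a single vector into one string
collapsed <- paste(c("a", "b", "c"), collapse = "-")
collapsed    # "a-b-c"
```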

We can convert a string to uppercase or lowercase using the toupper and tolower functions.
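For instance, applied to the movie title used earlier (the variable names here are assumptions for illustration):

```r
title <- "The Shawshank Redemption"

upper_title <- toupper(title)
upper_title  # "THE SHAWSHANK REDEMPTION"

lower_title <- tolower(title)
lower_title  # "the shawshank redemption"
```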

Aggregation operations on the data

We have explored many of the arithmetic and string operations in R. Now, let's have a look at the aggregation operations.

Mean

For this exercise, let's consider the mtcars dataset that ships with R. Copy the dataset to a variable and then use the following code to calculate the mean of a numeric column:

data <- mtcars
mean(data$mpg) 
[1] 20.09062

Median

The median can be obtained using the following code:

med <- median(data$mpg)
paste("Median MPG:", med)
[1] "Median MPG: 19.2"

Sum

The mtcars dataset has details about various cars. Let's find the total horsepower of all the cars in this dataset. We can calculate the sum using the following code:

hp <- sum(data$hp)
paste("Total HP:", hp)
[1] "Total HP: 4694"

Maximum and minimum

The maximum value or minimum value can be found using the max and min functions. Look at the following code for reference:

max_mpg <- max(data$mpg)
min_mpg <- min(data$mpg)
paste("Maximum MPG:", max_mpg, "and Minimum MPG:", min_mpg)
[1] "Maximum MPG: 33.9 and Minimum MPG: 10.4"

Standard deviation

We can calculate the standard deviation using the sd function. Look at the following code to get the standard deviation:

sd_mpg <- sd(data$mpg)
paste("Std Deviation of MPG:", sd_mpg)
[1] "Std Deviation of MPG: 6.0269480520891"
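The aggregations above can also be gathered in one place. The following is a minimal sketch that collects them in a named numeric vector; the name mpg_stats is hypothetical, and the result names avoid reusing function names such as max or sd as variable names.

```r
data <- mtcars

# Collect the aggregates of the mpg column in a single named vector
mpg_stats <- c(
  mean   = mean(data$mpg),
  median = median(data$mpg),
  min    = min(data$mpg),
  max    = max(data$mpg),
  sd     = sd(data$mpg)
)

# Round for display
round(mpg_stats, 2)
```

Base R's summary(data$mpg) gives a similar overview, though it reports quartiles rather than the standard deviation.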
