R Data Science Essentials

Overview of this book

With organizations increasingly embedding data science across the enterprise and management becoming more data-driven, analysts and managers urgently need to understand the key concepts of data science. The data science concepts discussed in this book will help you make key decisions and solve the complex problems you will inevitably face in this new world. R Data Science Essentials will introduce you to various important concepts in the field of data science using R. We start by reading data from multiple sources, then move on to processing the data, extracting hidden patterns, building predictive and forecasting models, building a recommendation engine, and communicating to the user through stunning visualizations and dashboards. By the end of this book, you will have an understanding of some very important techniques in data science, be able to implement them using R, understand and interpret the outcomes, and know how they help businesses make decisions.

Performing data operations


The following are the different data operations available in R:

  • Arithmetic operations

  • String operations

  • Aggregation operations

Arithmetic operations on the data

In this section, we will see the arithmetic operations that can be performed on the data. R supports addition, subtraction, multiplication, division, exponentiation, and modulus. Let's first declare two numeric vectors and apply the first four operations, printing each result:

a1 <- c(1,2,3,4,5)
b1 <- c(6,7,8,9,10)
c1 <- a1+b1
c1
[1]  7  9 11 13 15
c1 <- b1-a1
c1
[1] 5 5 5 5 5
c1 <- b1*a1
c1
[1]  6 14 24 36 50
c1 <- b1/a1
c1
[1] 6.000000 3.500000 2.666667 2.250000 2.000000

Apart from the operations shown above, R also provides exponentiation and modulus, which can be performed as follows, respectively:

c1 <- b1^a1
c1 <- b1 %% a1

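Note that these arithmetic operations work element by element on numeric vectors of the same length. If the lengths differ, R recycles the shorter vector and issues a warning when the longer length is not a multiple of the shorter one.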
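
As a minimal sketch of the remaining operators, the following block reuses the a1 and b1 vectors from above and stores each result so that it can be inspected. The integer division operator %/% and the result names (pow, rem, quo, doubled) are additions for illustration and are not part of the original example:

```r
a1 <- c(1, 2, 3, 4, 5)
b1 <- c(6, 7, 8, 9, 10)

pow <- b1^a1      # exponentiation: 6^1, 7^2, 8^3, 9^4, 10^5
rem <- b1 %% a1   # modulus (remainder of the division): 0 1 2 1 0
quo <- b1 %/% a1  # integer division: 6 3 2 2 2

# A length-1 vector is recycled across every element of a1
doubled <- a1 * 2 # 2 4 6 8 10
```

The last line shows why operands of different lengths are accepted: R reuses the shorter operand until it matches the longer one.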

We can also perform logical operations. In the following code, we assign the values 1 to 10 to a vector and then use a condition to filter it. The condition returns a logical vector: it checks every value and returns TRUE where the condition is satisfied and FALSE otherwise. Indexing x with that logical vector keeps only the values that are at least 8 or at most 5, which excludes 6 and 7:

x <- c(1:10)
x[(x>=8) | (x<=5)]
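
To make the intermediate step visible, the sketch below splits the filter into two parts. The names keep and selected are hypothetical and chosen only for this illustration:

```r
x <- 1:10

# Logical vector with one TRUE/FALSE entry per element of x
keep <- (x >= 8) | (x <= 5)

# Indexing with the logical vector returns only the TRUE positions
selected <- x[keep]  # 1 2 3 4 5 8 9 10

# which() gives the positions of the TRUE entries instead of the values
positions <- which(keep)
```

Because x holds the values 1 to 10 in order, the positions and the selected values happen to coincide here; on other data they generally would not.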

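Having seen the various operations on vectors, we will also check arithmetic operations on matrix data. In the following code, we define two identical matrices and multiply them with the * operator, which multiplies element by element (the matrix product uses the %*% operator instead). The resultant matrix is stored in newmat. The values given to rnames and cnames are illustrative placeholders, defined here so that the code runs on its own:

rnames <- paste0("R", 1:5)  # illustrative row names
cnames <- paste0("C", 1:5)  # illustrative column names
matdata1 <- matrix(1:25, nrow=5, ncol=5, dimnames=list(rnames, cnames))
matdata2 <- matrix(1:25, nrow=5, ncol=5, dimnames=list(rnames, cnames))
newmat <- matdata1 * matdata2
newmat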
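
The difference between element-wise multiplication and the matrix product is easier to see on a small case. The following is a sketch on a hypothetical 2 x 2 matrix m, which is not part of the original example:

```r
# matrix() fills column by column, so the rows of m are (1, 3) and (2, 4)
m <- matrix(1:4, nrow = 2, ncol = 2)

# Element-wise product: every entry is multiplied by the entry in the same position
elementwise <- m * m   # rows: (1, 9) and (4, 16)

# Matrix product: rows of the first operand combined with columns of the second
matprod <- m %*% m     # rows: (7, 15) and (10, 22)
```

The * operator requires both matrices to have the same dimensions, whereas %*% requires the number of columns of the first to equal the number of rows of the second.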

String operations on the data

R supports a number of string operations. Many of these string operations are useful in data manipulation such as subsetting a string, replacing a string, changing the case, and splitting the string into characters. Now we will try each one of them in R.

The following code is used to get a part of the original string using the substr function; we need to pass the original string along with its starting location and the end location for the substring:

x <- "The Shawshank Redemption"
substr(x, 5, 13)
[1] "Shawshank"

The following code searches for a pattern in a character vector using the grep function, which returns the positions of the elements that match. We first pass the pattern to be found, then the vector to search (in this case, a character vector), and finally the fixed argument, which says whether the pattern is a literal string or a regular expression. When fixed=TRUE, the pattern is treated as a literal string, whereas it is treated as a regular expression when fixed=FALSE (the default):

grep("Shawshank", c("The","Shawshank","Redemption"), fixed=TRUE)
 [1] 2
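
For comparison, here is a short sketch of the regular expression mode together with two related options, value=TRUE and grepl. The vector name titles and the patterns are assumptions made for this illustration:

```r
titles <- c("The", "Shawshank", "Redemption")

# Regular expression (fixed = FALSE is the default): elements starting with "S"
idx <- grep("^S", titles)                # 2

# value = TRUE returns the matching elements rather than their positions
vals <- grep("e", titles, value = TRUE)  # "The" "Redemption"

# grepl returns one TRUE/FALSE per element instead of positions
hits <- grepl("tion$", titles)           # FALSE FALSE TRUE
```

grepl is often the more convenient form when the result is used to filter a vector or a data frame column.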

Now, we will see how to replace a character with another. To substitute a character with a new one, we use the sub function, which replaces only the first match. In the following code, we replace the space with a comma. We pass three parameters to the function: the first specifies the pattern to be replaced (here, the regular expression \\s, which matches a whitespace character), the second gives the replacement character/string, and the third is the actual string:

sub("\\s",",","Hello There")
 [1] "Hello,There"
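
When a string contains several matches, sub changes only the first one, while the related gsub function changes all of them. A minimal sketch, using a hypothetical input string s:

```r
s <- "Hello There Big World"

# sub replaces only the first whitespace character
first_only <- sub("\\s", ",", s)   # "Hello,There Big World"

# gsub replaces every whitespace character
all_spaces <- gsub("\\s", ",", s)  # "Hello,There,Big,World"
```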

We can also split the string into characters. To perform this operation, we use the strsplit function with an empty string as the separator. The result is a list with one element per input string, which is why the output starts with [[1]]:

strsplit("Redemption", "")
[[1]]
 [1] "R" "e" "d" "e" "m" "p" "t" "i" "o" "n"
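
The same function can split on any separator, and the resulting list usually needs to be unpacked before further use. The following sketch assumes a space as the separator; the names parts and words are hypothetical:

```r
# Split a title into words; strsplit returns a list of length 1 here
parts <- strsplit("The Shawshank Redemption", " ")

# Extract the character vector held in the first list element
words <- parts[[1]]       # "The" "Shawshank" "Redemption"

# nchar counts the characters in each word
word_lengths <- nchar(words)  # 3 9 10
```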

R has a paste function that concatenates multiple strings or character variables. It is very useful for building a string dynamically. This can be achieved using the following code:

paste("Today is", date())
[1] "Today is Fri Jun 26 01:39:26 2015"

In the preceding output, there is a space between the two strings because paste separates its arguments with a space by default. We can avoid this using the similar paste0 function, which performs the same operation but joins the strings without any separator, like plain concatenation.
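
A brief sketch of the two functions side by side, along with the sep and collapse arguments of paste. The strings and result names used here are made up for illustration:

```r
# paste inserts a single space between its arguments by default
with_space <- paste("Data", "Science")             # "Data Science"

# paste0 joins its arguments with no separator
no_space <- paste0("Data", "Science")              # "DataScience"

# sep sets a custom separator between the arguments
dashed <- paste("2015", "06", "26", sep = "-")     # "2015-06-26"

# collapse joins the elements of one vector into a single string
joined <- paste(c("x", "y", "z"), collapse = ", ") # "x, y, z"
```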

We can convert a string to uppercase or lowercase using the toupper and tolower functions.
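
For example, applied to the string used earlier in this section (the result names upper and lower are chosen only for this sketch):

```r
x <- "The Shawshank Redemption"

upper <- toupper(x)  # "THE SHAWSHANK REDEMPTION"
lower <- tolower(x)  # "the shawshank redemption"
```

Converting both sides to the same case before comparing is a common way to make string matching case-insensitive.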

Aggregation operations on the data

We explored many of the arithmetic and string operations in R. Now, let's also have a look at the aggregation operations.

Mean

For this exercise, let's consider the mtcars dataset in R. Read the dataset to a variable and then use the following code to calculate mean for a numeric column:

data <- mtcars
mean(data$mpg) 
[1] 20.09062
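
The mtcars dataset has no missing values, but real data often does. As a sketch of how mean behaves in that case, the vector below is a hypothetical one with a single NA, and is not taken from the dataset:

```r
# Hypothetical vector with a missing value, for illustration only
mpg_with_gap <- c(21.0, NA, 22.8)

# By default, any NA makes the result NA
plain_mean <- mean(mpg_with_gap)

# na.rm = TRUE drops missing values first, giving the mean of 21.0 and 22.8
clean_mean <- mean(mpg_with_gap, na.rm = TRUE)  # 21.9
```

The median, sum, max, min, and sd functions accept the same na.rm argument.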

Median

The median can be obtained using the following code:

med <- median(data$mpg)
paste("Median MPG:", med)
[1] "Median MPG: 19.2"

Sum

The mtcars dataset has details about various cars. Let's see what the total horsepower of all the cars in this dataset is. We can calculate the sum using the following code:

hp <- sum(data$hp)
paste("Total HP:", hp)
[1] "Total HP: 4694"

Maximum and minimum

The maximum value or minimum value can be found using the max and min functions. Look at the following code for reference:

max_mpg <- max(data$mpg)  # named max_mpg so that the max function is not masked
min_mpg <- min(data$mpg)
paste("Maximum MPG:", max_mpg, "and Minimum MPG:", min_mpg)
[1] "Maximum MPG: 33.9 and Minimum MPG: 10.4"

Standard deviation

We can calculate the standard deviation using the sd function. Look at the following code to get the standard deviation:

sd_mpg <- sd(data$mpg)  # named sd_mpg so that the sd function is not masked
paste("Std Deviation of MPG:", sd_mpg)
[1] "Std Deviation of MPG: 6.0269480520891"
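
The functions above summarize a whole column at once. Aggregation is often done per group instead, and one way to sketch that in base R is with tapply. Grouping by the cyl column is an assumption made for this example, as are the names mpg_by_cyl and mpg_stats:

```r
data <- mtcars

# Mean mpg for each number of cylinders (the groups are 4, 6, and 8)
mpg_by_cyl <- tapply(data$mpg, data$cyl, mean)
mpg_by_cyl

# Several summary statistics for one column, collected in a named vector
mpg_stats <- sapply(
  list(mean = mean, median = median, sd = sd),
  function(f) f(data$mpg)
)
mpg_stats
```

tapply takes the values, the grouping variable, and the function to apply, and returns one result per group, named after the group.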