R Data Science Essentials

Overview of this book

With organizations increasingly embedding data science across their enterprise and with management becoming more data-driven, analysts and managers urgently need to understand the key concepts of data science. The data science concepts discussed in this book will help you make key decisions and solve the complex problems you will inevitably face in this new world. R Data Science Essentials will introduce you to various important concepts in the field of data science using R. We start by reading data from multiple sources, then move on to processing the data, extracting hidden patterns, building predictive and forecasting models, building a recommendation engine, and communicating to the user through stunning visualizations and dashboards. By the end of this book, you will have an understanding of some very important techniques in data science, be able to implement them using R, understand and interpret the outcomes, and know how they help businesses make decisions.

Bringing data to a usable format


We covered reading the data in R, understanding the data types, and performing various operations on the data. Now, we will look at a few concepts that are used just before performing an analysis or building a model. While performing an analysis, we might not need to study the entire dataset and can focus on just a subset of it, or, on the other hand, we might have to combine data from multiple data sources. These are the concepts covered in this section.

The most commonly used functionality will be to select the desired column from the dataset. While building the model, we will not be using all the columns in the dataset but just some of them that are more relevant. In order to select the column, we can either specify the column name or number, or simply delete the columns that are not required.

newdata <- data[c(1,5:10)]
head(newdata)
# excluding columns by position
newdata <- data[c(-2, -3, -4, -11)]
head(newdata) 
                   mpg drat    wt  qsec vs am gear
Mazda RX4         21.0 3.90 2.620 16.46  0  1    4
Mazda RX4 Wag     21.0 3.90 2.875 17.02  0  1    4
Datsun 710        22.8 3.85 2.320 18.61  1  1    4
Hornet 4 Drive    21.4 3.08 3.215 19.44  1  0    3
Hornet Sportabout 18.7 3.15 3.440 17.02  0  0    3
Valiant           18.1 2.76 3.460 20.22  1  0    3

In the preceding code, we selected the columns by their position. The first assignment selects the first column and the 5th through 10th columns of the dataset, whereas the second assignment removes the four specified columns (2, 3, 4, and 11). Both commands yield the same result.
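Columns can also be selected or dropped by name, which is often easier to read than positions. The following is a minimal sketch that assumes `data` is the built-in `mtcars` data frame, as the printed output suggests; the names `by_name`, `drop_cols`, and `by_drop` are only illustrative.

```r
# Assumption: `data` is the built-in mtcars data frame.
data <- mtcars

# Select columns by name instead of by position.
by_name <- data[c("mpg", "drat", "wt", "qsec", "vs", "am", "gear")]

# Drop columns by name: keep every column whose name is not in drop_cols.
drop_cols <- c("cyl", "disp", "hp", "carb")
by_drop <- data[!(names(data) %in% drop_cols)]

# Both approaches should match the positional selection data[c(1, 5:10)].
identical(by_name, by_drop)
identical(by_name, data[c(1, 5:10)])
```

Selecting by name keeps working if the column order of the dataset changes, whereas positional indexes would silently pick different columns.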

We may also need to filter the data based on a condition. A single model often cannot serve the whole population, so we may have to build separate models for groups that behave differently. This can be achieved by subsetting the dataset. In the following code, we keep only the cars that have an mpg greater than 25:

newdata <- data[which(data$mpg > 25), ]
newdata
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
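The same filter can be written with the base R `subset` function, and conditions can be combined with `&` or `|`. This is a small sketch, again assuming `data` is `mtcars`; the second condition (`am == 1`, manual transmission) is chosen only for illustration.

```r
# Assumption: `data` is the built-in mtcars data frame.
data <- mtcars

# subset() evaluates the condition inside the data frame,
# so columns can be referred to without the data$ prefix.
high_mpg <- subset(data, mpg > 25)

# Logical indexing with a combined condition:
# cars with mpg above 25 that also have a manual transmission.
high_mpg_manual <- data[data$mpg > 25 & data$am == 1, ]

nrow(high_mpg)
nrow(high_mpg_manual)
```

Wrapping the condition in `which`, as in the original code, has one practical benefit: rows where the condition evaluates to `NA` are dropped instead of appearing as rows filled with `NA`.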

We might also need to work with a sample of the dataset. For example, while building a regression or logistic model, we need two datasets: one for training and the other for testing. In these cases, we need to choose a random sample. This can be done using the following code:

# named sample_data so the variable does not mask the sample() function
sample_data <- data[sample(1:nrow(data), 10, replace=FALSE),]
sample_data
                  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Honda Civic      30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Porsche 914-2    26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Merc 450SLC      15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
Duster 360       14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Fiat X1-9        27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Fiat 128         32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Lotus Europa     30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Toyota Corona    21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Mazda RX4 Wag    21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
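The training and testing split mentioned above can be sketched as follows. The 70/30 ratio and the seed value are assumptions made for this illustration, not values prescribed by the text, and `data` is again assumed to be `mtcars`.

```r
# Assumption: `data` is the built-in mtcars data frame.
data <- mtcars

# Fix the random seed so the split is reproducible (the value is arbitrary).
set.seed(42)

# Draw row indexes for roughly 70% of the rows, without replacement.
n <- nrow(data)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

# Rows in train_idx form the training set; the remaining rows form the test set.
train <- data[train_idx, ]
test <- data[-train_idx, ]

nrow(train)
nrow(test)
```

Because the test set is built with negative indexing on the same index vector, every row lands in exactly one of the two sets.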

We drew a random sample of 10 rows from the dataset. In addition to sampling, we might have to combine two different datasets. Let's see how this can be achieved. We can combine data both row-wise and column-wise as follows:

sample1 <- data[sample(1:nrow(data), 10, replace=FALSE),]
sample2 <- data[sample(1:nrow(data), 5, replace=FALSE),]
newdata <- rbind(sample1, sample2)

The preceding code combines two datasets that share the same columns by stacking their rows with the rbind function. Alternatively, if the two datasets have the same number of rows but different columns, we can combine them using the cbind or merge functions:

newdata1 <- data[c(1,5:7)]
newdata2 <- data[c(8:11)]
newdata <- cbind(newdata1, newdata2)

When we have two different datasets with a common column, we can use the merge function to combine them. By default, merge joins the datasets on the columns they have in common.
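A minimal sketch of merge follows. It assumes `data` is `mtcars`; the key column `car` and the two helper data frames `left` and `right` are hypothetical and exist only to give the two datasets a common column.

```r
# Assumption: `data` is the built-in mtcars data frame.
data <- mtcars

# Hypothetical key column built from the row names.
data$car <- rownames(data)

# Two datasets that share only the `car` column.
left <- data[c("car", "mpg", "wt")]
right <- data[1:5, c("car", "hp", "gear")]

# Default behaviour: keep only the rows whose key appears in both datasets.
merged <- merge(left, right, by = "car")

# all.x = TRUE keeps every row of `left`; unmatched rows get NA for hp and gear.
merged_all <- merge(left, right, by = "car", all.x = TRUE)

nrow(merged)
nrow(merged_all)
```

Unlike cbind, which pairs rows purely by position, merge matches rows on the key values, so the two datasets do not need to have the same number of rows or the same row order.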

These are the essential concepts necessary to prepare the dataset for the analysis, which will be discussed in the next few chapters.