R Data Science Essentials

Overview of this book

With organizations increasingly embedding data science across the enterprise and management becoming more data-driven, analysts and managers urgently need to understand the key concepts of data science. The concepts discussed in this book will help you make key decisions and solve the complex problems you will inevitably face in this new world. R Data Science Essentials introduces you to important concepts in the field of data science using R. We start by reading data from multiple sources, then move on to processing the data, extracting hidden patterns, building predictive and forecasting models, building a recommendation engine, and communicating results to the user through visualizations and dashboards. By the end of this book, you will understand some very important techniques in data science, be able to implement them using R, interpret their outcomes, and know how they help businesses make decisions.

Reading data from different sources

Importing data into R is straightforward and can be done from multiple sources. The most common format for importing data is comma-separated values (CSV), which is read with the read.csv function. This is the simplest way to read data, as a single command loads the file into a data frame. Depending on the quality of the data, it may or may not require further processing:

data <- read.csv("c:/local-data.csv")
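To try read.csv without a file of your own, the following minimal sketch writes a small, made-up table to a temporary file and reads it back. The file contents, column names, and the variable names csv_path and scores are illustrative assumptions, not data from the book:

```r
# Hypothetical sample data, written to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,Asha,82.5",
             "2,Ben,77.0",
             "3,Chloe,91.2"), csv_path)

# Read the file; the first line is used as the column names by default
scores <- read.csv(csv_path)

# Inspect the structure to decide whether further processing is needed
str(scores)
head(scores)
```

Checking the output of str is a quick way to see whether each column was read with the type you expected.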

A function similar to read.csv is read.csv2. It also reads CSV files, but it is intended for the convention common in many European countries, where the comma is used as the decimal point and the semicolon is used as the separator. Data can also be read using related functions, such as read.table and read.delim. By default, read.delim reads tab-delimited files, and the read.table function can read any delimited file when supplied with suitable parameters:

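# read.delim already defaults to header=TRUE and sep="\t"; they are shown here for clarity
data <- read.delim("local-data.txt", header=TRUE, sep="\t")
# read.table needs both parameters spelled out to read the same tab-delimited file
data <- read.table("local-data.txt", header=TRUE, sep="\t")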
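The difference between read.csv and read.csv2 is easiest to see on a small file. The following is a minimal sketch, assuming a made-up semicolon-separated file with decimal commas; the file contents and the names csv2_path, weather, and weather_tbl are hypothetical:

```r
# Hypothetical semicolon-separated data with decimal commas
csv2_path <- tempfile(fileext = ".csv")
writeLines(c("city;temperature",
             "Berlin;21,5",
             "Madrid;30,25"), csv2_path)

# read.csv2 assumes sep = ";" and dec = ","
weather <- read.csv2(csv2_path)

# An equivalent call for this file, using read.table with explicit parameters
weather_tbl <- read.table(csv2_path, header = TRUE, sep = ";", dec = ",")

# The temperature column is read as numbers, not as text
weather$temperature
```

Reading the same file with read.csv would instead treat each line as a single field, because no comma acts as a separator in the expected position.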

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

All the preceding functions accept parameters that describe the format of the data source. Some of these parameters are as follows:

header: This is a logical value indicating whether the first line of the file contains the column names. When it is set to TRUE, the first line is used as the column names. The default is TRUE for read.csv and read.delim; for read.table, it is inferred from the file and is TRUE only if the first row has one fewer field than the remaining rows.
sep: This defines the field separator in the file. By default, the separator is a comma for read.csv, a tab for read.delim, and white space for the read.table function.
nrows: This specifies the maximum number of rows to read from the file. By default, the entire file is read.
row.names: This specifies which column should be used as the row names. When it is set to NULL, the rows are numbered automatically. The parameter accepts the column's position (1 represents the first column) or the column's name.
fill: When this parameter is set to TRUE, rows of unequal length can be read, and blank fields are implicitly added to the shorter rows.

These are some of the common parameters used along with the functions to read the data from a file.
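The parameters above can be combined in a single call. The following minimal sketch assumes a made-up tab-delimited file whose last row is missing a field; the file contents and the names txt_path, products, and first_two are hypothetical:

```r
# Hypothetical tab-delimited file; the last row is missing its final field
txt_path <- tempfile(fileext = ".txt")
writeLines(c("id\tproduct\tprice",
             "a1\tpen\t1.5",
             "a2\tbook\t12",
             "a3\tlamp"), txt_path)

# fill = TRUE pads the short row with NA;
# row.names = 1 uses the first column as the row names
products <- read.table(txt_path, header = TRUE, sep = "\t",
                       row.names = 1, fill = TRUE)

# nrows limits how many data rows are read (the header line is not counted)
first_two <- read.table(txt_path, header = TRUE, sep = "\t", nrows = 2)

products
first_two
```

Without fill = TRUE, the first read.table call would stop with an error because the last line has fewer fields than the others.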

So far, we have explored reading data from delimited files. We can also read data from Excel workbooks, using either the xlsx or the XLConnect package. The following code uses the xlsx package to read a worksheet from a workbook:

install.packages("xlsx")
library(xlsx)
mydata <- read.xlsx("DTH AnalysisV1.xlsx", 1)
head(mydata)

In the preceding code, we first installed the xlsx package, which is required to read Excel files. We then loaded the package using the library function and used the read.xlsx function to read the Excel file, passing an additional parameter, 1, that specifies which sheet of the workbook to read.
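A worksheet can also be selected by name rather than by position. The following is a sketch only, assuming the xlsx package and a working Java installation are available; it is skipped otherwise. The data, the sheet name plans, and the variable names are hypothetical and are not taken from the workbook used above:

```r
# Sketch only: runs the round trip when the xlsx package (and Java) is available
subscribers <- NULL
if (requireNamespace("xlsx", quietly = TRUE)) {
  xlsx_path <- tempfile(fileext = ".xlsx")

  # Hypothetical data and sheet name, used for illustration
  sample_df <- data.frame(id = 1:3, plan = c("basic", "plus", "basic"))
  xlsx::write.xlsx(sample_df, xlsx_path, sheetName = "plans", row.names = FALSE)

  # Select the worksheet by name instead of by position
  subscribers <- xlsx::read.xlsx(xlsx_path, sheetName = "plans")
  print(head(subscribers))
}
```

Selecting by name keeps the code working if sheets in the workbook are later reordered.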
