R Data Science Essentials

R Data Science Essentials

Overview of this book

With organizations increasingly embedding data science across their enterprise and with management becoming more data-driven it is an urgent requirement for analysts and managers to understand the key concept of data science. The data science concepts discussed in this book will help you make key decisions and solve the complex problems you will inevitably face in this new world. R Data Science Essentials will introduce you to various important concepts in the field of data science using R. We start by reading data from multiple sources, then move on to processing the data, extracting hidden patterns, building predictive and forecasting models, building a recommendation engine, and communicating to the user through stunning visualizations and dashboards. By the end of this book, you will have an understanding of some very important techniques in data science, be able to implement them using R, understand and interpret the outcomes, and know how they helps businesses make a decision.

R Data Science Essentials

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Free Chapter

Getting Started with R

Reading data from different sources

Reading data from a database

Data types in R

Data preprocessing techniques

Performing data operations

Control structures in R

Bringing data to a usable format

Summary

Exploratory Data Analysis

The Titanic dataset

Descriptive statistics

Inferential statistics

Univariate analysis

Bivariate analysis

Multivariate analysis

Summary

Pattern Discovery

Transactional datasets

Apriori analysis

Support, confidence, and lift

Generating filtering rules

Plotting

Sequential dataset

Apriori sequence analysis

Understanding the results

Business cases

Summary

Segmentation Using Clustering

Datasets

Centroid-based clustering and an ideal number of clusters

Implementation using K-means

Visualizing the clusters

Connectivity-based clustering

Visualizing the connectivity

Business use cases

Summary

Developing Regression Models

Datasets

Sampling the dataset

Logistic regression

Evaluating logistic regression

Linear regression

Evaluating linear regression

Methods to improve the accuracy

Ensemble models

Summary

Time Series Forecasting

Datasets

Extracting patterns

Forecasting using ARIMA

Forecasting using Holt-Winters

Methods to improve accuracy

Summary

Recommendation Engine

Dataset and transformation

Recommendations using user-based CF

Recommendations using item-based CF

Challenges and enhancements

Summary

Communicating Data Analysis

Dataset

Plotting using the googleVis package

Creating an interactive dashboard using Shiny

Summary

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Reading data from a database

Apart from reading the data from a local file, R allows us to read the data from different sources and different formats. If we consider any enterprise setup, the data would mostly be present in a database. It will be complicated if we import the data from the database to a local file and then perform the analysis; we should be able to access the data directly from the source. This can be achieved using R.

First, let's see how to access data from a database. In order to read data from a database, we need to establish a connection with the database, which could reside in the local system or a remote server. We can establish the connection to the database using either ODBC or JDBC, which are R packages.

We will have a detailed look at accessing the data from the database using the JDBC method. In order to perform this operation, we need to install the RJDBC and sqldf packages. The RJDBC package is used to establish a connection with the database and the sqldf package is used to write the SQL queries:

install.packages("RJDBC")
library(RJDBC)
install.packages("sqldf")
library(sqldf)

We will now learn to establish a connection with the DB. We need to set up a few things in order to connect with the DB. To use the JDBC connection, we need to download the driver. The downloaded file will depend on the database to which we are going to connect, such as SQL Server, Oracle, or PostgreSQL.

In the following case, we will connect to a SQL server database. The JDBC driver can be downloaded from http://www.microsoft.com/en-in/download/details.aspx?id=11774 in order to provide connectivity. In the following code, we will pass the driver name as well as the location of the JAR file that comes with the download to the JDBC function. The JDBC function creates a new DBI driver that can be used to start the JDBC connection.

drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", "C:/Users/Downloads/Microsoft SQL Server JDBC Driver 3.0/sqljdbc_3.0/enu/sqljdbc4.jar")

By using the dbConnect function, we establish the actual connection with the database. We need to pass the location of the database, username, and password to this function. On the successful execution of the following code, the connection will be established and we can check the connectivity using the dbGetQuery function:

conn <- dbConnect(drv, "jdbc:sqlserver://localhost;database=SAMPLE_DB", "admin", "test")
bin <- dbGetQuery(conn, "select count(*) from  sampletable")

In addition to the relational databases, we can also connect and access the non-relational databases, such as Cassandra, Hadoop, MongoDB, and so on. We will now see how to access data from a Cassandra database. In order to access the data from a Cassandra database, we need to install the RCassandra package. The connection to the database can be made using the RC.connect function. We need to pass the host IP as well as the port number to establish the connection. Then finally, we need to specify the username and password as follows to establish the connection successfully:

library(RCassandra)
conn <- RC.connect(host = "localhost", port = 9160)
RC.login(conn, username = "user", password = "password123")

In Cassandra, the container for the application data is keyspace, which is similar to schema in the relational database. We can use the RC.describe.keyspaces function to get an understanding about the data, and then using the RC.use function, we select keyspace to be used for all the subsequent operations:

RC.describe.keyspaces(conn)
RC.use(conn, keyspace = "sampleDB", cache.def = TRUE)

We can read the data using the following code and once all the readings are done, we can close the connection using RC.close:

a<-RC.read.table(conn, c.family = "Users", convert = TRUE, na.strings = "NA",
              as.is = FALSE, dec = ".")
RC.close(conn)

We can discuss the RMongo package as well as it is quite popular.

Similarly, R has the ability to read data from a table in a website using the XML package. We can also read the data of SAS, SPSS, Stata, and Systat using the Hmisc package for SPSS and SAS and foreign for Stata and Systat.

For more details about the methodology of extracting the data from these sources, find the reference at the following URLs:

While establishing connectivity with a remote system, you could face a few issues related to security and others specific to the R version and package version. Most likely, the common issues would have been discussed in the forum, stackoverflow.

R Data Science Essentials

R Data Science Essentials

Overview of this book

Related Content you might be interested in

Current Title:

R Data Science Essentials

Reading data from a database