Machine Learning with R Cookbook

By : Yu-Wei, Chiu (David Chiu)

Overview of this book

The R language is a powerful open source functional programming language. At its core, R is a statistical programming language that provides impressive tools to analyze data and create high-level graphics.

This book covers the basics of R by setting up a user-friendly programming environment and performing data ETL in R. Data exploration examples are provided that demonstrate how powerful data visualization and machine learning are in discovering hidden relationships. You will then dive into important machine learning topics, including data classification, regression, clustering, association rule mining, and dimension reduction.

Using R to manipulate data

This recipe discusses how to use built-in R functions to manipulate data. Because data manipulation is often the most time-consuming part of an analysis, it pays to know how to apply these functions to your data.

Getting ready

Ensure that you have completed the previous recipes and that R is installed on your operating system.

How to do it...

Perform the following steps to manipulate the data with R.

Subset the data using the bracket notation:

Load the dataset iris into the R session:
```
> data(iris)
```
To select values, you may use a bracket notation that designates the indices of the dataset. The first index is for the rows and the second for the columns:
```
> iris[1,"Sepal.Length"]
[1] 5.1
```
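
As a small sketch of how the two bracket indices behave, the following block shows what happens when only one column is selected and how a negative index excludes columns. The variable names (sepal.vec, sepal.df, measurements) are illustrative and are not part of the recipe:

```r
data(iris)

# Selecting a single column by name returns a plain vector by default
sepal.vec <- iris[, "Sepal.Length"]
class(sepal.vec)

# drop = FALSE keeps the result as a one-column data frame
sepal.df <- iris[, "Sepal.Length", drop = FALSE]
class(sepal.df)

# A negative index excludes columns: everything except column 5 (Species)
measurements <- iris[, -5]
names(measurements)
```

Keeping the data frame structure with drop = FALSE is useful when later code expects columns with names rather than a bare vector.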

You can also select multiple columns using c():

```
> Sepal.iris = iris[, c("Sepal.Length", "Sepal.Width")]
```

You can then use str() to summarize and display the internal structure of Sepal.iris:

```
> str(Sepal.iris)
'data.frame':  150 obs. of  2 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
```

To subset rows by index, specify the row indices in the first position of the bracket notation. The following example selects the first five records with only the Sepal.Length and Sepal.Width columns:
```
> Five.Sepal.iris = iris[1:5, c("Sepal.Length", "Sepal.Width")]
> str(Five.Sepal.iris)
'data.frame':	5 obs. of  2 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6
```

It is also possible to set conditions to filter the data. For example, you can return only the records of the setosa species with all five variables. In the following example, the first index specifies the filter condition, and the second index specifies the range of column indices to return:

```
> setosa.data = iris[iris$Species=="setosa",1:5]
> str(setosa.data)
'data.frame':	50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

Alternatively, the which function returns the indices of the elements for which a condition is TRUE. The following example returns the indices of the iris rows whose species equals setosa:

```
> which(iris$Species=="setosa")
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50
```

The indices returned by which can then be used as the row index to select the setosa records. The following example returns the setosa data with all five variables:

```
> setosa.data = iris[which(iris$Species=="setosa"),1:5]
> str(setosa.data)
'data.frame':	50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```
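
On iris the logical condition and which give the same result, but the two approaches can differ when the data contains missing values. The following sketch uses a small hypothetical data frame (toy, with made-up values) to illustrate the difference; it is not part of the original recipe:

```r
# Hypothetical data frame with one missing measurement
toy <- data.frame(id = 1:4, width = c(0.2, NA, 0.5, 0.1))

# Logical indexing: comparing NA yields NA, which produces an all-NA row
by.logical <- toy[toy$width > 0.15, ]
nrow(by.logical)

# which() keeps only the positions where the condition is TRUE,
# so the row with the missing value is dropped
by.which <- toy[which(toy$width > 0.15), ]
nrow(by.which)
```

When a column may contain NA, wrapping the condition in which is one way to avoid unexpected all-NA rows in the result.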

Subset data using the subset function:

Besides using the bracket notation, R provides a subset function that enables users to subset the data frame by observations with a logical statement.

First, subset the sepal length and sepal width out of the iris data. To do so, specify the columns to keep in the select argument:

```
> Sepal.data = subset(iris, select=c("Sepal.Length", "Sepal.Width"))
> str(Sepal.data)
'data.frame': 150 obs. of  2 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
```

This reveals that Sepal.data contains 150 observations of two variables, Sepal.Length and Sepal.Width.

You can also pass a condition to get a subset containing setosa only. The second argument of the subset function specifies the subset criteria:

```
> setosa.data = subset(iris, Species =="setosa")
> str(setosa.data)
'data.frame': 50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

Often you will want to combine several conditions when subsetting data, keeping rows that satisfy either condition (union) or both conditions (intersection). The OR (|) and AND (&) operators serve this purpose. For example, to retrieve the species of the records with Petal.Width >= 0.2 and Petal.Length <= 1.4:
```
> example.data= subset(iris, Petal.Length <=1.4 & Petal.Width >= 0.2, select=Species )
> str(example.data)
'data.frame': 21 obs. of  1 variable:
 $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```
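
The example above uses the AND operator. For completeness, here is a minimal sketch of the OR case, selecting the records of two species; the variable names (two.species, same.rows) are illustrative only:

```r
data(iris)

# | is the element-wise OR: keep rows that match either condition
two.species <- subset(iris, Species == "setosa" | Species == "virginica")
nrow(two.species)

# %in% is a more compact way to express the same union of values
same.rows <- subset(iris, Species %in% c("setosa", "virginica"))
nrow(same.rows)
```

Note that Species remains a factor with all three levels after subsetting; droplevels() can be applied if the unused level should be removed.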

Merging data: merging joins two data frames into one by a common column or by row names. The following example shows how to merge the flower.type data frame and the first three rows of iris by their common Species column:

```
> flower.type = data.frame(Species = "setosa", Flower = "iris")
> merge(flower.type, iris[1:3,], by ="Species")
  Species Flower Sepal.Length Sepal.Width Petal.Length Petal.Width
1  setosa   iris          5.1         3.5          1.4         0.2
2  setosa   iris          4.9         3.0          1.4         0.2
3  setosa   iris          4.7         3.2          1.3         0.2
```
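
By default, merge keeps only the rows whose key appears in both data frames. The sketch below illustrates this with a hypothetical lookup table (flower.color, with made-up color labels) that has no entry for virginica, and shows how the all.x argument keeps unmatched rows from the first data frame:

```r
data(iris)

# Hypothetical lookup table; "virginica" is deliberately missing
flower.color <- data.frame(Species = c("setosa", "versicolor"),
                           Color = c("blue", "violet"))

# One row from each of the three species
sample.rows <- iris[c(1, 51, 101), c("Species", "Sepal.Length")]

# Default merge: rows without a match in both data frames are dropped
inner <- merge(sample.rows, flower.color, by = "Species")
nrow(inner)

# all.x = TRUE keeps every row of the first data frame,
# filling the missing Color with NA
left <- merge(sample.rows, flower.color, by = "Species", all.x = TRUE)
left
```

Comparing the row counts before and after a merge is a quick way to notice keys that failed to match.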

Ordering data: the order function returns the row indices that sort a data frame by a specified column. The following example shows the first six records of the iris data ordered by sepal length, from largest to smallest:

```
> head(iris[order(iris$Sepal.Length, decreasing = TRUE),])
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
132          7.9         3.8          6.4         2.0 virginica
118          7.7         3.8          6.7         2.2 virginica
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica
136          7.7         3.0          6.1         2.3 virginica
106          7.6         3.0          6.6         2.1 virginica
```
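
The order function also accepts several keys, with later keys breaking ties in earlier ones. As a sketch that goes slightly beyond the recipe, the following block sorts by species first and then by sepal length in descending order within each species; negating a numeric column is one common way to reverse its direction, and the names idx and sorted.iris are illustrative:

```r
data(iris)

# First key: Species (ascending by factor level)
# Second key: Sepal.Length, negated so larger values come first
idx <- order(iris$Species, -iris$Sepal.Length)
sorted.iris <- iris[idx, ]

head(sorted.iris, 3)
```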

How it works

Before conducting data analysis, it is important to organize the collected data into a structured format. The R data frame supports this directly: you can subset, merge, and order a dataset with built-in functions. This recipe first introduces two methods to subset data: one uses the bracket notation, while the other uses the subset function. Either method can generate a subset by selecting columns and filtering rows with given criteria. The recipe then introduces the merge function to merge data frames. Finally, it shows how to use order to sort the data.

There's more...

The sub and gsub functions use regular expressions to substitute text in strings. The sub function replaces only the first match in each string, while gsub replaces all matches:

```
> sub("e", "q", names(iris))
[1] "Sqpal.Length" "Sqpal.Width"  "Pqtal.Length" "Pqtal.Width"  "Spqcies"
> gsub("e", "q", names(iris))
[1] "Sqpal.Lqngth" "Sqpal.Width"  "Pqtal.Lqngth" "Pqtal.Width"  "Spqciqs"
```
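
Because the pattern is a regular expression, characters such as the dot have a special meaning. The following sketch, which is an addition to the recipe with illustrative variable names, shows how a literal dot in the iris column names can be replaced, either with fixed = TRUE or by escaping the dot:

```r
data(iris)

# As a regular expression, "." matches any character,
# so every character of the string is replaced
masked <- gsub(".", "_", "Sepal.Length")
masked

# fixed = TRUE treats the pattern as a literal string
literal <- gsub(".", "_", names(iris), fixed = TRUE)
literal

# Escaping the dot gives the same result with a regular expression
escaped <- gsub("\\.", "_", names(iris))
escaped
```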
