Machine Learning with R Cookbook, Second Edition

By : Yu-Wei, Chiu (David Chiu)

Overview of this book

Big data has become a popular buzzword across many industries. An increasing number of people have been exposed to the term and are looking at how to leverage big data in their own businesses to improve sales and profitability. However, collecting, aggregating, and visualizing data is just one part of the equation. Being able to extract useful information from data is another task, and a much more challenging one. Machine Learning with R Cookbook, Second Edition uses a practical approach to teach you how to perform machine learning with R. Each chapter is divided into several simple recipes. Through the step-by-step instructions provided in each recipe, you will be able to construct a predictive model by using a variety of machine learning packages. In this book, you will first learn to set up the R environment and use simple R commands to explore data. The next topics cover how to perform statistical analysis and machine learning analysis and how to assess the resulting models, all of which are covered in detail later in the book. You'll also learn how to integrate R and Hadoop to create a big data analysis platform. The detailed illustrations provide all the information required to start applying machine learning to individual projects.

Applying basic statistics

R provides a wide range of statistical functions, allowing users to obtain the summary statistics of data, generate frequency and contingency tables, produce correlations, and conduct statistical inferences. This recipe covers basic statistics that can be applied to a dataset.

Getting ready

Ensure you have completed the previous recipes by installing R on your operating system.

How to do it...

Perform the following steps to apply statistics to a dataset:

Load the iris data into an R session:

> data(iris)

Observe the format of the data:

        > class(iris)
        [1] "data.frame"

The iris dataset is a data frame containing four numeric attributes (sepal length, sepal width, petal length, and petal width) and one factor, Species. For numeric values, you can compute descriptive statistics, such as mean, sd, var, min, max, median, range, and quantile. These can be applied to any of the four numeric attributes in the dataset:

        > mean(iris$Sepal.Length)
        Output:
        [1] 5.843333
        > sd(iris$Sepal.Length)
        Output:
        [1] 0.8280661
        > var(iris$Sepal.Length)
        Output:
        [1] 0.6856935
        > min(iris$Sepal.Length)
        Output:
        [1] 4.3
        > max(iris$Sepal.Length)
        Output:
        [1] 7.9
        > median(iris$Sepal.Length)
        Output:
        [1] 5.8
        > range(iris$Sepal.Length)
        Output:
        [1] 4.3 7.9
        > quantile(iris$Sepal.Length)
        Output:
        0%  25%  50%  75% 100% 
        4.3  5.1  5.8  6.4  7.9
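
As a minimal sketch of how these functions relate to one another, the following checks can be run on the same column. The variable names (x, sd_matches, and so on) are illustrative and do not come from the recipe:

```r
data(iris)
x <- iris$Sepal.Length

# The standard deviation is the square root of the sample variance
sd_matches <- isTRUE(all.equal(sd(x), sqrt(var(x))))

# range() returns the minimum and maximum as a length-two vector
range_matches <- identical(range(x), c(min(x), max(x)))

# quantile() accepts custom probabilities through the probs argument
deciles <- quantile(x, probs = c(0.1, 0.9))

# The interquartile range is the 75th percentile minus the 25th percentile
iqr_value <- IQR(x)

print(sd_matches)
print(range_matches)
print(deciles)
print(iqr_value)
```

Passing probs to quantile is useful when the default quartiles are not the cut points you need.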

The preceding example demonstrates how to apply descriptive statistics to a single variable. To obtain summary statistics on every numeric attribute of the data frame, you can use sapply. For example, to apply mean to the first four attributes in the iris data frame, ignoring any NA values by setting na.rm to TRUE:

        > sapply(iris[1:4], mean, na.rm=TRUE)
        Output:
        Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
          5.843333     3.057333     3.758000     1.199333
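
The same pattern extends to several statistics at once. The sketch below assumes you want the mean, standard deviation, and median together; the anonymous function and the name stats are illustrative choices rather than part of the recipe:

```r
data(iris)

# Apply a function that returns a named vector to each numeric column;
# sapply simplifies the result into a matrix with one column per attribute
stats <- sapply(iris[1:4], function(col) {
  c(mean   = mean(col, na.rm = TRUE),
    sd     = sd(col, na.rm = TRUE),
    median = median(col, na.rm = TRUE))
})

print(stats)
```

Each row of the resulting matrix holds one statistic and each column holds one attribute, so stats["sd", "Petal.Width"] retrieves a single value.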

As an alternative to using sapply to apply descriptive statistics to given attributes, R offers the summary function, which provides a standard set of descriptive statistics. In the following example, the summary function provides the min, first quartile (25th percentile), median, mean, third quartile (75th percentile), and max of every numeric attribute in the iris dataset, along with the count of each level of the Species factor:

        > summary(iris)
        Output:
          Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
         Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
         1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
         Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
         Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
         3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
         Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

The preceding examples show how to output descriptive statistics for individual variables. R also provides the cor function for users to investigate the relationship between variables. The following example generates a 4x4 matrix by computing the correlation of each attribute pair within the iris dataset:

        > cor(iris[,1:4])
        Output:
                     Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
        Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
        Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
        Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

R also provides a function to compute the covariance of each attribute pair within the iris dataset:

        > cov(iris[,1:4])
        Output:
                     Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
        Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
        Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
        Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
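
Correlation is covariance rescaled by the standard deviations of the two variables. The following sketch, which is not part of the original recipe, verifies that relationship on the iris data; the names cov_mat, cor_mat, sds, and manual_cor are illustrative:

```r
data(iris)
cov_mat <- cov(iris[, 1:4])
cor_mat <- cor(iris[, 1:4])

# Standard deviation of each numeric attribute
sds <- sapply(iris[1:4], sd)

# Divide each covariance by the product of the two standard deviations
manual_cor <- cov_mat / outer(sds, sds)

# Both the manual rescaling and the built-in cov2cor() should reproduce cor()
manual_ok  <- isTRUE(all.equal(manual_cor, cor_mat, check.attributes = FALSE))
cov2cor_ok <- isTRUE(all.equal(cov2cor(cov_mat), cor_mat, check.attributes = FALSE))

print(manual_ok)
print(cov2cor_ok)
```

Because covariance depends on the units of measurement, correlation is usually the easier of the two to compare across attribute pairs.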

Statistical tests are performed to assess the significance of results; here we demonstrate how to use a t-test to determine whether two samples differ statistically. In this example, we perform a t.test on the petal width of irises in either the setosa or versicolor species. If we obtain a p-value less than 0.05, we can reject the null hypothesis of equal means at the 5 percent significance level and conclude that the mean petal width differs significantly between setosa and versicolor:

        > t.test(iris$Petal.Width[iris$Species=="setosa"], 
        +        iris$Petal.Width[iris$Species=="versicolor"])
        Output:
        
        Welch Two Sample t-test
        
        data:  iris$Petal.Width[iris$Species == "setosa"] and iris$Petal.Width[iris$Species == "versicolor"]
        t = -34.0803, df = 74.755, p-value < 2.2e-16
        alternative hypothesis: true difference in means is not equal to 0
        95 percent confidence interval:
         -1.143133 -1.016867
        sample estimates:
        mean of x mean of y 
            0.246     1.326 
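
The object returned by t.test can also be inspected programmatically rather than read from the printed report. This is a minimal sketch; the variable names and the 0.05 threshold stored in alpha are assumptions chosen for illustration:

```r
data(iris)
setosa     <- iris$Petal.Width[iris$Species == "setosa"]
versicolor <- iris$Petal.Width[iris$Species == "versicolor"]

# t.test returns a list of class "htest"
result <- t.test(setosa, versicolor)

# Individual components are available by name
p_value   <- result$p.value    # p-value of the test
conf_int  <- result$conf.int   # confidence interval for the difference in means
group_avg <- result$estimate   # sample means of the two groups

# Compare the p-value against a chosen significance level
alpha <- 0.05
significant <- p_value < alpha
print(significant)
```

Extracting components this way is convenient when the same test has to be repeated over many pairs of samples.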

Similarly, you can perform a correlation test between the sepal length and the sepal width of an iris to obtain the correlation coefficient between the two variables together with a p-value. The closer the coefficient is to 1, the stronger the positive correlation; the closer it is to -1, the stronger the negative correlation:

        > cor.test(iris$Sepal.Length, iris$Sepal.Width)
        Output:
        Pearson's product-moment correlation

        data:  iris$Sepal.Length and iris$Sepal.Width
        t = -1.4403, df = 148, p-value = 0.1519
        alternative hypothesis: true correlation is not equal to 0
        95 percent confidence interval:
         -0.27269325  0.04351158
        sample estimates:
               cor 
        -0.1175698 
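
To connect this test with the correlation matrix computed earlier, the sketch below compares the estimate reported by cor.test with the value returned by cor. The variable names and the 0.05 threshold are illustrative assumptions:

```r
data(iris)
ct <- cor.test(iris$Sepal.Length, iris$Sepal.Width)

# The estimate is the same Pearson coefficient that cor() returns
same_estimate <- isTRUE(all.equal(unname(ct$estimate),
                                  cor(iris$Sepal.Length, iris$Sepal.Width)))

# With a p-value above 0.05, the correlation is not significant at that level
not_significant <- ct$p.value > 0.05

print(same_estimate)
print(not_significant)
```

The added value of cor.test over cor is the p-value and confidence interval, which indicate whether a coefficient of this size could plausibly arise by chance.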

How it works...

R has built-in statistical functions that enable the user to compute descriptive statistics on a single variable. The recipe first introduces how to apply mean, sd, var, min, max, median, range, and quantile to a single variable. To apply a statistic to all four numeric variables, one can use the sapply function, while summary reports a standard set of statistics for every column. To examine the relationships between multiple variables, one can compute correlation and covariance matrices. Finally, the recipe shows how to determine whether two given samples differ statistically by performing a t-test, and how to test the significance of a correlation.

There's more...

If you need to compute summary statistics for data in different groups, you can use the aggregate function or the reshape package to summarize each subset of the data:

Use aggregate to calculate the mean of each iris attribute, grouped by species:

        > aggregate(x=iris[,1:4],by=list(iris$Species),FUN=mean)
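
The aggregate function also has a formula interface, which some readers find easier to read. This is an alternative sketch rather than the call used in the recipe; the name means is illustrative:

```r
data(iris)

# ". ~ Species" means: summarize every other column, grouped by Species
means <- aggregate(. ~ Species, data = iris, FUN = mean)

print(means)
```

With the formula interface the grouping column keeps its own name (Species) in the result, instead of the default Group.1 label produced by the by argument.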

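Use the reshape package to calculate the mean of each iris attribute, grouped by species. In this call, the subset argument restricts the result to the setosa and versicolor species, and margins adds a grand total row: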

        > library(reshape)
        > iris.melt <- melt(iris, id='Species')
        > cast(Species~variable, data=iris.melt, mean,
        +      subset=Species %in% c('setosa','versicolor'),
        +      margins='grand_row')

For more information on the reshape package and the aggregate function, refer to the help documents by using help(package = "reshape") or ?aggregate.
