Complementary statistical tests

Here, one model is selected over another plausible one: the accuracy of one model appears higher than that of the other, or the area under curve (AUC) of the ROC of one model is greater than that of another. However, it is not appropriate to base such conclusions on the raw numbers alone. It is important to establish whether the numbers hold significance from the point of view of statistical inference. In the analytical world, it is pivotal that we make use of statistical tests, whenever they are available, to validate claims and hypotheses. One reason for using statistical tests is that probability can be highly counterintuitive, and what appears on the surface might not be the case upon closer inspection, after incorporating the chance variation. For instance, if a fair coin is tossed 100 times, it is imprudent to expect the number of heads to be exactly 50. Hence, if a fair coin shows 45 heads, we need to allow for the chance variation under which the number of heads can be less than 50 too. Caution must be exercised whenever we are dealing with uncertain data. A few examples are in order here. Two variables might appear to be independent of each other, with a correlation nearly equal to zero, and yet the correlation test might conclude that the correlation is significantly different from zero. Since we will be sampling and resampling a lot in this text, we will look at the related tests.
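
To make the coin-tossing example concrete, here is a minimal sketch using the exact binomial test from base R; the count of 45 heads in 100 tosses is the hypothetical one from the preceding paragraph:

> # Exact test of H0: P(heads) = 0.5, given 45 heads in 100 tosses
> binom.test(45, 100, p = 0.5)

The two-sided p-value is approximately 0.37, so observing 45 heads in 100 tosses is entirely consistent with a fair coin.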

Permutation test

Suppose that we have two processes, A and B, whose variances are equal but unknown. Three independent observations from process A give yields of 18, 20, and 22, while three independent observations from process B give yields of 24, 26, and 28. Under the assumption that the yield follows a normal distribution, we would like to test whether the means of processes A and B are the same. This is a suitable case for applying the t-test, since the number of observations is small. An application of the t.test function shows that the two means are different from each other, which intuitively appears to be the case.

Now, the assumption under the null hypothesis is that the means are equal, and the variance, though unknown, is assumed to be the same for the two processes. Consequently, we have a genuine reason to believe that the observations from process A might well have occurred in process B too, and vice versa. We can therefore swap one observation in process B with one in process A and recompute the t-test. The process can be repeated for all possible permutations of the two samples. In general, if we have m samples from population 1 and n samples from population 2, we can have $\binom{m+n}{m}$ different samples, and as many tests. An overall test can be based on such permutation samples, and such tests are called permutation tests.
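
For the process A and B data, with m = n = 3, this count is easily verified in base R:

> choose(3 + 3, 3)   # number of distinct permutation samples for m = n = 3
[1] 20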

For the process A and B observations, we will first apply the t-test and then the permutation test. The t.test function is available in the core stats package, while the permutation t-test is taken from the perm package:

> library(perm)
> x <- c(18,20,22); y <- c(24,26,28)
> t.test(x,y,var.equal = TRUE)
Two Sample t-test
data:  x and y
t = -3.6742346, df = 4, p-value = 0.02131164
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.533915871  -1.466084129
sample estimates:
mean of x mean of y 
       20        26 

The small p-value suggests that the means of processes A and B are not equal. Next, we apply the permutation test, using the permTS function from the perm package:

> permTS(x,y)
Exact Permutation Test (network algorithm)
data:  x and y
p-value = 0.1
alternative hypothesis: true mean x - mean y is not equal to 0
sample estimates:
mean x - mean y 
             -6 
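
The exact permutation p-value of 0.1 can be reproduced by brute force. The following is a minimal base R sketch of our own, not from the book's code, which reuses the x and y vectors defined above and enumerates all 20 splits of the pooled data:

> pooled <- c(x, y)
> obs_diff <- mean(x) - mean(y)
> # All choose(6, 3) = 20 ways of assigning three observations to "x"
> splits <- combn(length(pooled), length(x))
> perm_diffs <- apply(splits, 2, function(idx)
+   mean(pooled[idx]) - mean(pooled[-idx]))
> # Two-sided p-value: proportion of splits at least as extreme as observed
> mean(abs(perm_diffs) >= abs(obs_diff))
[1] 0.1

Only the observed split and its mirror image attain an absolute mean difference of 6, hence the p-value of 2/20 = 0.1.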

The p-value is now 0.1, which means that the permutation test fails to reject the hypothesis that the means of the two processes are equal, at the usual 5% significance level. Does this mean that the permutation test will always lead to this conclusion, contradicting the t-test? The answer is given in the next code segment:

> x2 <- c(16,18,20,22); y2 <- c(24,26,28,30)
> t.test(x2,y2,var.equal = TRUE)
Two Sample t-test
data:  x2 and y2
t = -4.3817805, df = 6, p-value = 0.004659215
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -12.46742939  -3.53257061
sample estimates:
mean of x mean of y 
       19        27 
> permTS(x2,y2)
Exact Permutation Test (network algorithm)
data:  x2 and y2
p-value = 0.02857143
alternative hypothesis: true mean x2 - mean y2 is not equal to 0
sample estimates:
mean x2 - mean y2 
               -8 

With four observations in each sample, there are choose(8, 4) = 70 distinct splits, and the permutation test now rejects the equality of means with a p-value of 2/70 ≈ 0.029, agreeing with the t-test. The earlier discrepancy arose simply because, with three observations per sample, the smallest two-sided p-value the exact permutation test can produce is 2/20 = 0.1.

Chi-square and McNemar test

We had five models for the hypothyroid problem. We then calculated their accuracy and were satisfied with the numbers. Let's first look at the number of errors each fitted model makes. We have 636 observations in the test partition, and 42 of them test positive for the hypothyroid problem. Note that if we simply marked every patient as negative, we would get an accuracy of 1 - 42/636 = 0.934, or about 93.4%. Using the table function, we pit the actuals against the predicted values and see how often each fitted model goes wrong. We remark here that identifying the hypothyroid cases as hypothyroid and the negative cases as negative constitutes correct prediction, while marking a hypothyroid case as negative, and vice versa, leads to errors. For each model, we look at the misclassification errors:

> table(LR_Predict_Bin,testY_numeric)
              testY_numeric
LR_Predict_Bin   1   2
             1  32   7
             2  10 587
> table(NN_Predict,HT2_TestY)
             HT2_TestY
NN_Predict    hypothyroid negative
  hypothyroid          41       22
  negative              1      572
> table(NB_predict,HT2_TestY)
             HT2_TestY
NB_predict    hypothyroid negative
  hypothyroid          33        8
  negative              9      586
> table(CT_predict,HT2_TestY)
             HT2_TestY
CT_predict    hypothyroid negative
  hypothyroid          38        4
  negative              4      590
> table(SVM_predict,HT2_TestY)
             HT2_TestY
SVM_predict   hypothyroid negative
  hypothyroid          34        2
  negative              8      592
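
As a quick sanity check, the accuracy of any model can be read off its table as the proportion of diagonal entries. The following is a minimal sketch of our own, with the SVM counts above keyed in by hand:

> # Rows are predictions, columns are actuals, as in the tables above
> cm <- matrix(c(34, 8, 2, 592), nrow = 2,
+              dimnames = list(predicted = c("hypothyroid", "negative"),
+                              actual = c("hypothyroid", "negative")))
> sum(diag(cm)) / sum(cm)   # overall accuracy of the SVM model
[1] 0.9842767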

From the misclassification tables, we can see that the neural network identifies 41 out of the 42 cases of hypothyroid correctly, but it also flags far more negative cases as hypothyroid than the other models do. The question that arises is whether the correct predictions of the fitted models occur merely by chance, or whether they reflect the truth and can be explained. To test this, in the hypothesis-testing framework we would like to test whether the actual and predicted values are independent of each other. Technically, the null hypothesis is that the prediction is independent of the actual value, and if a model explains the truth, the null hypothesis must be rejected; we should then conclude that the fitted model's predictions depend on the truth. We deploy two solutions here: the chi-square test and the McNemar test:

> chisq.test(table(LR_Predict_Bin,testY_numeric))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(LR_Predict_Bin, testY_numeric)
X-squared = 370.53501, df = 1, p-value < 0.00000000000000022204
> chisq.test(table(NN_Predict,HT2_TestY))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(NN_Predict, HT2_TestY)
X-squared = 377.22569, df = 1, p-value < 0.00000000000000022204
> chisq.test(table(NB_predict,HT2_TestY))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(NB_predict, HT2_TestY)
X-squared = 375.18659, df = 1, p-value < 0.00000000000000022204
> chisq.test(table(CT_predict,HT2_TestY))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(CT_predict, HT2_TestY)
X-squared = 498.44791, df = 1, p-value < 0.00000000000000022204
> chisq.test(table(SVM_predict,HT2_TestY))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(SVM_predict, HT2_TestY)
X-squared = 462.41803, df = 1, p-value < 0.00000000000000022204
> mcnemar.test(table(LR_Predict_Bin,testY_numeric))
McNemar's Chi-squared test with continuity correction
data:  table(LR_Predict_Bin, testY_numeric)
McNemar's chi-squared = 0.23529412, df = 1, p-value = 0.6276258
> mcnemar.test(table(NN_Predict,HT2_TestY))
McNemar's Chi-squared test with continuity correction
data:  table(NN_Predict, HT2_TestY)
McNemar's chi-squared = 17.391304, df = 1, p-value = 0.00003042146
> mcnemar.test(table(NB_predict,HT2_TestY))
McNemar's Chi-squared test with continuity correction
data:  table(NB_predict, HT2_TestY)
McNemar's chi-squared = 0, df = 1, p-value = 1
> mcnemar.test(table(CT_predict,HT2_TestY))
McNemar's Chi-squared test
data:  table(CT_predict, HT2_TestY)
McNemar's chi-squared = 0, df = 1, p-value = 1
> mcnemar.test(table(SVM_predict,HT2_TestY))
McNemar's Chi-squared test with continuity correction
data:  table(SVM_predict, HT2_TestY)
McNemar's chi-squared = 2.5, df = 1, p-value = 0.1138463
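
Unlike the chi-square test, McNemar's test uses only the two off-diagonal (disagreement) cells of the table. As a minimal hand computation of our own, the continuity-corrected statistic for the neural network table, with disagreement counts 22 and 1, reproduces the reported value:

> n12 <- 22; n21 <- 1   # off-diagonal counts from the NN_Predict table
> (abs(n12 - n21) - 1)^2 / (n12 + n21)
[1] 17.3913

The corresponding p-value, 1 - pchisq(17.3913, df = 1), recovers the reported value of about 0.00003.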

The answers provided by the chi-square tests clearly show that the predictions of each fitted model are not down to chance; the predicted labels of the hypothyroid cases, as well as those of the negative cases, depend on the actual ones. The interpretation of, and conclusions from, McNemar's test are left to the reader. The final important measure in classification problems is the ROC curve, which is considered next.

ROC test

The ROC curve is an important improvement on single summary measures of model performance, such as the false positive and true positive rates. For a detailed explanation, refer to Chapter 9 of Tattar et al. (2017). The ROC curve plots the true positive rate against the false positive rate, and we measure the area under the curve (AUC) for the fitted model.

The main question that the ROC test attempts to answer is the following. Suppose that Model 1 gives an AUC of 0.89 and Model 2 gives 0.91. Using the simple AUC criterion, we would outright conclude that Model 2 is better than Model 1. However, an important question arises: is 0.91 significantly higher than 0.89? The roc.test function, from the pROC R package, provides the answer here. For the neural network and the classification tree, the following R segment gives the required answer:

> library(pROC)
> HT_NN_Prob <- predict(NN_fit,newdata=HT2_TestX,type="raw")
> HT_NN_roc <- roc(HT2_TestY,c(HT_NN_Prob))
> HT_NN_roc$auc
Area under the curve: 0.9723826
> HT_CT_Prob <- predict(CT_fit,newdata=HT2_TestX,type="prob")[,2]
> HT_CT_roc <- roc(HT2_TestY,HT_CT_Prob)
> HT_CT_roc$auc
Area under the curve: 0.9598765
> roc.test(HT_NN_roc,HT_CT_roc)
	DeLong's test for two correlated ROC curves
data:  HT_NN_roc and HT_CT_roc
Z = 0.72452214, p-value = 0.4687452
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
 AUC of roc1  AUC of roc2 
0.9723825557 0.9598765432 

Since the p-value is very large, we conclude that the AUCs of the two models are not significantly different.
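
Although the segment above stops at the test, the fitted roc objects can also be plotted to compare the two curves visually. The following is a minimal sketch using pROC's plot method; the colors and legend placement are our own choices:

> plot(HT_NN_roc, col = "blue")             # neural network ROC curve
> plot(HT_CT_roc, col = "red", add = TRUE)  # overlay the classification tree
> legend("bottomright", legend = c("Neural network", "Classification tree"),
+        col = c("blue", "red"), lty = 1)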

Statistical tests are vital, and we recommend that they be used whenever suitable. The concepts highlighted in this chapter will be drawn on in more detail throughout the rest of the book.