
Chapter 7. Combining Different Models for Ensemble Learning

In the previous chapter, we focused on the best practices for tuning and evaluating different models for classification. In this chapter, we will build upon these techniques and explore different methods for constructing a set of classifiers that can often have a better predictive performance than any of its individual members. You will learn how to:

  • Make predictions based on majority voting
  • Reduce overfitting by drawing random combinations of the training set with repetition
  • Build powerful models from weak learners that learn from their mistakes

Learning with ensembles

The goal behind ensemble methods is to combine different classifiers into a meta-classifier that has a better generalization performance than each individual classifier alone. For example, assuming that we collected predictions from 10 experts, ensemble methods would allow us to strategically combine their predictions to come up with a prediction that is more accurate and robust than the prediction of each individual expert. As we will see later in this chapter, there are several different approaches for creating an ensemble of classifiers. In this section, we will introduce a basic intuition for how ensembles work and why they are typically recognized for yielding a good generalization performance.

In this chapter, we will focus on the most popular ensemble methods that use the majority voting principle. Majority voting simply means that we select the class label that has been predicted by the majority of classifiers, that is, received more than 50 percent of the votes. Strictly speaking, the term majority vote refers to binary class settings only. However, it is easy to generalize the majority voting principle to multi-class settings, which is called plurality voting. Here, we select the class label that received the most votes (mode). The following diagram illustrates the concept of majority and plurality voting for an ensemble of 10 classifiers where each unique symbol (triangle, square, and circle) represents a unique class label:

[Figure: Majority and plurality voting for an ensemble of 10 classifiers, where each symbol (triangle, square, circle) represents a unique class label]
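
As a quick warm-up, here is a minimal sketch of plurality voting on a made-up list of predicted labels (this snippet is our own illustration; it simply picks the mode):

>>> from collections import Counter
>>> votes = ['circle', 'triangle', 'circle', 'square', 'circle']
>>> # plurality vote: select the label that received the most votes (the mode)
>>> Counter(votes).most_common(1)[0][0]
'circle'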

Using the training set, we start by training $m$ different classifiers ($C_1, \dots, C_m$). Depending on the technique, the ensemble can be built from different classification algorithms, for example, decision trees, support vector machines, logistic regression classifiers, and so on. Alternatively, we can also use the same base classification algorithm and fit it to different subsets of the training set. One prominent example of this approach is the random forest algorithm, which combines different decision tree classifiers. The following diagram illustrates the concept of a general ensemble approach using majority voting:

[Figure: A general ensemble approach that combines the predictions of m classifiers via majority voting]

To predict a class label via simple majority or plurality voting, we combine the predicted class labels of each individual classifier $C_j$ and select the class label $\hat{y}$ that received the most votes:

$$\hat{y} = \text{mode}\{C_1(\mathbf{x}), C_2(\mathbf{x}), \dots, C_m(\mathbf{x})\}$$

For example, in a binary classification task where $\text{class}_1 = -1$ and $\text{class}_2 = +1$, we can write the majority vote prediction as follows:

$$C(\mathbf{x}) = \operatorname{sign}\left[\sum_{j} C_j(\mathbf{x})\right] = \begin{cases} 1 & \text{if } \sum_{j} C_j(\mathbf{x}) \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

To illustrate why ensemble methods can work better than individual classifiers alone, let's apply the simple concepts of combinatorics. For the following example, we make the assumption that all $n$ base classifiers for a binary classification task have an equal error rate $\varepsilon$. Furthermore, we assume that the classifiers are independent and the error rates are not correlated. Under those assumptions, we can simply express the error probability of an ensemble of base classifiers as a probability mass function of a binomial distribution:

$$P(y \geq k) = \sum_{k}^{n} \binom{n}{k} \varepsilon^{k} (1-\varepsilon)^{n-k} = \varepsilon_{\text{ensemble}}$$

Here, $\binom{n}{k}$ is the binomial coefficient n choose k. In other words, we compute the probability that the prediction of the ensemble is wrong. Now let's take a look at a more concrete example of 11 base classifiers ($n = 11$) with an error rate of 0.25 ($\varepsilon = 0.25$):

$$P(y \geq 6) = \sum_{k=6}^{11} \binom{11}{k} 0.25^{k} (1-0.25)^{11-k} = 0.034$$

As we can see, the error rate of the ensemble (0.034) is much lower than the error rate of each individual classifier (0.25) if all the assumptions are met. Note that, in this simplified illustration, a 50-50 split by an even number of classifiers n is treated as an error, whereas this is only true half of the time. To compare such an idealistic ensemble classifier to a base classifier over a range of different base error rates, let's implement the probability mass function in Python:

>>> from scipy.misc import comb
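>>> # note: in newer SciPy versions, comb is available as scipy.special.comb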
>>> import math
>>> def ensemble_error(n_classifier, error):
...     k_start = math.ceil(n_classifier / 2.0)
...     probs = [comb(n_classifier, k) * 
...              error**k * 
...              (1-error)**(n_classifier - k) 
...              for k in range(k_start, n_classifier + 1)]
...     return sum(probs)
>>> ensemble_error(n_classifier=11, error=0.25)
0.034327507019042969

After we've implemented the ensemble_error function, we can compute the ensemble error rates for a range of different base errors from 0.0 to 1.0 to visualize the relationship between ensemble and base errors in a line graph:

>>> import numpy as np
>>> error_range = np.arange(0.0, 1.01, 0.01)
>>> ens_errors = [ensemble_error(n_classifier=11, error=error) 
...               for error in error_range]
>>> import matplotlib.pyplot as plt
>>> plt.plot(error_range, ens_errors, 
...          label='Ensemble error', 
...          linewidth=2)
>>> plt.plot(error_range, error_range, 
...          linestyle='--', label='Base error',
...          linewidth=2)
>>> plt.xlabel('Base error')
>>> plt.ylabel('Base/Ensemble error')
>>> plt.legend(loc='upper left')
>>> plt.grid()
>>> plt.show()

As we can see in the resulting plot, the error probability of an ensemble is always better than the error of an individual base classifier as long as the base classifiers perform better than random guessing ($\varepsilon < 0.5$). Note that the y-axis depicts the base error (dashed line) as well as the ensemble error (continuous line):

[Figure: Ensemble error versus base error for an ensemble of 11 classifiers]

Implementing a simple majority vote classifier

After the short introduction to ensemble learning in the previous section, let's start with a warm-up exercise and implement a simple ensemble classifier for majority voting in Python. Although the following algorithm also generalizes to multi-class settings via plurality voting, we will use the term majority voting for simplicity, as is also often done in the literature.

The algorithm that we are going to implement will allow us to combine different classification algorithms associated with individual weights for confidence. Our goal is to build a stronger meta-classifier that balances out the individual classifiers' weaknesses on a particular dataset. In more precise mathematical terms, we can write the weighted majority vote as follows:

$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j\, \chi_A\!\left(C_j(\mathbf{x}) = i\right)$$

Here, $w_j$ is a weight associated with a base classifier $C_j$, $\hat{y}$ is the predicted class label of the ensemble, $\chi_A$ (Greek chi) is the characteristic function $\chi_A\!\left(C_j(\mathbf{x}) = i \in A\right)$, and $A$ is the set of unique class labels. For equal weights, we can simplify this equation and write it as follows:

$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} \chi_A\!\left(C_j(\mathbf{x}) = i\right) = \text{mode}\{C_1(\mathbf{x}), C_2(\mathbf{x}), \dots, C_m(\mathbf{x})\}$$

To better understand the concept of weighting, we will now take a look at a more concrete example. Let's assume that we have an ensemble of three base classifiers $C_j$ ($j \in \{1, 2, 3\}$) and want to predict the class label of a given sample instance $\mathbf{x}$. Two out of three base classifiers predict the class label 0, and one, $C_3$, predicts that the sample belongs to class 1. If we weight the predictions of each base classifier equally, the majority vote will predict that the sample belongs to class 0:

$$C_1(\mathbf{x}) \rightarrow 0, \quad C_2(\mathbf{x}) \rightarrow 0, \quad C_3(\mathbf{x}) \rightarrow 1$$
$$\hat{y} = \text{mode}\{0, 0, 1\} = 0$$

Now let's assign a weight of 0.6 to $C_3$ and weight $C_1$ and $C_2$ by a coefficient of 0.2, respectively:

$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j\, \chi_A\!\left(C_j(\mathbf{x}) = i\right) = \arg\max_{i} \left[0.2 \times i_0 + 0.2 \times i_0 + 0.6 \times i_1\right] = 1$$

More intuitively, since $3 \times 0.2 = 0.6$, we can say that the prediction made by $C_3$ has three times more weight than the predictions by $C_1$ or $C_2$, respectively. We can write this as follows:

$$\hat{y} = \text{mode}\{0, 0, 1, 1, 1\} = 1$$

To translate the concept of the weighted majority vote into Python code, we can use NumPy's convenient argmax and bincount functions:

>>> import numpy as np
>>> np.argmax(np.bincount([0, 0, 1], 
...           weights=[0.2, 0.2, 0.6]))
1
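
For comparison, the following sketch (our own helper, not part of the book's code) evaluates the weighted-vote formula literally by summing the weights assigned to each candidate class label:

>>> def weighted_vote(predictions, weights, labels=(0, 1)):
...     # sum of w_j * chi_A(C_j(x) == i) for each candidate label i
...     scores = [sum(w for p, w in zip(predictions, weights) if p == i)
...               for i in labels]
...     return labels[np.argmax(scores)]
>>> weighted_vote([0, 0, 1], [0.2, 0.2, 0.6])
1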

As discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, certain classifiers in scikit-learn can also return the probability of a predicted class label via the predict_proba method. Using the predicted class probabilities instead of the class labels for majority voting can be useful if the classifiers in our ensemble are well calibrated. The modified version of the majority vote for predicting class labels from probabilities can be written as follows:

$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j\, p_{ij}$$

Here, $p_{ij}$ is the predicted probability of the $j$th classifier for class label $i$.

To continue with our previous example, let's assume that we have a binary classification problem with class labels $i \in \{0, 1\}$ and an ensemble of three classifiers $C_j$ ($j \in \{1, 2, 3\}$). Let's assume that the classifiers $C_j$ return the following class membership probabilities for a particular sample $\mathbf{x}$:

$$C_1(\mathbf{x}) \rightarrow [0.9, 0.1], \quad C_2(\mathbf{x}) \rightarrow [0.8, 0.2], \quad C_3(\mathbf{x}) \rightarrow [0.4, 0.6]$$

We can then calculate the individual class probabilities as follows:

$$p(i_0 \mid \mathbf{x}) = 0.2 \times 0.9 + 0.2 \times 0.8 + 0.6 \times 0.4 = 0.58$$
$$p(i_1 \mid \mathbf{x}) = 0.2 \times 0.1 + 0.2 \times 0.2 + 0.6 \times 0.6 = 0.42$$
$$\hat{y} = \arg\max_{i}\left[p(i_0 \mid \mathbf{x}),\, p(i_1 \mid \mathbf{x})\right] = 0$$

To implement the weighted majority vote based on class probabilities, we can again make use of NumPy's np.average and np.argmax functions:

>>> ex = np.array([[0.9, 0.1],
...                [0.8, 0.2],
...                [0.4, 0.6]])
>>> p = np.average(ex, axis=0, weights=[0.2, 0.2, 0.6])
>>> p
array([ 0.58,  0.42])
>>> np.argmax(p)
0

Putting everything together, let's now implement a MajorityVoteClassifier in Python:

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.externals import six
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator


class MajorityVoteClassifier(BaseEstimator,
                             ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
      Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'}
      Default: 'classlabel'
      If 'classlabel' the prediction is based on
      the argmax of class labels. Else if
      'probability', the argmax of the sum of
      probabilities is used to predict the class label
      (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers]
      Optional, default: None
      If a list of `int` or `float` values are
      provided, the classifiers are weighted by
      importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers,
                 vote='classlabel', weights=None):

        self.classifiers = classifiers
        self.named_classifiers = {key: value for
                                  key, value in
                                  _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        # Use LabelEncoder to ensure class labels start
        # with 0, which is important for np.argmax
        # call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X,
                              self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self

I have added comments to the code to make the individual parts easier to follow. However, before we implement the remaining methods, let's take a quick break and discuss some of the code that may look confusing at first. We used the parent classes BaseEstimator and ClassifierMixin to get some base functionality for free, including the get_params and set_params methods to set and return the classifier's parameters, as well as the score method to calculate the prediction accuracy. Also note that we imported six to make the MajorityVoteClassifier compatible with Python 2.7.

Next we will add the predict method to predict the class label via majority vote based on the class labels if we initialize a new MajorityVoteClassifier object with vote='classlabel'. Alternatively, we will be able to initialize the ensemble classifier with vote='probability' to predict the class label based on the class membership probabilities. Furthermore, we will also add a predict_proba method to return the averaged probabilities, which is useful for computing the receiver operating characteristic area under the curve (ROC AUC).

    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            Shape = [n_samples, n_features]
            Matrix of training samples.

        Returns
        ----------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.

        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X),
                                 axis=1)
        else:  # 'classlabel' vote

            #  Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in
                                      self.classifiers_]).T

            maj_vote = np.apply_along_axis(
                           lambda x:
                           np.argmax(np.bincount(x,                                             
                                        weights=self.weights)),
                           axis=1,
                           arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Training vectors, where n_samples is
            the number of samples and
            n_features is the number of features.

        Returns
        ----------
        avg_proba : array-like,
            shape = [n_samples, n_classes]
            Weighted average probability for
            each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas, 
                               axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier,
                         self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in \
                    six.iteritems(self.named_classifiers):
                for key, value in six.iteritems(
                        step.get_params(deep=True)):
                    out['%s__%s' % (name, key)] = value
            return out

Also, note that we defined our own modified version of the get_params method to use the _name_estimators function in order to access the parameters of the individual classifiers in the ensemble. This may look a little complicated at first, but it will make perfect sense when we use grid search for hyperparameter tuning in later sections.

Note

Although our MajorityVoteClassifier implementation is very useful for demonstration purposes, I also implemented a more sophisticated version of the majority vote classifier in scikit-learn. It will become available as sklearn.ensemble.VotingClassifier in the next release version (v0.17).
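
For reference, here is a minimal sketch of how that estimator can be used once it is available (assuming scikit-learn 0.17 or later; the estimator names 'lr' and 'dt' are arbitrary):

>>> from sklearn.ensemble import VotingClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.tree import DecisionTreeClassifier
>>> vc = VotingClassifier(
...          estimators=[('lr', LogisticRegression()),
...                      ('dt', DecisionTreeClassifier())],
...          voting='soft')  # average predict_proba instead of counting labels
>>> # vc.fit(X_train, y_train) would then behave much like our implementation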

Combining different algorithms for classification with majority vote

Now it is about time to put the MajorityVoteClassifier that we implemented in the previous section into action. But first, let's prepare a dataset that we can test it on. Since we are already familiar with techniques to load datasets from CSV files, we will take a shortcut and load the Iris dataset from scikit-learn's dataset module. Furthermore, we will only select two features, sepal width and petal length, to make the classification task more challenging. Although our MajorityVoteClassifier generalizes to multiclass problems, we will only classify flower samples from the two classes, Iris-Versicolor and Iris-Virginica, to compute the ROC AUC. The code is as follows:

>>> from sklearn import datasets
>>> from sklearn.cross_validation import train_test_split
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.preprocessing import LabelEncoder
>>> iris = datasets.load_iris()
>>> X, y = iris.data[50:, [1, 2]], iris.target[50:]
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)

Note

Note that scikit-learn uses the predict_proba method (if applicable) to compute the ROC AUC score. In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we saw how the class probabilities are computed in logistic regression models. In decision trees, the probabilities are calculated from a frequency vector that is created for each node at training time. The vector collects the frequency values of each class label computed from the class label distribution at that node. Then the frequencies are normalized so that they sum up to 1. Similarly, the class labels of the k-nearest neighbors are aggregated to return the normalized class label frequencies in the k-nearest neighbors algorithm. Although the normalized probabilities returned by both the decision tree and k-nearest neighbors classifier may look similar to the probabilities obtained from a logistic regression model, we have to be aware that these are actually not derived from probability mass functions.
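
To see the frequency-based nature of these probabilities for yourself, a quick check along the following lines (our own illustration) fits a shallow decision tree on the two-feature Iris subset and inspects predict_proba:

>>> from sklearn.tree import DecisionTreeClassifier
>>> dt = DecisionTreeClassifier(max_depth=1).fit(X, y)
>>> # each row contains the class-label frequencies of the leaf that the
>>> # corresponding sample falls into, normalized to sum to 1
>>> dt.predict_proba(X[:3])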

Next we split the Iris samples into 50 percent training and 50 percent test data:

>>> X_train, X_test, y_train, y_test =\
...        train_test_split(X, y, 
...                         test_size=0.5, 
...                         random_state=1)

Using the training dataset, we now will train three different classifiers—a logistic regression classifier, a decision tree classifier, and a k-nearest neighbors classifier—and look at their individual performances via a 10-fold cross-validation on the training dataset before we combine them into an ensemble classifier:

>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.neighbors import KNeighborsClassifier 
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> clf1 = LogisticRegression(penalty='l2', 
...                           C=0.001, 
...                           random_state=0)
>>> clf2 = DecisionTreeClassifier(max_depth=1, 
...                               criterion='entropy', 
...                               random_state=0)
>>> clf3 = KNeighborsClassifier(n_neighbors=1, 
...                             p=2, 
...                             metric='minkowski')
>>> pipe1 = Pipeline([['sc', StandardScaler()],
...                   ['clf', clf1]])
>>> pipe3 = Pipeline([['sc', StandardScaler()],
...                   ['clf', clf3]])
>>> clf_labels = ['Logistic Regression', 'Decision Tree', 'KNN']
>>> print('10-fold cross validation:\n')
>>> for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
...     scores = cross_val_score(estimator=clf, 
...                              X=X_train, 
...                              y=y_train, 
...                              cv=10, 
...                              scoring='roc_auc')
...     print("ROC AUC: %0.2f (+/- %0.2f) [%s]" 
...                % (scores.mean(), scores.std(), label))

The output that we receive, as shown in the following snippet, shows that the predictive performances of the individual classifiers are almost equal:

10-fold cross validation:

ROC AUC: 0.92 (+/- 0.20) [Logistic Regression]
ROC AUC: 0.92 (+/- 0.15) [Decision Tree]
ROC AUC: 0.93 (+/- 0.10) [KNN]

You may be wondering why we trained the logistic regression and k-nearest neighbors classifier as part of a pipeline. The reason behind it is that, as discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, both the logistic regression and k-nearest neighbors algorithms (using the Euclidean distance metric) are not scale-invariant in contrast with decision trees. Although the Iris features are all measured on the same scale (cm), it is a good habit to work with standardized features.

Now let's move on to the more exciting part and combine the individual classifiers for majority rule voting in our MajorityVoteClassifier:

>>> mv_clf = MajorityVoteClassifier(
...                 classifiers=[pipe1, clf2, pipe3])
>>> clf_labels += ['Majority Voting']
>>> all_clf = [pipe1, clf2, pipe3, mv_clf]
>>> for clf, label in zip(all_clf, clf_labels):
...     scores = cross_val_score(estimator=clf, 
...                              X=X_train, 
...                              y=y_train, 
...                              cv=10, 
...                              scoring='roc_auc')
...     print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
...                % (scores.mean(), scores.std(), label))
ROC AUC: 0.92 (+/- 0.20) [Logistic Regression]
ROC AUC: 0.92 (+/- 0.15) [Decision Tree]
ROC AUC: 0.93 (+/- 0.10) [KNN]
ROC AUC: 0.97 (+/- 0.10) [Majority Voting]

As we can see, the performance of the MajorityVoteClassifier has substantially improved over the individual classifiers in the 10-fold cross-validation evaluation.

Evaluating and tuning the ensemble classifier

In this section, we are going to compute the ROC curves from the test set to check if the MajorityVoteClassifier generalizes well to unseen data. We should remember that the test set is not to be used for model selection; its only purpose is to report an unbiased estimate of the generalization performance of a classifier system. The code is as follows:

>>> from sklearn.metrics import roc_curve
>>> from sklearn.metrics import auc
>>> colors = ['black', 'orange', 'blue', 'green']
>>> linestyles = [':', '--', '-.', '-']
>>> for clf, label, clr, ls \
...         in zip(all_clf, clf_labels, colors, linestyles):
...     # assuming the label of the positive class is 1
...     y_pred = clf.fit(X_train, 
...                      y_train).predict_proba(X_test)[:, 1]
...     fpr, tpr, thresholds = roc_curve(y_true=y_test, 
...                                      y_score=y_pred)
...     roc_auc = auc(x=fpr, y=tpr)
...     plt.plot(fpr, tpr, 
...              color=clr, 
...              linestyle=ls, 
...              label='%s (auc = %0.2f)' % (label, roc_auc))
>>> plt.legend(loc='lower right')
>>> plt.plot([0, 1], [0, 1], 
...          linestyle='--', 
...          color='gray', 
...          linewidth=2)
>>> plt.xlim([-0.1, 1.1])
>>> plt.ylim([-0.1, 1.1])
>>> plt.grid()
>>> plt.xlabel('False Positive Rate')
>>> plt.ylabel('True Positive Rate')
>>> plt.show()

As we can see in the resulting ROC, the ensemble classifier also performs well on the test set (ROC AUC = 0.95), whereas the k-nearest neighbors classifier seems to be overfitting the training data (training ROC AUC = 0.93, test ROC AUC = 0.86):

[Figure: ROC curves of the individual classifiers and the majority vote ensemble on the test set]
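
If you want to verify the overfitting of the k-nearest neighbors pipeline numerically, a quick check along the following lines (our own snippet, using roc_auc_score) compares its training and test ROC AUC; the exact values depend on the train/test split:

>>> from sklearn.metrics import roc_auc_score
>>> pipe3.fit(X_train, y_train)
>>> for name, X_sub, y_sub in [('train', X_train, y_train),
...                            ('test', X_test, y_test)]:
...     proba = pipe3.predict_proba(X_sub)[:, 1]
...     print('KNN %s ROC AUC: %.2f' % (name, roc_auc_score(y_sub, proba)))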

Since we only selected two features for the classification examples, it would be interesting to see what the decision region of the ensemble classifier actually looks like. Although it is not necessary to standardize the training features prior to model fitting because our logistic regression and k-nearest neighbors pipelines will automatically take care of this, we will standardize the training set so that the decision regions of the decision tree will be on the same scale for visual purposes. The code is as follows:

>>> sc = StandardScaler()
>>> X_train_std = sc.fit_transform(X_train)
>>> from itertools import product
>>> x_min = X_train_std[:, 0].min() - 1
>>> x_max = X_train_std[:, 0].max() + 1
>>> y_min = X_train_std[:, 1].min() - 1
>>> y_max = X_train_std[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(nrows=2, ncols=2, 
...                         sharex='col', 
...                         sharey='row', 
...                         figsize=(7, 5))
>>> for idx, clf, tt in zip(product([0, 1], [0, 1]),
...                         all_clf, clf_labels):
...     clf.fit(X_train_std, y_train)
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.3)    
...     axarr[idx[0], idx[1]].scatter(X_train_std[y_train==0, 0], 
...                                   X_train_std[y_train==0, 1], 
...                                   c='blue', 
...                                   marker='^',
...                                   s=50)    
...     axarr[idx[0], idx[1]].scatter(X_train_std[y_train==1, 0], 
...                                   X_train_std[y_train==1, 1], 
...                                   c='red', 
...                                   marker='o',
...                                   s=50)   
...     axarr[idx[0], idx[1]].set_title(tt)
>>> plt.text(-3.5, -4.5, 
...          s='Sepal width [standardized]', 
...          ha='center', va='center', fontsize=12)
>>> plt.text(-10.5, 4.5, 
...          s='Petal length [standardized]', 
...          ha='center', va='center', 
...          fontsize=12, rotation=90)
>>> plt.show()

Interestingly, but also as expected, the decision regions of the ensemble classifier seem to be a hybrid of the decision regions from the individual classifiers. At first glance, the majority vote decision boundary looks a lot like the decision boundary of the k-nearest neighbors classifier. However, we can see that it is orthogonal to the y axis for sepal width $\geq 1$ (standardized), just like the decision tree stump:

[Figure: Decision regions of the logistic regression, decision tree, k-nearest neighbors, and majority vote classifiers]

Before you learn how to tune the individual classifier parameters for ensemble classification, let's call the get_params method to get a basic idea of how we can access the individual parameters inside a GridSearch object:

>>> mv_clf.get_params()
{'decisiontreeclassifier': DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=1,
             max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             random_state=0, splitter='best'),
 'decisiontreeclassifier__class_weight': None,
 'decisiontreeclassifier__criterion': 'entropy',
 [...]
 'decisiontreeclassifier__random_state': 0,
 'decisiontreeclassifier__splitter': 'best',
 'pipeline-1': Pipeline(steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr',
           penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
           verbose=0))]),
 'pipeline-1__clf': LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr',
           penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
           verbose=0),
 'pipeline-1__clf__C': 0.001,
 'pipeline-1__clf__class_weight': None,
 'pipeline-1__clf__dual': False,
 [...]
 'pipeline-1__sc__with_std': True,
 'pipeline-2': Pipeline(steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            metric_params=None, n_neighbors=1, p=2, weights='uniform'))]),
 'pipeline-2__clf': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            metric_params=None, n_neighbors=1, p=2, weights='uniform'),
 'pipeline-2__clf__algorithm': 'auto',
 [...]
 'pipeline-2__sc__with_std': True}

Based on the values returned by the get_params method, we now know how to access the individual classifier's attributes. Let's now tune the inverse regularization parameter C of the logistic regression classifier and the decision tree depth via a grid search for demonstration purposes. The code is as follows:

>>> from sklearn.grid_search import GridSearchCV
>>> params = {'decisiontreeclassifier__max_depth': [1, 2],
...           'pipeline-1__clf__C': [0.001, 0.1, 100.0]}
>>> grid = GridSearchCV(estimator=mv_clf, 
...                     param_grid=params, 
...                     cv=10, 
...                     scoring='roc_auc')
>>> grid.fit(X_train, y_train)

After the grid search has completed, we can print the different hyperparameter value combinations and the average ROC AUC scores computed via 10-fold cross-validation. The code is as follows:

>>> for params, mean_score, scores in grid.grid_scores_:
...     print("%0.3f+/-%0.2f %r"
...            % (mean_score, scores.std() / 2, params))
0.967+/-0.05 {'pipeline-1__clf__C': 0.001, 'decisiontreeclassifier__max_depth': 1}
0.967+/-0.05 {'pipeline-1__clf__C': 0.1, 'decisiontreeclassifier__max_depth': 1}
1.000+/-0.00 {'pipeline-1__clf__C': 100.0, 'decisiontreeclassifier__max_depth': 1}
0.967+/-0.05 {'pipeline-1__clf__C': 0.001, 'decisiontreeclassifier__max_depth': 2}
0.967+/-0.05 {'pipeline-1__clf__C': 0.1, 'decisiontreeclassifier__max_depth': 2}
1.000+/-0.00 {'pipeline-1__clf__C': 100.0, 'decisiontreeclassifier__max_depth': 2}

>>> print('Best parameters: %s' % grid.best_params_)
Best parameters: {'pipeline-1__clf__C': 100.0, 'decisiontreeclassifier__max_depth': 1}

>>> print('Accuracy: %.2f' % grid.best_score_)
Accuracy: 1.00

As we can see, we get the best cross-validation results when we choose a lower regularization strength (C = 100.0) whereas the tree depth does not seem to affect the performance at all, suggesting that a decision stump is sufficient to separate the data. To remind ourselves that it is a bad practice to use the test dataset more than once for model evaluation, we are not going to estimate the generalization performance of the tuned hyperparameters in this section. We will move on swiftly to an alternative approach for ensemble learning: bagging.

Note

The majority vote approach we implemented in this section is sometimes also referred to as stacking. However, the stacking algorithm is more typically used in combination with a logistic regression model that predicts the final class label using the predictions of the individual classifiers in the ensemble as input, which has been described in more detail by David H. Wolpert in D. H. Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992.
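
For illustration, the following is a minimal sketch of that flavor of stacking (our own example, not the book's code): the out-of-fold predictions of the level-one classifiers become the input features of a logistic regression meta-classifier. It uses cross_val_predict, which in newer scikit-learn versions is imported from sklearn.model_selection instead:

>>> from sklearn.cross_validation import cross_val_predict
>>> from sklearn.linear_model import LogisticRegression
>>> # out-of-fold class-label predictions of the level-one classifiers
>>> meta_features = np.column_stack(
...     [cross_val_predict(clf, X_train, y_train, cv=10)
...      for clf in [pipe1, clf2, pipe3]])
>>> # the level-two (meta) classifier learns how to combine these predictions
>>> stacker = LogisticRegression()
>>> stacker = stacker.fit(meta_features, y_train)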

Bagging – building an ensemble of classifiers from bootstrap samples

Bagging is an ensemble learning technique that is closely related to the MajorityVoteClassifier that we implemented in the previous section, as illustrated in the following diagram:

[Figure: Bagging as an ensemble of classifiers fit to bootstrap samples of the training set and combined via majority voting]

However, instead of using the same training set to fit the individual classifiers in the ensemble, we draw bootstrap samples (random samples with replacement) from the initial training set, which is why bagging is also known as bootstrap aggregating. To provide a more concrete example of how bootstrapping works, let's consider the example shown in the following figure. Here, we have seven different training instances (denoted as indices 1-7) that are sampled randomly with replacement in each round of bagging. Each bootstrap sample is then used to fit a classifier $C_j$, which is most typically an unpruned decision tree:

[Figure: Seven training instances sampled randomly with replacement in several rounds of bagging]
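
To make the sampling step concrete, here is a small sketch (our own toy example) that draws one bootstrap sample of the seven training indices with replacement:

>>> rng = np.random.RandomState(1)
>>> indices = np.arange(1, 8)        # the seven training instances 1-7
>>> # sampling with replacement: some indices appear several times,
>>> # others are left out of this particular bootstrap sample
>>> bootstrap_sample = rng.choice(indices, size=indices.shape[0], replace=True)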

Bagging is also related to the random forest classifier that we introduced in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. In fact, random forests are a special case of bagging where we also use random feature subsets to fit the individual decision trees. Bagging was first proposed by Leo Breiman in a technical report in 1994; he also showed that bagging can improve the accuracy of unstable models and decrease the degree of overfitting. I highly recommend you read about his research in L. Breiman. Bagging Predictors. Machine Learning, 24(2):123–140, 1996, which is freely available online, to learn more about bagging.
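
For reference, such a random forest can be instantiated directly in scikit-learn; the following sketch (the parameter values are arbitrary) bags 500 trees and additionally restricts each split to a random subset of features:

>>> from sklearn.ensemble import RandomForestClassifier
>>> forest = RandomForestClassifier(n_estimators=500,    # number of bagged trees
...                                 max_features='sqrt', # random feature subset per split
...                                 bootstrap=True,
...                                 random_state=1)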

To see bagging in action, let's create a more complex classification problem using the Wine dataset that we introduced in Chapter 4, Building Good Training Sets – Data Preprocessing. Here, we will only consider the Wine classes 2 and 3, and we select two features: Alcohol and Hue.

>>> import pandas as pd
>>> df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
>>> df_wine.columns = ['Class label', 'Alcohol', 
...                    'Malic acid', 'Ash', 
...                    'Alcalinity of ash', 
...                    'Magnesium', 'Total phenols', 
...                    'Flavanoids', 'Nonflavanoid phenols',
...                    'Proanthocyanins', 
...                    'Color intensity', 'Hue', 
...                    'OD280/OD315 of diluted wines', 
...                    'Proline']
>>> df_wine = df_wine[df_wine['Class label'] != 1]
>>> y = df_wine['Class label'].values
>>> X = df_wine[['Alcohol', 'Hue']].values

Next we encode the class labels into binary format and split the dataset into 60 percent training and 40 percent test set, respectively:

>>> from sklearn.preprocessing import LabelEncoder
>>> from sklearn.cross_validation import train_test_split
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)
>>> X_train, X_test, y_train, y_test =\
...            train_test_split(X, y, 
...                             test_size=0.40, 
...                             random_state=1)

A BaggingClassifier algorithm is already implemented in scikit-learn, which we can import from the ensemble submodule. Here, we will use an unpruned decision tree as the base classifier and create an ensemble of 500 decision trees fitted on different bootstrap samples of the training dataset:

>>> from sklearn.ensemble import BaggingClassifier
>>> tree = DecisionTreeClassifier(criterion='entropy', 
...                               max_depth=None)
>>> bag = BaggingClassifier(base_estimator=tree,
...                         n_estimators=500, 
...                         max_samples=1.0, 
...                         max_features=1.0, 
...                         bootstrap=True, 
...                         bootstrap_features=False, 
...                         n_jobs=1, 
...                         random_state=1)

Next we will calculate the accuracy score of the prediction on the training and test dataset to compare the performance of the bagging classifier to the performance of a single unpruned decision tree:

>>> from sklearn.metrics import accuracy_score
>>> tree = tree.fit(X_train, y_train)
>>> y_train_pred = tree.predict(X_train)
>>> y_test_pred = tree.predict(X_test)
>>> tree_train = accuracy_score(y_train, y_train_pred)
>>> tree_test = accuracy_score(y_test, y_test_pred)
>>> print('Decision tree train/test accuracies %.3f/%.3f'
...        % (tree_train, tree_test))
Decision tree train/test accuracies 1.000/0.854

Based on the accuracy values that we printed by executing the preceding code snippet, the unpruned decision tree predicts all class labels of the training samples correctly; however, the substantially lower test accuracy indicates high variance (overfitting) of the model:

>>> bag = bag.fit(X_train, y_train)
>>> y_train_pred = bag.predict(X_train)
>>> y_test_pred = bag.predict(X_test)
>>> bag_train = accuracy_score(y_train, y_train_pred) 
>>> bag_test = accuracy_score(y_test, y_test_pred) 
>>> print('Bagging train/test accuracies %.3f/%.3f'
...        % (bag_train, bag_test))
Bagging train/test accuracies 1.000/0.896

Although the training accuracies of the decision tree and bagging classifier are similar on the training set (both 1.0), we can see that the bagging classifier has a slightly better generalization performance as estimated on the test set. Next let's compare the decision regions between the decision tree and bagging classifier:

>>> x_min = X_train[:, 0].min() - 1
>>> x_max = X_train[:, 0].max() + 1
>>> y_min = X_train[:, 1].min() - 1
>>> y_max = X_train[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(nrows=1, ncols=2, 
...                         sharex='col', 
...                         sharey='row', 
...                         figsize=(8, 3))
>>> for idx, clf, tt in zip([0, 1],
...                         [tree, bag],
...                         ['Decision Tree', 'Bagging']):
...     clf.fit(X_train, y_train)
...     
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx].contourf(xx, yy, Z, alpha=0.3)
...     axarr[idx].scatter(X_train[y_train==0, 0], 
...                        X_train[y_train==0, 1], 
...                        c='blue', marker='^')    
...     axarr[idx].scatter(X_train[y_train==1, 0], 
...                        X_train[y_train==1, 1], 
...                        c='red', marker='o')    
...     axarr[idx].set_title(tt)
>>> axarr[0].set_ylabel('Alcohol', fontsize=12)
>>> plt.text(10.2, -1.2, 
...          s='Hue', 
...          ha='center', va='center', fontsize=12)
>>> plt.show()

As we can see in the resulting plot, the piece-wise linear decision boundary of the three-node deep decision tree looks smoother in the bagging ensemble:

[Figure: Decision regions of the unpruned decision tree versus the bagging ensemble on the Wine subset]

We only looked at a very simple bagging example in this section. In practice, more complex classification tasks and a dataset's high dimensionality can easily lead to overfitting in single decision trees, and this is where the bagging algorithm can really play to its strengths. Finally, we should note that the bagging algorithm can be an effective approach to reducing the variance of a model. However, bagging is ineffective in reducing model bias, which is why we want to choose an ensemble of classifiers with low bias, for example, unpruned decision trees.

Leveraging weak learners via adaptive boosting

In this section about ensemble methods, we will discuss boosting with a special focus on its most common implementation, AdaBoost (short for Adaptive Boosting).

Note

The original idea behind AdaBoost was formulated by Robert Schapire in 1990 (R. E. Schapire. The Strength of Weak Learnability. Machine learning, 5(2):197–227, 1990). After Robert Schapire and Yoav Freund presented the AdaBoost algorithm in the Proceedings of the Thirteenth International Conference (ICML 1996), AdaBoost became one of the most widely used ensemble methods in the years that followed (Y. Freund, R. E. Schapire, et al. Experiments with a New Boosting Algorithm. In ICML, volume 96, pages 148–156, 1996). In 2003, Freund and Schapire received the Goedel Prize for their groundbreaking work, which is a prestigious prize for the most outstanding publications in the computer science field.

In boosting, the ensemble consists of very simple base classifiers, also often referred to as weak learners, which have only a slight performance advantage over random guessing. A typical example of a weak learner is a decision tree stump. The key concept behind boosting is to focus on training samples that are hard to classify, that is, to let the weak learners subsequently learn from misclassified training samples to improve the performance of the ensemble. In contrast to bagging, the initial formulation of boosting uses random subsets of training samples drawn from the training dataset without replacement. The original boosting procedure is summarized in four key steps as follows:

  1. Draw a random subset of training samples $d_1$ without replacement from the training set $D$ to train a weak learner $C_1$.
  2. Draw a second random training subset $d_2$ without replacement from the training set and add 50 percent of the samples that were previously misclassified to train a weak learner $C_2$.
  3. Find the training samples $d_3$ in the training set $D$ on which $C_1$ and $C_2$ disagree to train a third weak learner $C_3$.
  4. Combine the weak learners $C_1$, $C_2$, and $C_3$ via majority voting.

As discussed by Leo Breiman (L. Breiman. Bias, Variance, and Arcing Classifiers. 1996), boosting can lead to a decrease in bias as well as variance compared to bagging models. In practice, however, boosting algorithms such as AdaBoost are also known for their high variance, that is, the tendency to overfit the training data (G. Raetsch, T. Onoda, and K. R. Mueller. An Improvement of Adaboost to Avoid Overfitting. In Proc. of the Int. Conf. on Neural Information Processing. Citeseer, 1998).

In contrast to the original boosting procedure as described here, AdaBoost uses the complete training set to train the weak learners where the training samples are reweighted in each iteration to build a strong classifier that learns from the mistakes of the previous weak learners in the ensemble. Before we dive deeper into the specific details of the AdaBoost algorithm, let's take a look at the following figure to get a better grasp of the basic concept behind AdaBoost:

[Figure: AdaBoost reweighting misclassified samples over three rounds of boosting and combining the weak learners via a weighted majority vote]

To walk through the AdaBoost illustration step by step, we start with subfigure 1, which represents a training set for binary classification where all training samples are assigned equal weights. Based on this training set, we train a decision stump (shown as a dashed line) that tries to classify the samples of the two classes (triangles and circles) as well as possible by minimizing the cost function (or the impurity score in the special case of decision tree ensembles). For the next round (subfigure 2), we assign a larger weight to the two previously misclassified samples (circles). Furthermore, we lower the weight of the correctly classified samples. The next decision stump will now be more focused on the training samples that have the largest weights, that is, the training samples that are supposedly hard to classify. The weak learner shown in subfigure 2 misclassifies three different samples from the circle-class, which are then assigned a larger weight as shown in subfigure 3. Assuming that our AdaBoost ensemble only consists of three rounds of boosting, we would then combine the three weak learners trained on different reweighted training subsets by a weighted majority vote, as shown in subfigure 4.

Now that we have a better understanding of the basic concept of AdaBoost, let's take a more detailed look at the algorithm using pseudo code. For clarity, we will denote element-wise multiplication by the cross symbol ($\times$) and the dot product between two vectors by a dot symbol ($\cdot$), respectively. The steps are as follows:

  1. Set the weight vector $\mathbf{w}$ to uniform weights, where $\sum_i w_i = 1$.
  2. For $j$ in $m$ boosting rounds, do the following:
  3. Train a weighted weak learner: $C_j = \operatorname{train}(\mathbf{X}, \mathbf{y}, \mathbf{w})$.
  4. Predict class labels: $\hat{\mathbf{y}} = \operatorname{predict}(C_j, \mathbf{X})$.
  5. Compute the weighted error rate: $\varepsilon = \mathbf{w} \cdot (\hat{\mathbf{y}} \neq \mathbf{y})$.
  6. Compute the coefficient: $\alpha_j = 0.5 \log \frac{1 - \varepsilon}{\varepsilon}$.
  7. Update the weights: $\mathbf{w} := \mathbf{w} \times \exp\left(-\alpha_j \times \hat{\mathbf{y}} \times \mathbf{y}\right)$.
  8. Normalize the weights to sum to 1: $\mathbf{w} := \mathbf{w} \big/ \sum_i w_i$.
  9. Compute the final prediction: $\hat{\mathbf{y}} = \left(\sum_{j=1}^{m} \left(\alpha_j \times \operatorname{predict}(C_j, \mathbf{X})\right) > 0\right)$.

Note that the expression $(\hat{\mathbf{y}} \neq \mathbf{y})$ in step 5 refers to a binary vector of 1s and 0s, where a 1 is assigned if the prediction is incorrect and a 0 is assigned otherwise.
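
To connect the pseudo code to working code, the following is a minimal sketch (our own illustration, not the book's implementation; it assumes class labels in {-1, 1} and that each weighted error stays strictly between 0 and 1) of the boosting loop using decision tree stumps as weighted weak learners:

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, m=10):
    """Minimal AdaBoost sketch; expects class labels in {-1, 1}."""
    w = np.full(X.shape[0], 1.0 / X.shape[0])     # step 1: uniform weights
    learners, alphas = [], []
    for _ in range(m):                            # step 2: m boosting rounds
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # step 3: weighted weak learner
        y_pred = stump.predict(X)                 # step 4: predict class labels
        eps = np.dot(w, y_pred != y)              # step 5: weighted error rate
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # step 6: coefficient alpha_j
        w = w * np.exp(-alpha * y_pred * y)       # step 7: reweight the samples
        w = w / w.sum()                           # step 8: normalize to sum to 1
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas


def adaboost_predict(learners, alphas, X):
    # step 9: sign of the alpha-weighted sum of the weak learners' predictions
    scores = sum(alpha * clf.predict(X) for alpha, clf in zip(alphas, learners))
    return np.where(scores > 0, 1, -1)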

Although the AdaBoost algorithm seems to be pretty straightforward, let's walk through a more concrete example using a training set consisting of 10 training samples as illustrated in the following table:

[Table: A 10-sample toy training set with sample indices, feature values x, true class labels y, initial weights, predicted class labels, and updated weights]

The first column of the table depicts the sample indices of the training samples 1 to 10. In the second column, we see the feature values of the individual samples, assuming this is a one-dimensional dataset. The third column shows the true class label $y_i$ for each training sample $x_i$, where $y_i \in \{1, -1\}$. The initial weights are shown in the fourth column; we initialize the weights to uniform values and normalize them to sum to one. In the case of the 10-sample training set, we therefore assign 0.1 to each weight $w_i$ in the weight vector $\mathbf{w}$. The predicted class labels $\hat{y}$ are shown in the fifth column, assuming that our splitting criterion is $x \leq 3.0$. The last column of the table then shows the updated weights based on the update rules that we defined in the pseudo code.

Since the computation of the weight updates may look a little bit complicated at first, we will now follow the calculation step by step. We start by computing the weighted error rate $\varepsilon$ as described in step 5:

$$\varepsilon = \mathbf{w} \cdot (\hat{\mathbf{y}} \neq \mathbf{y}) = 7 \times (0.1 \times 0) + 3 \times (0.1 \times 1) = \frac{3}{10} = 0.3$$

Next we compute the coefficient $\alpha_j$ (shown in step 6), which is later used in step 7 to update the weights, as well as for the weights in the majority vote prediction (step 9):

$$\alpha_j = 0.5 \log\left(\frac{1 - \varepsilon}{\varepsilon}\right) = 0.5 \log\left(\frac{0.7}{0.3}\right) \approx 0.424$$

After we have computed the coefficient $\alpha_j$, we can now update the weight vector using the following equation:

$$\mathbf{w} := \mathbf{w} \times \exp\left(-\alpha_j \times \hat{\mathbf{y}} \times \mathbf{y}\right)$$

Here, $\hat{\mathbf{y}} \times \mathbf{y}$ is an element-wise multiplication between the vectors of the predicted and true class labels, respectively. Thus, if a prediction $\hat{y}_i$ is correct, $\hat{y}_i \times y_i$ will have a positive sign so that we decrease the $i$th weight, since $\alpha_j$ is a positive number as well:

$$0.1 \times \exp(-0.424 \times 1 \times 1) \approx 0.065$$

Similarly, we will increase the $i$th weight if $\hat{y}_i$ predicted the label incorrectly, like this:

$0.1 \times \exp\left(-0.424 \times 1 \times (-1)\right) \approx 0.153$

Or like this:

$0.1 \times \exp\left(-0.424 \times (-1) \times 1\right) \approx 0.153$

After we update each weight in the weight vector, we normalize the weights so that they sum up to 1 (step 8):

$\mathbf{w} := \dfrac{\mathbf{w}}{\sum_i w_i}$

Here, $\sum_i w_i = 7 \times 0.065 + 3 \times 0.153 = 0.914$.

Thus, each weight that corresponds to a correctly classified sample will be reduced from the initial value of 0.1 to $0.065 / 0.914 \approx 0.071$ for the next round of boosting. Similarly, the weight of each incorrectly classified sample will increase from 0.1 to $0.153 / 0.914 \approx 0.167$.
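If you would like to verify these numbers yourself, the following short snippet (purely illustrative) reproduces the calculation for this boosting round; it assumes, as in the table above, that samples 7 to 9 are the misclassified ones:

import numpy as np

w = np.full(10, 0.1)                            # initial uniform weights
hit = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 1])  # 1 = correctly classified, 0 = misclassified

error = np.dot(w, (hit == 0).astype(float))     # weighted error rate: 0.3
alpha = 0.5 * np.log((1 - error) / error)       # coefficient: ~0.424

agreement = np.where(hit == 1, 1, -1)           # y_hat * y: +1 if correct, -1 if incorrect
w_new = w * np.exp(-alpha * agreement)          # ~0.065 (correct) / ~0.153 (incorrect)
w_new /= w_new.sum()                            # normalize: ~0.071 and ~0.167
print(round(error, 3), round(alpha, 3))
print(np.round(w_new, 3))

Up to rounding of the intermediate values, this reproduces the weighted error rate of 0.3, the coefficient of approximately 0.424, and the normalized weights of approximately 0.071 and 0.167.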

This was AdaBoost in a nutshell. Moving on to the more practical part, let's now train an AdaBoost ensemble classifier via scikit-learn. We will use the same Wine subset that we used in the previous section to train the bagging meta-classifier. Via the base_estimator parameter, we will train the AdaBoostClassifier on 500 decision tree stumps:

>>> from sklearn.ensemble import AdaBoostClassifier
>>> tree = DecisionTreeClassifier(criterion='entropy', 
...                               max_depth=1)
>>> ada = AdaBoostClassifier(base_estimator=tree,
...                          n_estimators=500, 
...                          learning_rate=0.1,
...                          random_state=0)
>>> tree = tree.fit(X_train, y_train)
>>> y_train_pred = tree.predict(X_train)
>>> y_test_pred = tree.predict(X_test)
>>> tree_train = accuracy_score(y_train, y_train_pred)
>>> tree_test = accuracy_score(y_test, y_test_pred)
>>> print('Decision tree train/test accuracies %.3f/%.3f'
...       % (tree_train, tree_test))
Decision tree train/test accuracies 0.845/0.854

As we can see, the decision tree stump seems to underfit the training data, in contrast to the unpruned decision tree that we saw in the previous section:

>>> ada = ada.fit(X_train, y_train)
>>> y_train_pred = ada.predict(X_train)
>>> y_test_pred = ada.predict(X_test)
>>> ada_train = accuracy_score(y_train, y_train_pred) 
>>> ada_test = accuracy_score(y_test, y_test_pred) 
>>> print('AdaBoost train/test accuracies %.3f/%.3f'
...       % (ada_train, ada_test))
AdaBoost train/test accuracies 1.000/0.875

As we can see, the AdaBoost model predicts all class labels of the training set correctly and also shows a slightly improved test set performance compared to the decision tree stump. However, we also see that we introduced additional variance by our attempt to reduce the model bias.

Although we used another simple example for demonstration purposes, we can see that the performance of the AdaBoost classifier improved slightly compared to the decision stump, and that it achieved accuracy scores very similar to those of the bagging classifier that we trained in the previous section. However, we should note that it is considered bad practice to select a model based on the repeated usage of the test set: the estimate of the generalization performance may be too optimistic, as we discussed in more detail in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.
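As a sketch of how such a comparison could be carried out without touching the test set, we could, for example, tune the ensemble via cross-validation on the training data only; the variable names and parameter grid below are purely illustrative. Note that in older scikit-learn releases GridSearchCV is located in the sklearn.grid_search module, and that newer releases rename the base_estimator parameter of AdaBoostClassifier to estimator:

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {'n_estimators': [100, 300, 500],
...               'learning_rate': [0.01, 0.1, 1.0]}
>>> ada_cv = AdaBoostClassifier(
...              base_estimator=DecisionTreeClassifier(
...                  criterion='entropy',
...                  max_depth=1),
...              random_state=0)
>>> gs = GridSearchCV(estimator=ada_cv,
...                   param_grid=param_grid,
...                   scoring='accuracy',
...                   cv=10)
>>> gs = gs.fit(X_train, y_train)
>>> print('CV accuracy: %.3f' % gs.best_score_)
>>> print('Best parameters: %s' % gs.best_params_)

Only after a model has been selected in this way would we use the test set a single time to estimate its generalization performance.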

Finally, let's check what the decision regions look like:

>>> x_min = X_train[:, 0].min() - 1
>>> x_max = X_train[:, 0].max() + 1
>>> y_min = X_train[:, 1].min() - 1
>>> y_max = X_train[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(1, 2, 
...                         sharex='col', 
...                         sharey='row', 
...                         figsize=(8, 3))
>>> for idx, clf, tt in zip([0, 1],
...                         [tree, ada],
...                         ['Decision Tree', 'AdaBoost']):
...     clf.fit(X_train, y_train)   
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx].contourf(xx, yy, Z, alpha=0.3)
...     axarr[idx].scatter(X_train[y_train==0, 0], 
...                        X_train[y_train==0, 1], 
...                        c='blue', 
...                        marker='^')
...     axarr[idx].scatter(X_train[y_train==1, 0], 
...                        X_train[y_train==1, 1], 
...                        c='red',
...                        marker='o')
...     axarr[idx].set_title(tt)
>>> axarr[0].set_ylabel('Alcohol', fontsize=12)
>>> plt.text(10.2, -1.2, 
...          s='Hue', 
...          ha='center', 
...          va='center', 
...          fontsize=12)    
>>> plt.show()

By looking at the decision regions, we can see that the decision boundary of the AdaBoost model is substantially more complex than the decision boundary of the decision stump. In addition, we note that the AdaBoost model separates the feature space very similarly to the bagging classifier that we trained in the previous section.

[Figure: decision regions of the decision tree stump (left) and the AdaBoost ensemble (right) on the Alcohol and Hue features of the Wine training subset]

As concluding remarks about ensemble techniques, it is worth noting that ensemble learning increases the computational complexity compared to individual classifiers. In practice, we need to think carefully whether we want to pay the price of increased computational costs for an often relatively modest improvement of predictive performance.
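For instance, a rough way to get a feeling for this additional cost in our example is simply to time the fit calls of the single decision stump and of the 500-stump AdaBoost ensemble; the exact numbers will, of course, depend on your hardware:

>>> import time
>>> start = time.perf_counter()
>>> tree = tree.fit(X_train, y_train)   # a single decision stump
>>> print('Stump fit: %.3f s' % (time.perf_counter() - start))
>>> start = time.perf_counter()
>>> ada = ada.fit(X_train, y_train)     # 500 stumps trained sequentially
>>> print('AdaBoost fit: %.3f s' % (time.perf_counter() - start))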

An often-cited example of this trade-off is the famous $1 Million Netflix Prize, which was won using ensemble techniques. The details about the algorithm were published in A. Töscher, M. Jahrer, and R. M. Bell. The BigChaos Solution to the Netflix Grand Prize. Netflix Prize documentation, 2009 (which is available at http://www.stat.osu.edu/~dmsl/GrandPrize2009_BPC_BigChaos.pdf). Although the winning team received the $1 million prize money, Netflix never implemented their model due to its complexity, which made it infeasible for a real-world application. To quote their exact words (http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html):

"[…] additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment."

Summary

In this chapter, we looked at some of the most popular and widely used techniques for ensemble learning. Ensemble methods combine different classification models to cancel out their individual weaknesses, which often results in stable, well-performing models that are very attractive for industrial applications as well as machine learning competitions.

At the beginning of this chapter, we implemented a MajorityVoteClassifier in Python that allows us to combine different algorithms for classification. We then looked at bagging, a useful technique for reducing the variance of a model by drawing random bootstrap samples from the training set and combining the individually trained classifiers via majority vote. Finally, we discussed AdaBoost, an algorithm based on weak learners that subsequently learn from their mistakes.

Throughout the previous chapters, we discussed different learning algorithms, tuning, and evaluation techniques. In the following chapter, we will look at a particular application of machine learning, sentiment analysis, which has certainly become an interesting topic in the era of the Internet and social media.