Hands-On Ensemble Learning with Python

By: George Kyriakides, Konstantinos G. Margaritis

Overview of this book

Ensembling is a technique of combining two or more similar or dissimilar machine learning algorithms to create a model that delivers superior predictive power. This book will demonstrate how you can use a variety of weak algorithms to make a strong predictive model. With its hands-on approach, you'll not only get up to speed with the basic theory but also the application of different ensemble learning techniques. Using examples and real-world datasets, you'll be able to produce better machine learning models to solve supervised learning problems such as classification and regression. In addition to this, you'll go on to leverage ensemble learning techniques such as clustering to produce unsupervised machine learning models. As you progress, the chapters will cover different machine learning algorithms that are widely used in the practical world to make predictions and classifications. You'll even get to grips with the use of Python libraries such as scikit-learn and Keras for implementing different ensemble models. By the end of this book, you will be well-versed in ensemble learning, and have the skills you need to understand which ensemble method is required for which problem, and successfully implement them in real-world scenarios.
Table of Contents (20 chapters)

  • Section 1: Introduction and Required Software Tools
  • Section 2: Non-Generative Methods
  • Section 3: Generative Methods
  • Section 4: Clustering
  • Section 5: Real World Applications

Machine learning algorithms

There are a number of machine learning algorithms, for both supervised and unsupervised learning. In this book, we will cover some of the most popular algorithms that can be utilized within ensembles. In this chapter, we will go over the key concepts behind each algorithm, as well as the Python libraries that implement them.

Python packages

In order to leverage the power of any programming language, libraries are essential. They provide convenient, tested implementations of many algorithms. In this book, we will be using Python 3.6 along with the following libraries: NumPy, for its excellent implementation of numerical operations and matrices; Pandas, for its convenient data manipulation methods; Matplotlib, to visualize our data; scikit-learn, for its excellent implementations of various machine learning algorithms; and Keras, to build neural networks using its Pythonic, intuitive interface. Keras is an interface for other deep learning frameworks, such as TensorFlow and Theano. The specific versions of each library used in this book are listed as follows (a quick way to verify the installed versions is sketched after the list):

  • numpy==1.15.1
  • pandas==0.23.4
  • scikit-learn==0.19.1
  • matplotlib==2.2.2
  • Keras==2.2.4
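
Before moving on, it can be useful to verify that the installed versions match the ones listed above. The following is a minimal sketch (assuming all five libraries, and a backend for Keras such as TensorFlow, are already installed):

# Print the installed versions of the libraries used in this book
import numpy
import pandas
import sklearn
import matplotlib
import keras  # requires a backend such as TensorFlow to be installed

for name, module in [('numpy', numpy), ('pandas', pandas),
                     ('scikit-learn', sklearn), ('matplotlib', matplotlib),
                     ('Keras', keras)]:
    print(name, module.__version__)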

Supervised learning algorithms

The most common class of machine learning algorithms is supervised learning algorithms. These concern problems where the data is labeled; that is, each data point has a known target value related to it that we wish to model or predict.

Regression

Regression is one of the simplest machine learning algorithms. Ordinary Least Squares (OLS) regression of the form y = ax + b attempts to optimize the a and b parameters in order to fit the data. It uses the mean squared error (MSE) as its cost function. As the name implies, it is able to solve regression problems.
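
Concretely, for n training pairs (x_i, y_i), the MSE minimized by OLS is the average squared difference between the predictions and the targets:

MSE = (1/n) * Σ_i (y_i - (a*x_i + b))^2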

We can use the scikit-learn implementation of OLS to try and model the diabetes dataset (the dataset is provided with the library):

# --- SECTION 1 ---
# Libraries and data loading
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn import metrics
diabetes = load_diabetes()

The first section deals with importing libraries and loading data. We use the LinearRegression implementation that exists in the linear_model package:

# --- SECTION 2 ---
# Split the data into train and test set
train_x, train_y = diabetes.data[:400], diabetes.target[:400]
test_x, test_y = diabetes.data[400:], diabetes.target[400:]

The second section splits the data into a train and a test set. For this example, we use the first 400 instances as the train set and the remaining 42 as the test set:

# --- SECTION 3 ---
# Instantiate, train and evaluate the model
ols = LinearRegression()
ols.fit(train_x, train_y)
err = metrics.mean_squared_error(test_y, ols.predict(test_x))
r2 = metrics.r2_score(test_y, ols.predict(test_x))

The next section instantiates a linear regression object with ols = LinearRegression(). It then optimizes the parameters, or fits the model to our training instances, using ols.fit(train_x, train_y). Finally, using the metrics package, we calculate the MSE and R2 of our model on the test data. Section 4 prints the fitted coefficients and the metrics:

# --- SECTION 4 ---
# Print the model
print('---OLS on diabetes dataset.---')
print('Coefficients:')
print('Intercept (b): %.2f'%ols.intercept_)
for i in range(len(diabetes.feature_names)):
    print(diabetes.feature_names[i]+': %.2f'%ols.coef_[i])
print('-'*30)
print('R-squared: %.2f'%r2, ' MSE: %.2f \n'%err)

The code's output is the following:

---OLS on diabetes dataset.---
Coefficients:
Intercept (b): 152.73
age: 5.03
sex: -238.41
bmi: 521.63
bp: 299.94
s1: -752.12
s2: 445.15
s3: 83.51
s4: 185.58
s5: 706.47
s6: 88.68
------------------------------
R-squared: 0.70 MSE: 1668.75

Another form of regression, logistic regression, attempts to model the probability that an instance belongs to one of two classes. Again, it attempts to optimize the a and b parameters, this time in order to model p = 1 / (1 + e^(-(ax+b))). Once again, using scikit-learn and the breast cancer dataset, we can create and evaluate a simple logistic regression. The following code sections are similar to the preceding ones, but this time we'll use classification accuracy and a confusion matrix, rather than MSE and R2, as metrics:

# --- SECTION 1 ---
# Libraries and data loading
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn import metrics
bc = load_breast_cancer()

# --- SECTION 2 ---
# Split the data into train and test set
train_x, train_y = bc.data[:400], bc.target[:400]
test_x, test_y = bc.data[400:], bc.target[400:]

# --- SECTION 3 ---
# Instantiate, train and evaluate the model
logit = LogisticRegression()
logit.fit(train_x, train_y)
acc = metrics.accuracy_score(test_y, logit.predict(test_x))

# --- SECTION 4 ---
# Print the model
print('---Logistic Regression on breast cancer dataset.---')
print('Coefficients:')
print('Intercept (b): %.2f'%logit.intercept_)
for i in range(len(bc.feature_names)):
    print(bc.feature_names[i]+': %.2f'%logit.coef_[0][i])
print('-'*30)
print('Accuracy: %.2f \n'%acc)
print(metrics.confusion_matrix(test_y, logit.predict(test_x)))

The test classification accuracy achieved with this model is 95%, which is quite good. Furthermore, the confusion matrix that follows indicates that the model does not simply exploit the class imbalance by favoring the majority class. Later in this book, we will learn how to further increase the classification accuracy with the use of ensemble methods. The following table shows the logit model's confusion matrix:

n = 169              Predicted: Malignant    Predicted: Benign
Target: Malignant    38                      1
Target: Benign       8                       122

Support vector machines

Support vector machines, or SVMs, use a subset of the training data, specifically data points near the edge of each class, in order to define a separating hyperplane (in two dimensions, a line). These edge cases are called support vectors. The goal of an SVM is to find the hyperplane that maximizes the margin (distance) between the support vectors (depicted in the following figure). In order to classify classes that are not linearly separable, SVMs use the kernel trick to map the data into a higher-dimensional space, where it can become linearly separable:

SVM margins and support vectors

If you want to learn more about the kernel trick, this is a good starting point: https://en.wikipedia.org/wiki/Kernel_method#Mathematics:_the_kernel_trick.

In scikit-learn, SVMs are implemented in the sklearn.svm package, both for regression, with sklearn.svm.SVR, and for classification, with sklearn.svm.SVC. Once again, we'll test the algorithm's potential using scikit-learn and the same structure as the regression examples. Using an SVM with a linear kernel on the breast cancer dataset results in 95% accuracy and the following confusion matrix:

n = 169              Predicted: Malignant    Predicted: Benign
Target: Malignant    39                      0
Target: Benign       9                       121

On the diabetes dataset, by fine-tuning the C parameter to 1,000 during object instantiation (svr = SVR(kernel='linear', C=1e3)), we are able to achieve an R2 of 0.71 and an MSE of 1622.36, marginally better than the OLS model.
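
The code is not repeated in full here, but the following minimal sketch shows how the classification variant could be run, following the same structure as the logistic regression example; the regression variant on the diabetes splits is trained in exactly the same way with svr = SVR(kernel='linear', C=1e3):

# --- SVM classification on the breast cancer dataset (sketch) ---
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn import metrics

bc = load_breast_cancer()

# Same train/test split as in the previous examples
train_x, train_y = bc.data[:400], bc.target[:400]
test_x, test_y = bc.data[400:], bc.target[400:]

# Linear kernel, as in the reported results
svc = SVC(kernel='linear')
svc.fit(train_x, train_y)

acc = metrics.accuracy_score(test_y, svc.predict(test_x))
print('Accuracy: %.2f'%acc)
print(metrics.confusion_matrix(test_y, svc.predict(test_x)))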

Neural networks

Neural networks, inspired by the way biological brains are connected, consist of many neurons, or computational modules, organized in layers. Data is provided at the input layer and predictions are produced at the output layer. All intermediate layers are called hidden layers. Neurons that belong to the same layer are not connected to each other, only to neurons in other layers. Each neuron can have multiple inputs, where each input is multiplied by a specific weight, and the weighted sum of the inputs is passed to an activation function that defines the neuron's output. Common activation functions include the following:

  • Sigmoid
  • Tanh
  • ReLU
  • Linear

The network's goal is to optimize each neuron's weights, such that the cost function is minimized. Neural networks can be used either for regression, where the output layer consists of a single neuron, or for classification, where it consists of many neurons, usually as many as there are classes. There are a number of optimizing algorithms, or optimizers, available for neural networks. The most common is stochastic gradient descent, or SGD. The main idea is that each weight is updated against the direction of the error's gradient (its first derivative with respect to that weight), scaled by a factor called the learning rate.

Variations and extensions have been proposed that take into account the second derivative, adapt the learning rate, or use the momentum of previous weight changes to update the weights.
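
To make the update rule concrete, the following is a minimal, self-contained sketch of SGD fitting the simple model y = ax + b under an MSE cost on synthetic data; it only illustrates the idea and is not how scikit-learn or Keras implement their optimizers (all names and values here are arbitrary):

# --- Illustrative SGD for y = a*x + b with an MSE cost ---
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)
y = 3 * x + 2 + 0.1 * rng.randn(100)  # synthetic data around y = 3x + 2

a, b = 0.0, 0.0        # the weights we wish to optimize
learning_rate = 0.1
for step in range(500):
    # Draw a random mini-batch (the 'stochastic' part of SGD)
    idx = rng.randint(0, len(x), size=10)
    error = a * x[idx] + b - y[idx]
    # First derivatives of the MSE cost with respect to a and b
    grad_a = 2 * np.mean(error * x[idx])
    grad_b = 2 * np.mean(error)
    # Move the weights against the gradient, scaled by the learning rate
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print('Estimated a: %.2f, b: %.2f' % (a, b))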

Although the concept of neural networks has existed for a long time, their popularity has greatly increased recently with the advent of deep learning. Modern architectures consist of convolutional layers, where each layer's weights consist of matrices, and the output is calculated by sliding the weight matrix over the input. Another type of layer, the max pooling layer, calculates its output as the maximum input element, again by sliding a fixed-size window over the input. Recurrent layers retain information about their previous states. Finally, fully connected layers are traditional neurons, as described previously.

Scikit-learn implements traditional neural networks under the sklearn.neural_network package. Once again, using the preceding examples, we'll try to model the diabetes and breast cancer datasets. On the diabetes dataset, we'll use MLPRegressor with Stochastic Gradient Descent (SGD) as the optimizer, with mlpr = MLPRegressor(solver='sgd'). Without any further fine-tuning, we achieve an R2 of 0.64 and an MSE of 1977. On the breast cancer dataset, using the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS) optimizer, with mlpc = MLPClassifier(solver='lbfgs'), we get a classification accuracy of 93% and a reasonably good confusion matrix. The following table shows the neural network's confusion matrix for the breast cancer dataset:

n = 169              Predicted: Malignant    Predicted: Benign
Target: Malignant    35                      4
Target: Benign       8                       122

A very important note on neural networks: the initial weights of a network are randomly initialized. Thus, the same code can perform differently every time it is executed. In order to ensure reproducible (non-stochastic) execution, the initial random state of the network must be fixed. The two scikit-learn classes implement this feature through the random_state parameter in the object constructor. In order to set the random state to a specific seed value, the constructor must be called as follows: mlpc = MLPClassifier(solver='lbfgs', random_state=12418).
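
A minimal sketch of how these two models could be instantiated and evaluated, reusing the splits from the earlier examples and fixing random_state with the seed mentioned above (exact scores may differ slightly from the figures quoted, depending on the library version and the seed):

# --- MLPs on the diabetes and breast cancer datasets (sketch) ---
from sklearn.datasets import load_diabetes, load_breast_cancer
from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn import metrics

# Regression on the diabetes dataset with an SGD-trained MLP
diabetes = load_diabetes()
train_x, train_y = diabetes.data[:400], diabetes.target[:400]
test_x, test_y = diabetes.data[400:], diabetes.target[400:]
mlpr = MLPRegressor(solver='sgd', random_state=12418)
mlpr.fit(train_x, train_y)
print('R-squared: %.2f'%metrics.r2_score(test_y, mlpr.predict(test_x)))

# Classification on the breast cancer dataset with an LBFGS-trained MLP
bc = load_breast_cancer()
train_x, train_y = bc.data[:400], bc.target[:400]
test_x, test_y = bc.data[400:], bc.target[400:]
mlpc = MLPClassifier(solver='lbfgs', random_state=12418)
mlpc.fit(train_x, train_y)
print('Accuracy: %.2f'%metrics.accuracy_score(test_y, mlpc.predict(test_x)))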

Decision trees

Decision trees are less of a black box than most other machine learning algorithms. They can easily explain how they produce a prediction, a property called interpretability. The main concept is that they produce rules by splitting the training set using the provided features. By iteratively splitting the data, a tree structure is produced, which is where their name derives from. Let's consider a dataset where the instances are individual persons deciding on their vacations.

The dataset features consist of the person's age and available money, while the target is their preferred destination, one of either Summer Camp, Lake, or Bahamas. A possible decision tree model is depicted in the following figure:

Decision tree model for the vacation destination problem

As is evident, the model can explain how it produces any of its predictions. The model itself is built by trying to select, at each split, the feature and threshold that maximize the information gained. Roughly, this means that the model will try to iteratively split the dataset in a way that best separates the remaining instances.

Although intuitive to understand, decision trees can produce unreasonable models, the extreme case being the generation of so many rules that, eventually, each rule combination leads to a single instance. In order to avoid such models, we can restrict the tree by requiring that it does not exceed a specific depth (a maximum number of consecutive rules), or that each node contains at least a minimum number of instances before it can be split further.
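
To illustrate what "maximizing the information gained" means, the following is a minimal sketch that scores one candidate split using entropy-based information gain; the toy labels, feature values, and threshold are made up for this example, and scikit-learn's trees use Gini impurity by default rather than this exact measure:

# --- Information gain of a single candidate split (illustrative) ---
import numpy as np

def entropy(labels):
    # Shannon entropy of an array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy labels and one feature; split at a candidate threshold
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
feature = np.array([5, 7, 9, 12, 15, 18, 21, 24])
threshold = 10

left, right = y[feature <= threshold], y[feature > threshold]
gain = (entropy(y)
        - len(left) / len(y) * entropy(left)
        - len(right) / len(y) * entropy(right))
print('Information gain: %.3f'%gain)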

In scikit-learn, decision trees are implemented under the sklearn.tree package, with DecisionTreeClassifier and DecisionTreeRegressor. In our examples, using DecisionTreeRegressor on the diabetes dataset with dtr = DecisionTreeRegressor(max_depth=2), we achieve an R2 of 0.52 and an MSE of 2655. On the breast cancer dataset, using dtc = DecisionTreeClassifier(max_depth=2), we achieve 89% accuracy and the following confusion matrix:

n = 169              Predicted: Malignant    Predicted: Benign
Target: Malignant    37                      2
Target: Benign       17                      113

Although this is not the best-performing algorithm so far, we can clearly see how each individual instance is classified by exporting the tree to the graphviz format with export_graphviz(dtc, feature_names=bc.feature_names, class_names=bc.target_names, impurity=False):

The decision tree generated for the breast cancer dataset
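
A minimal sketch of how the classifier above could be trained and exported; the out_file name ('tree.dot') is an arbitrary choice for this example:

# --- Decision tree on the breast cancer dataset (sketch) ---
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

bc = load_breast_cancer()
train_x, train_y = bc.data[:400], bc.target[:400]
test_x, test_y = bc.data[400:], bc.target[400:]

# Restrict the tree to a maximum depth of 2, as in the reported results
dtc = DecisionTreeClassifier(max_depth=2)
dtc.fit(train_x, train_y)
print('Accuracy: %.2f'%metrics.accuracy_score(test_y, dtc.predict(test_x)))

# Export the tree in graphviz (.dot) format for visualization
export_graphviz(dtc, out_file='tree.dot', feature_names=bc.feature_names,
                class_names=bc.target_names, impurity=False)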

K-Nearest Neighbors

k-Nearest Neighbors (k-NN) is a relatively simple machine learning algorithm. Each instance is classified according to the majority class among its k nearest training examples. In regression, the average value of the neighbors is used instead. Scikit-learn's implementation lies within the library's sklearn.neighbors package. Following the library's naming convention, KNeighborsClassifier implements the classification version of the algorithm and KNeighborsRegressor implements the regression version. Using them in our examples, the regressor generates an R2 of 0.58 with an MSE of 2342, while the classifier achieves 93% accuracy. The following table shows the k-NN confusion matrix for the breast cancer dataset:

n = 169              Predicted: Malignant    Predicted: Benign
Target: Malignant    37                      2
Target: Benign       9                       121
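
A minimal sketch of the two k-NN variants on the familiar splits; the text does not specify the number of neighbors, so the scikit-learn default (5) is used here, and exact scores may therefore differ slightly from those quoted:

# --- k-NN on the diabetes and breast cancer datasets (sketch) ---
from sklearn.datasets import load_diabetes, load_breast_cancer
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn import metrics

# Regression on the diabetes dataset
diabetes = load_diabetes()
train_x, train_y = diabetes.data[:400], diabetes.target[:400]
test_x, test_y = diabetes.data[400:], diabetes.target[400:]
knr = KNeighborsRegressor()  # n_neighbors=5 by default
knr.fit(train_x, train_y)
print('R-squared: %.2f'%metrics.r2_score(test_y, knr.predict(test_x)))

# Classification on the breast cancer dataset
bc = load_breast_cancer()
train_x, train_y = bc.data[:400], bc.target[:400]
test_x, test_y = bc.data[400:], bc.target[400:]
knc = KNeighborsClassifier()  # n_neighbors=5 by default
knc.fit(train_x, train_y)
print('Accuracy: %.2f'%metrics.accuracy_score(test_y, knc.predict(test_x)))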

K-means

K-means is a clustering algorithm that shares similarities with k-NN. A number of cluster centers are produced, and each instance is assigned to its nearest center. After all instances have been assigned, each cluster's centroid becomes its new center, and the process is repeated until the algorithm converges to a stable solution. In scikit-learn, this algorithm is implemented in sklearn.cluster.KMeans. We can try to cluster the first two features of the breast cancer dataset: the mean radius and the mean texture computed from the FNA image.

First, we load the required data and libraries, while retaining only the first two features of the dataset:

# --- SECTION 1 ---
# Libraries and data loading
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
bc = load_breast_cancer()
bc.data = bc.data[:, :2]

Then, we fit the clustering model to the data. Note that we don't have to split the data into train and test sets:

# --- SECTION 2 ---
# Instantiate and train
km = KMeans(n_clusters=3)
km.fit(bc.data)

Following that, we create a two-dimensional mesh and cluster every point, in order to plot the cluster areas and boundaries:

# --- SECTION 3 ---
# Create a point mesh to plot cluster areas
# Step size of the mesh.
h = .02
# Plot the decision boundary. For that, we will assign a color to each point in the mesh
x_min, x_max = bc.data[:, 0].min() - 1, bc.data[:, 0].max() + 1
y_min, y_max = bc.data[:, 1].min() - 1, bc.data[:, 1].max() + 1
# Create the actual mesh and cluster it
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = km.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           aspect='auto', origin='lower')

Finally, we plot the actual data, color-mapped to its respective clusters:

# --- SECTION 4 ---
# Plot the actual data
c = km.predict(bc.data)
r = c == 0
b = c == 1
g = c == 2
plt.scatter(bc.data[r, 0], bc.data[r, 1], label='cluster 1')
plt.scatter(bc.data[b, 0], bc.data[b, 1], label='cluster 2')
plt.scatter(bc.data[g, 0], bc.data[g, 1], label='cluster 3')
plt.title('K-means')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.xlabel(bc.feature_names[0])
plt.ylabel(bc.feature_names[1])
plt.legend()
plt.show()

The result is a two-dimensional image with color-coded boundaries of each cluster, as well as the instances:

K-means clustering of the first two features of the breast cancer dataset