Hands-On Ensemble Learning with Python

By: George Kyriakides, Konstantinos G. Margaritis
Overview of this book

Ensembling is a technique for combining two or more machine learning algorithms, similar or dissimilar, to create a model that delivers superior predictive power. This book demonstrates how you can use a variety of weak algorithms to make a strong predictive model. With its hands-on approach, you'll get up to speed not only with the basic theory but also with the application of different ensemble learning techniques. Using examples and real-world datasets, you'll be able to produce better machine learning models to solve supervised learning problems such as classification and regression. You'll then go on to leverage ensemble learning techniques, such as ensemble clustering, to produce unsupervised machine learning models. As you progress, the chapters cover different machine learning algorithms that are widely used in practice to make predictions and classifications. You'll also get to grips with Python libraries such as scikit-learn and Keras for implementing different ensemble models. By the end of this book, you will be well versed in ensemble learning and have the skills to understand which ensemble method is required for which problem, and to successfully implement these methods in real-world scenarios.
Table of Contents (20 chapters)

  • Section 1: Introduction and Required Software Tools
  • Section 2: Non-Generative Methods
  • Section 3: Generative Methods
  • Section 4: Clustering
  • Section 5: Real World Applications

Learning from data

Data is the raw ingredient of machine learning. Processing data can produce information; for example, measuring the height of a portion of a school's students (data) and calculating their average (processing) gives us an idea of the whole school's average height (information). If we process the data further, for example, by grouping males and females and calculating two averages, one for each group, we gain more information, as we then have an idea of the average height of the school's males and of its females. Machine learning strives to produce the most information possible from any given data. In this example, we produced a very basic predictive model: by calculating the two averages, we can predict the average height of any student just by knowing whether the student is male or female.
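The two-averages model from this example can be sketched in a few lines of plain Python. The height values below are invented purely for illustration:

```python
# Hypothetical (data point, height in cm) measurements; the numbers are made up.
data = [
    ("male", 170), ("female", 160), ("male", 180),
    ("female", 165), ("male", 175), ("female", 155),
]

# "Process" the data: collect heights per group, then average each group.
groups = {}
for sex, height in data:
    groups.setdefault(sex, []).append(height)

model = {sex: sum(heights) / len(heights) for sex, heights in groups.items()}

def predict(sex):
    """Predict a student's height as the average height of their group."""
    return model[sex]

print(predict("male"))    # 175.0
print(predict("female"))  # 160.0
```

Grouping and averaging plays the role of "training" here, while looking up a group's average is "prediction" — exactly the feature-to-target mapping described above, in miniature.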

The set of data that a machine learning algorithm is tasked with processing is called the problem's dataset. In our example, the dataset consists of height measurements (in centimeters) and each student's sex (male/female). In machine learning, input variables are called features and output variables are called targets. In this dataset, the features of our predictive model consist solely of the students' sex, while the target is the students' height in centimeters. The predictive model that is produced, mapping features to targets, will be referred to simply as the model from now on, unless otherwise specified. Each data point is called an instance; in this problem, each student is an instance of the dataset.

When the target is a continuous variable (a number), it presents a regression problem, as the aim is to regress the target on the features. When the target is a set of categories, it presents a classification problem, as we try to assign each instance to a category or class.

Note that, in classification problems, the target class can be represented by a number; this does not mean that it is a regression problem. The most useful way to determine whether it is a regression problem is to think about whether the instances can be ordered by their targets. In our example, the target is height, so we can order the students from tallest to shortest, as 100 cm is less than 110 cm. As a counterexample, if the target was their favorite color, we could represent each color by a number, but we could not order them. Even if we represented red as one and blue as two, we could not say that red is "before" or "less than" blue. Thus, this counterexample is a classification problem.

Popular machine learning datasets

Machine learning relies on data in order to produce high-performing models. Without data, it's not even possible to create models. In this section, we'll present some popular machine learning datasets, which we will utilize throughout this book.

Diabetes

The diabetes dataset concerns 442 individual diabetes patients and the progression of the disease one year after a baseline measurement. The dataset consists of 10 features, which are the patient's age, sex, body mass index (bmi), average blood pressure (bp), and six measurements of their blood serum. The dataset target is the progression of the disease one year after the baseline measurement. This is a regression dataset, as the target is a number.

In this book, the dataset features are mean-centered and scaled such that the dataset sum of squares for each feature equals one. The following table depicts a sample of the diabetes dataset:

age     sex     bmi     bp      s1      s2      s3      s4      s5      s6      target
0.04    0.05    0.06    0.02    -0.04   -0.03   -0.04   0.00    0.02    -0.02   151
0.00    -0.04   -0.05   -0.03   -0.01   -0.02   0.07    -0.04   -0.07   -0.09   75
0.09    0.05    0.04    -0.01   -0.05   -0.03   -0.03   0.00    0.00    -0.03   141
-0.09   -0.04   -0.01   -0.04   0.01    0.02    -0.04   0.03    0.02    -0.01   206
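The diabetes dataset ships with scikit-learn, already mean-centered and scaled as described above, so loading it takes only a couple of lines:

```python
from sklearn.datasets import load_diabetes

# Load the dataset; by default, the features are mean-centered and scaled
# so that each feature column has a sum of squares equal to one.
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

print(X.shape)                # (442, 10): 442 patients, 10 features
print(diabetes.feature_names) # age, sex, bmi, bp, and the six serum measurements
```

The `target` array holds the disease progression one year after the baseline measurement, which is what a regressor would be trained to predict.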

Breast cancer

The breast cancer dataset concerns 569 biopsies of malignant and benign tumors. The dataset provides 30 features extracted from images of fine-needle aspiration biopsies that describe cell nuclei. The images provide information about the shape, size, and texture of each cell nucleus. Furthermore, for each characteristic, three distinct values are provided: the mean, the standard error, and the worst (largest) value. This ensures that, for each image, the cell population is adequately described.

The dataset target concerns the diagnosis, that is, whether a tumor is malignant or benign. Thus, this is a classification dataset. The available features are listed as follows:

  • Mean radius
  • Mean texture
  • Mean perimeter
  • Mean area
  • Mean smoothness
  • Mean compactness
  • Mean concavity
  • Mean concave points
  • Mean symmetry
  • Mean fractal dimension
  • Radius error
  • Texture error
  • Perimeter error
  • Area error
  • Smoothness error
  • Compactness error
  • Concavity error
  • Concave points error
  • Symmetry error
  • Fractal dimension error
  • Worst radius
  • Worst texture
  • Worst perimeter
  • Worst area
  • Worst smoothness
  • Worst compactness
  • Worst concavity
  • Worst concave points
  • Worst symmetry
  • Worst fractal dimension
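This dataset is also bundled with scikit-learn, with the two diagnosis classes encoded as a numeric target:

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset; the target encodes the diagnosis as 0/1.
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(X.shape)                   # (569, 30): 569 biopsies, 30 features
print(list(cancer.target_names)) # ['malignant', 'benign']
```

Note that although the target is stored as a number, the two values are unordered class labels, so this is a classification problem, as discussed earlier.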

Handwritten digits

The handwritten digits dataset is one of the most famous image recognition datasets, often used as a smaller, faster alternative to the well-known MNIST dataset (whose images are 28 x 28 pixels). It consists of square images, 8 x 8 pixels each, containing a single handwritten digit. Thus, the features of each instance are an 8 by 8 matrix, containing each pixel's color in grayscale. The target consists of 10 classes, one for each digit from 0 to 9. This is a classification dataset. The following figure is a sample from the handwritten digits dataset:

Sample of the handwritten digit dataset
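Assuming the scikit-learn version of the digits dataset, it exposes both the raw 8 x 8 images and a flattened 64-feature representation suitable for most estimators:

```python
from sklearn.datasets import load_digits

# Load the dataset; 'images' holds the 8 x 8 grayscale matrices,
# while 'data' holds the same pixels flattened into 64 features.
digits = load_digits()

print(digits.images.shape)  # (1797, 8, 8)
print(digits.data.shape)    # (1797, 64)
print(digits.target_names)  # the ten classes, digits 0 through 9
```

Classifiers work on the flattened `data` array, while the `images` array is convenient for plotting samples such as the figure above.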