
Chapter 12. Training Artificial Neural Networks for Image Recognition

As you may know, deep learning is getting a lot of press and is without any doubt the hottest topic in the machine learning field. Deep learning can be understood as a set of algorithms that were developed to train artificial neural networks with many layers most efficiently. In this chapter, you will learn the basic concepts of artificial neural networks so that you will be well equipped to further explore the most exciting areas of research in the machine learning field, as well as the advanced Python-based deep learning libraries that are currently being developed.

The topics that we will cover are as follows:

  • Getting a conceptual understanding of multi-layer neural networks
  • Training neural networks for image classification
  • Implementing the powerful backpropagation algorithm
  • Debugging neural network implementations

Modeling complex functions with artificial neural networks

At the beginning of this book, we started our journey through machine learning algorithms with artificial neurons in Chapter 2, Training Machine Learning Algorithms for Classification. Artificial neurons represent the building blocks of the multi-layer artificial neural networks that we are going to discuss in this chapter. The basic concept behind artificial neural networks was built upon hypotheses and models of how the human brain works to solve complex problems. Although artificial neural networks have gained a lot of popularity in recent years, early studies of neural networks go back to the 1940s, when Warren McCulloch and Walter Pitts first described how neurons could work. However, in the decades that followed the first implementation of the McCulloch-Pitts neuron model (Rosenblatt's perceptron in the 1950s), many researchers and machine learning practitioners slowly began to lose interest in neural networks, since no one had a good solution for training a neural network with multiple layers. Interest in neural networks was eventually rekindled in 1986, when D.E. Rumelhart, G.E. Hinton, and R.J. Williams were involved in the (re)discovery and popularization of the backpropagation algorithm for training neural networks more efficiently, which we will discuss in more detail later in this chapter (Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). Learning Representations by Back-propagating Errors. Nature 323 (6088): 533–536).

During the previous decade, many more major breakthroughs resulted in what we now call deep learning algorithms, which can be used to create feature detectors from unlabeled data to pre-train deep neural networks, that is, neural networks that are composed of many layers. Neural networks are a hot topic not only in academic research, but also in big technology companies such as Facebook, Microsoft, and Google, which invest heavily in artificial neural networks and deep learning research. As of today, complex neural networks powered by deep learning algorithms are considered state of the art when it comes to complex problem solving such as image and voice recognition. Popular examples of products in our everyday life that are powered by deep learning are Google's image search and Google Translate, a smartphone application that can automatically recognize text in images for real-time translation into 20 languages (http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html).

Many more exciting applications of deep neural networks are under active development at major tech companies, for example, Facebook's DeepFace for tagging images (Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition CVPR, 2014 IEEE Conference, pages 1701–1708) and Baidu's DeepSpeech, which is able to handle voice queries in Mandarin (A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. DeepSpeech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014). In addition, the pharmaceutical industry recently started to use deep learning techniques for drug discovery and toxicity prediction, and research has shown that these novel techniques substantially exceed the performance of traditional methods for virtual screening (T. Unterthiner, A. Mayr, G. Klambauer, and S. Hochreiter. Toxicity prediction using deep learning. arXiv preprint arXiv:1503.01445, 2015).

Single-layer neural network recap

This chapter is all about multi-layer neural networks, how they work, and how to train them to solve complex problems. However, before we dig deeper into a particular multi-layer neural network architecture, let's briefly reiterate some of the concepts of single-layer neural networks that we introduced in Chapter 2, Training Machine Learning Algorithms for Classification, namely, the ADAptive LInear NEuron (Adaline) algorithm that is shown in the following figure:

[Figure: Schematic of the Adaline single-layer neural network from Chapter 2]

In Chapter 2, Training Machine Learning Algorithms for Classification, we implemented the Adaline algorithm to perform binary classification, and we used a gradient descent optimization algorithm to learn the weight coefficients of the model. In every epoch (pass over the training set), we updated the weight vector $\mathbf{w}$ using the following update rule:

$$\mathbf{w} := \mathbf{w} + \Delta\mathbf{w}, \quad \text{where } \Delta\mathbf{w} = -\eta \nabla J(\mathbf{w})$$

In other words, we computed the gradient based on the whole training set and updated the weights of the model by taking a step in the opposite direction of the gradient $\nabla J(\mathbf{w})$. In order to find the optimal weights of the model, we optimized an objective function that we defined as the Sum of Squared Errors (SSE) cost function $J(\mathbf{w})$. Furthermore, we multiplied the gradient by a factor, the learning rate $\eta$, which we chose carefully to balance the speed of learning against the risk of overshooting the global minimum of the cost function.

In gradient descent optimization, we updated all weights simultaneously after each epoch, and we defined the partial derivative for each weight $w_j$ in the weight vector $\mathbf{w}$ as follows:

$$\frac{\partial}{\partial w_j} J(\mathbf{w}) = -\sum_i \left(y^{(i)} - a^{(i)}\right) x_j^{(i)}$$

Here, $y^{(i)}$ is the target class label of a particular sample $x^{(i)}$, and $a^{(i)}$ is the activation of the neuron, which is a linear function in the special case of Adaline. Furthermore, we defined the activation function $\phi(\cdot)$ as follows:

$$\phi(z) = z = a$$

Here, the net input $z$ is a linear combination of the weights that are connecting the input to the output layer:

$$z = \sum_j w_j x_j = \mathbf{w}^T \mathbf{x}$$

While we used the activation $\phi(z)$ to compute the gradient update, we implemented a threshold function (Heaviside function) to squash the continuous-valued output into binary class labels for prediction:

$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

Note

Note that although Adaline consists of two layers, one input layer and one output layer, it is called a single-layer network because of its single link between the input and output layers.
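To make this recap concrete, here is a minimal sketch of one possible Adaline-style gradient descent loop in NumPy; the toy dataset, learning rate, and number of epochs are arbitrary illustration values, not taken from Chapter 2:

import numpy as np

# Toy dataset: 4 samples, 2 features, class labels in {1, -1}
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])

eta = 0.01                          # learning rate
w = np.zeros(1 + X.shape[1])        # weight vector; w[0] is the bias unit

for epoch in range(20):
    z = w[0] + X.dot(w[1:])         # net input z = w^T x
    a = z                           # linear activation phi(z) = z (Adaline)
    errors = y - a
    # delta_w = -eta * grad J(w), where grad J(w)_j = -sum_i (y_i - a_i) x_ij
    w[1:] += eta * X.T.dot(errors)
    w[0] += eta * errors.sum()

# threshold (unit step) function for prediction
y_pred = np.where(w[0] + X.dot(w[1:]) >= 0.0, 1, -1)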

Introducing the multi-layer neural network architecture

In this section, we will see how to connect multiple single neurons to a multi-layer feedforward neural network; this special type of network is also called a multi-layer perceptron (MLP). The following figure explains the concept of an MLP consisting of three layers: one input layer, one hidden layer, and one output layer. The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer. If such a network has more than one hidden layer, we also call it a deep artificial neural network.

[Figure: A multi-layer perceptron with one input layer, one hidden layer, and one output layer]

Note

We could add an arbitrary number of hidden layers to the MLP to create deeper network architectures. Practically, we can think of the number of layers and units in a neural network as additional hyperparameters that we want to optimize for a given problem task using the cross-validation that we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.

However, the error gradients that we will calculate later via backpropagation would become increasingly small as more layers are added to a network. This vanishing gradient problem makes the model learning more challenging. Therefore, special algorithms have been developed to pretrain such deep neural network structures, which is called deep learning.

As shown in the preceding figure, we denote the $i$th activation unit in the $l$th layer as $a_i^{(l)}$, and the activation units $a_0^{(1)}$ and $a_0^{(2)}$ are the bias units, which we set equal to 1. The activation of the units in the input layer is just its input plus the bias unit:

$$\mathbf{a}^{(1)} = \begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \\ \vdots \\ a_m^{(1)} \end{bmatrix} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ \vdots \\ x_m^{(i)} \end{bmatrix}$$

Each unit in layer $l$ is connected to all units in layer $l+1$ via a weight coefficient. For example, the connection between the $k$th unit in layer $l$ and the $j$th unit in layer $l+1$ would be written as $w_{j,k}^{(l)}$. Please note that the superscript $i$ in $x_m^{(i)}$ stands for the $i$th sample, not the $i$th layer. In the following paragraphs, we will often omit the superscript $i$ for clarity.

While one unit in the output layer would suffice for a binary classification task, we saw a more general form of a neural network in the preceding figure, which allows us to perform multi-class classification via a generalization of the One-vs-All (OvA) technique. To better understand how this works, remember the one-hot representation of categorical variables that we introduced in Chapter 4, Building Good Training Sets – Data Preprocessing. For example, we would encode the three class labels in the familiar Iris dataset (0=Setosa, 1=Versicolor, 2=Virginica) as follows:

$$0 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad 1 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad 2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

This one-hot vector representation allows us to tackle classification tasks with an arbitrary number of unique class labels present in the training set.
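As a minimal sketch of this encoding (assuming only NumPy; the one_hot helper here is purely illustrative, although it mirrors the _encode_labels method we will implement later in this chapter), we can convert a vector of integer class labels into a column-wise one-hot matrix as follows:

import numpy as np

def one_hot(y, n_classes):
    """Encode integer class labels as an (n_classes x n_samples) one-hot matrix."""
    onehot = np.zeros((n_classes, y.shape[0]))
    for idx, val in enumerate(y):
        onehot[val, idx] = 1.0
    return onehot

y = np.array([0, 1, 2, 2])     # e.g., Setosa, Versicolor, Virginica, Virginica
print(one_hot(y, 3))
# The columns are the one-hot vectors for the labels 0, 1, 2, 2:
# row 0 -> [1, 0, 0, 0], row 1 -> [0, 1, 0, 0], row 2 -> [0, 0, 1, 1]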

If you are new to neural network representations, the terminology around the indices (subscripts and superscripts) may look a little bit confusing at first. You may wonder why we wrote $w_{j,k}^{(l)}$ and not $w_{k,j}^{(l)}$ to refer to the weight coefficient that connects the $k$th unit in layer $l$ to the $j$th unit in layer $l+1$. What may seem a little bit quirky at first will make much more sense in later sections when we vectorize the neural network representation. For example, we will summarize the weights that connect the input and hidden layers by a matrix $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times [m+1]}$, where $h$ is the number of hidden units and $m+1$ is the number of input units plus the bias unit. Since it is important to internalize this notation to follow the concepts later in this chapter, let's summarize what we just discussed in a descriptive illustration of a simplified 3-4-3 multi-layer perceptron:

[Figure: Descriptive illustration of a simplified 3-4-3 multi-layer perceptron]

Activating a neural network via forward propagation

In this section, we will describe the process of forward propagation to calculate the output of an MLP model. To understand how it fits into the context of learning an MLP model, let's summarize the MLP learning procedure in three simple steps:

  1. Starting at the input layer, we forward propagate the patterns of the training data through the network to generate an output.
  2. Based on the network's output, we calculate the error that we want to minimize using a cost function that we will describe later.
  3. We backpropagate the error, find its derivative with respect to each weight in the network, and update the model.

Finally, after repeating the steps for multiple epochs and learning the weights of the MLP, we use forward propagation to calculate the network output and apply a threshold function to obtain the predicted class labels in the one-hot representation, which we described in the previous section.

Now, let's walk through the individual steps of forward propagation to generate an output from the patterns in the training data. Since each unit in the hidden layer is connected to all units in the input layer, we first calculate the activation $a_1^{(2)}$ as follows:

$$z_1^{(2)} = a_0^{(1)} w_{1,0}^{(1)} + a_1^{(1)} w_{1,1}^{(1)} + \cdots + a_m^{(1)} w_{1,m}^{(1)}$$
$$a_1^{(2)} = \phi\left(z_1^{(2)}\right)$$

Here, $z_1^{(2)}$ is the net input and $\phi(\cdot)$ is the activation function, which has to be differentiable to learn the weights that connect the neurons using a gradient-based approach. To be able to solve complex problems such as image classification, we need nonlinear activation functions in our MLP model, for example, the sigmoid (logistic) activation function that we used in logistic regression in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn:

$$\phi(z) = \frac{1}{1 + e^{-z}}$$

As we can remember, the sigmoid function is an S-shaped curve that maps the net input $z$ onto a logistic distribution in the range 0 to 1, which intersects the y axis at $\phi(z) = 0.5$ for $z = 0$, as shown in the following graph:

[Figure: The sigmoid (logistic) activation function plotted over the net input z]

The MLP is a typical example of a feedforward artificial neural network. The term feedforward refers to the fact that each layer serves as the input to the next layer without loops, in contrast to recurrent neural networks, an architecture that we will discuss later in this chapter. The term multi-layer perceptron may sound a little bit confusing, since the artificial neurons in this network architecture are typically sigmoid units, not perceptrons. Intuitively, we can think of the neurons in the MLP as logistic regression units that return values in the continuous range between 0 and 1.
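As a quick numerical sanity check of this intuition, the following minimal sketch (assuming only NumPy) evaluates the sigmoid at a few net input values and confirms that the output stays between 0 and 1 and equals 0.5 at z = 0:

import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation: phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))
# values close to 0.00005, 0.269, 0.5, 0.731, and 0.99995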

For purposes of code efficiency and readability, we will now write the activation in a more compact form using the concepts of basic linear algebra, which will allow us to vectorize our code implementation via NumPy rather than writing multiple nested and expensive Python for loops:

$$\mathbf{z}^{(2)} = \mathbf{W}^{(1)} \mathbf{a}^{(1)}$$
$$\mathbf{a}^{(2)} = \phi\left(\mathbf{z}^{(2)}\right)$$

Here, $\mathbf{a}^{(1)}$ is our $[m+1] \times 1$ dimensional feature vector of a sample $x^{(i)}$ plus the bias unit. $\mathbf{W}^{(1)}$ is an $h \times [m+1]$ dimensional weight matrix, where $h$ is the number of hidden units in our neural network. After matrix-vector multiplication, we obtain the $h \times 1$ dimensional net input vector $\mathbf{z}^{(2)}$ to calculate the activation $\mathbf{a}^{(2)}$ (where $\mathbf{a}^{(2)} \in \mathbb{R}^{h \times 1}$). Furthermore, we can generalize this computation to all $n$ samples in the training set:

$$\mathbf{Z}^{(2)} = \mathbf{W}^{(1)} \left[\mathbf{A}^{(1)}\right]^T$$

Here, $\mathbf{A}^{(1)}$ is now an $n \times [m+1]$ matrix, and the matrix-matrix multiplication will result in an $h \times n$ dimensional net input matrix $\mathbf{Z}^{(2)}$. Finally, we apply the activation function $\phi(\cdot)$ to each value in the net input matrix to get the $h \times n$ activation matrix $\mathbf{A}^{(2)}$ for the next layer (here, the output layer):

$$\mathbf{A}^{(2)} = \phi\left(\mathbf{Z}^{(2)}\right)$$

Similarly, we can rewrite the activation of the output layer in the vectorized form:

$$\mathbf{Z}^{(3)} = \mathbf{W}^{(2)} \mathbf{A}^{(2)}$$

Here, we multiply the $t \times h$ matrix $\mathbf{W}^{(2)}$ ($t$ is the number of output units) by the $h \times n$ dimensional matrix $\mathbf{A}^{(2)}$ to obtain the $t \times n$ dimensional matrix $\mathbf{Z}^{(3)}$ (the columns in this matrix represent the outputs for each sample).

Lastly, we apply the sigmoid activation function to obtain the continuous valued output of our network:

$$\mathbf{A}^{(3)} = \phi\left(\mathbf{Z}^{(3)}\right), \quad \mathbf{A}^{(3)} \in \mathbb{R}^{t \times n}$$
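To make the vectorized notation concrete, here is a minimal sketch (assuming only NumPy, with arbitrary dimensions and random weights) that propagates n = 5 samples with m = 3 features through h = 4 hidden units and t = 3 output units; note that, unlike the NeuralNetMLP implementation later in this chapter, it omits the bias unit of the hidden layer to match the $t \times h$ notation used above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(1)
n, m, h, t = 5, 3, 4, 3                        # samples, features, hidden units, output units

X = rng.rand(n, m)                             # n x m input samples
A1 = np.hstack((np.ones((n, 1)), X))           # add bias column: A1 is n x (m+1)
W1 = rng.uniform(-1.0, 1.0, size=(h, m + 1))   # input-to-hidden weights: h x (m+1)
W2 = rng.uniform(-1.0, 1.0, size=(t, h))       # hidden-to-output weights: t x h

Z2 = W1.dot(A1.T)                              # net input of the hidden layer: h x n
A2 = sigmoid(Z2)                               # hidden activations: h x n
Z3 = W2.dot(A2)                                # net input of the output layer: t x n
A3 = sigmoid(Z3)                               # network output: t x n

print(Z2.shape, A2.shape, Z3.shape, A3.shape)  # (4, 5) (4, 5) (3, 5) (3, 5)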

Classifying handwritten digits

In the previous section, we covered a lot of the theory around neural networks, which can be a little bit overwhelming if you are new to this topic. Before we continue with the discussion of the algorithm for learning the weights of the MLP model, backpropagation, let's take a short break from the theory and see a neural network in action.

Note

Neural network theory can be quite complex, thus I want to recommend two additional resources that cover some of the concepts that we discuss in this chapter in more detail:

T. Hastie, J. Friedman, and R. Tibshirani. The Elements of Statistical Learning, Volume 2. Springer, 2009.

C. M. Bishop et al. Pattern Recognition and Machine Learning, Volume 1. Springer New York, 2006.

In this section, we will train our first multi-layer neural network to classify handwritten digits from the popular MNIST dataset (short for Mixed National Institute of Standards and Technology database) that has been constructed by Yann LeCun et al. and serves as a popular benchmark dataset for machine learning algorithms (Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998).

Obtaining the MNIST dataset

The MNIST dataset is publicly available at http://yann.lecun.com/exdb/mnist/ and consists of the following four parts:

  • Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, and 60,000 samples)
  • Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, and 60,000 labels)
  • Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB unzipped, and 10,000 samples)
  • Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, and 10,000 labels)

The MNIST dataset was constructed from two datasets of the US National Institute of Standards and Technology (NIST). The training set consists of handwritten digits from 250 different people, 50 percent high school students, and 50 percent employees from the Census Bureau. Note that the test set contains handwritten digits from different people following the same split.

After downloading the files, I recommend unzipping the files using the Unix/Linux gzip tool from the command line terminal for efficiency using the following command in your local MNIST download directory:

gzip *ubyte.gz -d

Alternatively, you could use your favorite unzipping tool if you are working with a machine running on Microsoft Windows. The images are stored in byte format, and we will read them into NumPy arrays that we will use to train and test our MLP implementation:

import os
import struct
import numpy as np

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, 
                               '%s-labels-idx1-ubyte' 
                                % kind)
    images_path = os.path.join(path, 
                               '%s-images-idx3-ubyte' 
                               % kind)
        
    with open(labels_path, 'rb') as lbpath:
        # read the magic number and the number of labels from the 8-byte header
        magic, n = struct.unpack('>II', 
                                 lbpath.read(8))
        labels = np.fromfile(lbpath, 
                             dtype=np.uint8)

    with open(images_path, 'rb') as imgpath:
        # read the magic number, image count, and row/column dimensions
        magic, num, rows, cols = struct.unpack(">IIII", 
                                               imgpath.read(16))
        # unroll each 28x28 image into a 784-dimensional row vector
        images = np.fromfile(imgpath, 
                    dtype=np.uint8).reshape(len(labels), 784)
 
    return images, labels

The load_mnist function returns two arrays, the first being an $n \times m$ dimensional NumPy array (images), where $n$ is the number of samples and $m$ is the number of features. The training dataset consists of 60,000 training digits and the test set contains 10,000 samples, respectively. The images in the MNIST dataset consist of $28 \times 28$ pixels, and each pixel is represented by a gray scale intensity value. Here, we unroll the $28 \times 28$ pixels into 1D row vectors, which represent the rows in our image array (784 per row or image). The second array (labels) returned by the load_mnist function contains the corresponding target variable, the class labels (integers 0-9) of the handwritten digits.

The way we read in the image might seem a little bit strange at first:

magic, n = struct.unpack('>II', lbpath.read(8))
labels = np.fromfile(lbpath, dtype=np.uint8)

To understand how these two lines of code work, let's take a look at the dataset description from the MNIST website:

[Figure: Excerpt from the MNIST website describing the IDX file format of the label and image files]

Using the two lines of the preceding code, we first read in the magic number, which is a description of the file protocol as well as the number of items (n) from the file buffer before we read the following bytes into a NumPy array using the fromfile method. The fmt parameter value >II that we passed as an argument to struct.unpack has two parts:

  • >: This is big-endian byte order (it defines the order in which a sequence of bytes is stored); if you are unfamiliar with the terms big-endian and little-endian, you can find an excellent article about endianness on Wikipedia (https://en.wikipedia.org/wiki/Endianness).
  • I: This is an unsigned integer.
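To see these format characters in action without touching the downloaded files, here is a minimal self-contained example that packs a fake 8-byte header in memory and unpacks it again (2049 and 60,000 are simply illustrative values; 2049 happens to be the magic number used by the MNIST label files):

import struct

# two big-endian unsigned 32-bit integers: a magic number and an item count
header = struct.pack('>II', 2049, 60000)
print(len(header))                 # 8 bytes

magic, n = struct.unpack('>II', header)
print(magic, n)                    # 2049 60000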

By executing the following code, we will now load the 60,000 training instances as well as the 10,000 test samples from the mnist directory where we unzipped the MNIST dataset:

>>> X_train, y_train = load_mnist('mnist', kind='train')
>>> print('Rows: %d, columns: %d' 
...        % (X_train.shape[0], X_train.shape[1]))
Rows: 60000, columns: 784

>>> X_test, y_test = load_mnist('mnist', kind='t10k')
>>> print('Rows: %d, columns: %d'
...        % (X_test.shape[0], X_test.shape[1]))
Rows: 10000, columns: 784

To get an idea of what the images in MNIST look like, let's visualize examples of the digits 0-9 after reshaping the 784-pixel vectors from our feature matrix into the original 28 × 28 image, which we can plot via matplotlib's imshow function:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(10):
...    img = X_train[y_train == i][0].reshape(28, 28)
...    ax[i].imshow(img, cmap='Greys', interpolation='nearest')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()

We should now see a plot of the $2 \times 5$ subfigures showing a representative image of each unique digit:

[Figure: One representative MNIST image for each of the digits 0-9]

In addition, let's also plot multiple examples of the same digit to see how different those handwriting examples really are:

>>> fig, ax = plt.subplots(nrows=5, 
...                        ncols=5, 
...                        sharex=True, 
...                        sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = X_train[y_train == 7][i].reshape(28, 28)
...     ax[i].imshow(img, cmap='Greys', interpolation='nearest')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()

After executing the code, we should now see the first 25 variants of the digit 7.

[Figure: The first 25 training examples of the digit 7]

Optionally, we can save the MNIST image data and labels as CSV files to open them in programs that do not support their special byte format. However, we should be aware that the CSV file format will take up substantially more space on your local drive, as listed here:

  • train_img.csv: 109.5 MB
  • train_labels.csv: 120 KB
  • test_img.csv: 18.3 MB
  • test_labels.csv: 20 KB

If we decide to save those CSV files, we can execute the following code in our Python session after loading the MNIST data into NumPy arrays:

>>> np.savetxt('train_img.csv', X_train, 
...            fmt='%i', delimiter=',')
>>> np.savetxt('train_labels.csv', y_train,
...            fmt='%i', delimiter=',')
>>> np.savetxt('test_img.csv', X_test,
...            fmt='%i', delimiter=',')
>>> np.savetxt('test_labels.csv', y_test, 
...            fmt='%i', delimiter=',')

Once we have saved the CSV files, we can load them back into Python using NumPy's genfromtxt function:

>>> X_train = np.genfromtxt('train_img.csv', 
...                         dtype=int, delimiter=',')
>>> y_train = np.genfromtxt('train_labels.csv',
...                         dtype=int, delimiter=',')
>>> X_test = np.genfromtxt('test_img.csv',
...                        dtype=int, delimiter=',')
>>> y_test = np.genfromtxt('test_labels.csv',
...                        dtype=int, delimiter=',')

However, it will take substantially longer to load the MNIST data from the CSV files, thus I recommend you stick to the original byte format if possible.

Implementing a multi-layer perceptron

In this subsection, we will now implement the code of an MLP with one input, one hidden, and one output layer to classify the images in the MNIST dataset. I have tried to keep the code as simple as possible. However, it may seem a little bit complicated at first, and I encourage you to download the sample code for this chapter from the Packt Publishing website, where you can find this MLP implementation annotated with comments and syntax highlighting for better readability. If you are not running the code from the accompanying IPython notebook, I recommend you copy it into a Python script file in your current working directory, for example, neuralnet.py, which you can then import into your current Python session via the following command:

from neuralnet import NeuralNetMLP

The code will contain parts that we have not talked about yet, such as the backpropagation algorithm, but most of the code should look familiar to you based on the Adaline implementation in Chapter 2, Training Machine Learning Algorithms for Classification, and the discussion of forward propagation in earlier sections. Do not worry if not all of the code makes immediate sense to you; we will follow up on certain parts later in this chapter. However, going over the code at this stage can make it easier to follow the theory later.

import numpy as np
from scipy.special import expit
import sys

class NeuralNetMLP(object):
    def __init__(self, n_output, n_features, n_hidden=30,
                 l1=0.0, l2=0.0, epochs=500, eta=0.001, 
                 alpha=0.0, decrease_const=0.0, shuffle=True, 
                 minibatches=1, random_state=None):
        np.random.seed(random_state)
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w1, self.w2 = self._initialize_weights()
        self.l1 = l1
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _encode_labels(self, y, k):
        onehot = np.zeros((k, y.shape[0]))
        for idx, val in enumerate(y):
            onehot[val, idx] = 1.0
        return onehot

    def _initialize_weights(self):
        w1 = np.random.uniform(-1.0, 1.0,       
                     size=self.n_hidden*(self.n_features + 1))
        w1 = w1.reshape(self.n_hidden, self.n_features + 1)
        w2 = np.random.uniform(-1.0, 1.0,
                     size=self.n_output*(self.n_hidden + 1))
        w2 = w2.reshape(self.n_output, self.n_hidden + 1)
        return w1, w2

    def _sigmoid(self, z):
        # expit is equivalent to 1.0/(1.0 + np.exp(-z))
        return expit(z)

    def _sigmoid_gradient(self, z):
        sg = self._sigmoid(z)
        return sg * (1 - sg)

    def _add_bias_unit(self, X, how='column'):
        if how == 'column':
            X_new = np.ones((X.shape[0], X.shape[1]+1))
            X_new[:, 1:] = X
        elif how == 'row':
            X_new = np.ones((X.shape[0]+1, X.shape[1]))
            X_new[1:, :] = X
        else:
            raise AttributeError('`how` must be `column` or `row`')
        return X_new

    def _feedforward(self, X, w1, w2):
        a1 = self._add_bias_unit(X, how='column')
        z2 = w1.dot(a1.T)
        a2 = self._sigmoid(z2)
        a2 = self._add_bias_unit(a2, how='row')
        z3 = w2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, z2, a2, z3, a3

    def _L2_reg(self, lambda_, w1, w2):
        return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2)\
                + np.sum(w2[:, 1:] ** 2))

    def _L1_reg(self, lambda_, w1, w2):
        return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum()\
                + np.abs(w2[:, 1:]).sum())

    def _get_cost(self, y_enc, output, w1, w2):
        term1 = -y_enc * (np.log(output))
        term2 = (1 - y_enc) * np.log(1 - output)
        cost = np.sum(term1 - term2)
        L1_term = self._L1_reg(self.l1, w1, w2)
        L2_term = self._L2_reg(self.l2, w1, w2)
        cost = cost + L1_term + L2_term
        return cost

    def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2):
        # backpropagation
        sigma3 = a3 - y_enc
        z2 = self._add_bias_unit(z2, how='row')
        sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2)
        sigma2 = sigma2[1:, :]
        grad1 = sigma2.dot(a1)
        grad2 = sigma3.dot(a2.T)

        # regularize
        grad1[:, 1:] += (w1[:, 1:] * (self.l1 + self.l2))
        grad2[:, 1:] += (w2[:, 1:] * (self.l1 + self.l2))

        return grad1, grad2

    def predict(self, X):
        a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2)
        y_pred = np.argmax(z3, axis=0)
        return y_pred

    def fit(self, X, y, print_progress=False):
        self.cost_ = []
        X_data, y_data = X.copy(), y.copy()
        y_enc = self._encode_labels(y, self.n_output)

        delta_w1_prev = np.zeros(self.w1.shape)
        delta_w2_prev = np.zeros(self.w2.shape)

        for i in range(self.epochs):

            # adaptive learning rate
            self.eta /= (1 + self.decrease_const*i)

            if print_progress:
                sys.stderr.write(
                        '\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()

            if self.shuffle:
                # shuffle the training data and one-hot labels together
                idx = np.random.permutation(y_data.shape[0])
                X_data, y_enc = X_data[idx], y_enc[:, idx]

            mini = np.array_split(range(
                         y_data.shape[0]), self.minibatches)
            for idx in mini:

                # feedforward
                a1, z2, a2, z3, a3 = self._feedforward(
                                     X_data[idx], self.w1, self.w2)
                cost = self._get_cost(y_enc=y_enc[:, idx],
                                      output=a3,
                                      w1=self.w1,
                                      w2=self.w2)
                self.cost_.append(cost)

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(a1=a1, a2=a2,
                                            a3=a3, z2=z2,
                                            y_enc=y_enc[:, idx],
                                            w1=self.w1,
                                            w2=self.w2)

                # update weights
                delta_w1, delta_w2 = self.eta * grad1,\
                                     self.eta * grad2
                self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev))
                delta_w1_prev, delta_w2_prev = delta_w1, delta_w2

        return self

Now, let's initialize a new 784-50-10 MLP, a neural network with 784 input units (n_features), 50 hidden units (n_hidden), and 10 output units (n_output):

>>> nn = NeuralNetMLP(n_output=10, 
...                   n_features=X_train.shape[1], 
...                   n_hidden=50, 
...                   l2=0.1, 
...                   l1=0.0, 
...                   epochs=1000, 
...                   eta=0.001,
...                   alpha=0.001,
...                   decrease_const=0.00001,
...                   shuffle=True,
...                   minibatches=50, 
...                   random_state=1)

As you may have noticed, by going over our preceding MLP implementation, we also implemented some additional features, which are summarized here:

  • l2: The $\lambda$ parameter for L2 regularization to decrease the degree of overfitting; equivalently, l1 is the $\lambda$ parameter for L1 regularization.
  • epochs: The number of passes over the training set.
  • eta: The learning rate $\eta$.
  • alpha: A parameter for momentum learning to add a factor of the previous gradient to the weight update for faster learning: $\Delta \mathbf{w}_t = \eta \nabla J(\mathbf{w}_t) + \alpha \Delta \mathbf{w}_{t-1}$ (where $t$ is the current time step or epoch).
  • decrease_const: The decrease constant $d$ for an adaptive learning rate that decreases over time for better convergence, $\frac{\eta}{1 + t \times d}$ (a short standalone sketch of the momentum and adaptive learning rate updates follows this list).
  • shuffle: Shuffling the training set prior to every epoch to prevent the algorithm from getting stuck in cycles.
  • minibatches: Splitting the training data into k mini-batches in each epoch. The gradient is computed for each mini-batch separately instead of over the entire training data for faster learning.
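The following standalone sketch (with a made-up, constant gradient; it is not part of the NeuralNetMLP class) illustrates how the alpha and decrease_const parameters enter the weight update:

import numpy as np

eta, alpha, decrease_const = 0.001, 0.001, 0.00001

w = np.zeros(3)                    # some weight vector
delta_w_prev = np.zeros(3)         # previous weight change (for momentum)
grad = np.array([1.0, -2.0, 0.5])  # made-up gradient, held constant for illustration

for epoch in range(3):
    eta_t = eta / (1 + decrease_const * epoch)   # adaptive learning rate
    delta_w = eta_t * grad                       # current gradient step
    w -= delta_w + alpha * delta_w_prev          # add a fraction of the previous step
    delta_w_prev = delta_w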

Next, we train the MLP using 60,000 samples from the already shuffled MNIST training dataset. Before you execute the following code, please note that training the neural network may take 10-30 minutes on standard desktop computer hardware:

>>> nn.fit(X_train, y_train, print_progress=True)
Epoch: 1000/1000

Similar to our previous Adaline implementation, we save the cost of each mini-batch in a cost_ list that we can now visualize to make sure that the optimization algorithm reached convergence. Since we trained with 50 mini-batches over 1000 epochs, the cost_ list contains 50,000 values (50 mini-batches × 1000 epochs), which is why the x axis in the following plot is labeled Epochs * 50. The code is as follows:

>>> plt.plot(range(len(nn.cost_)), nn.cost_)
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs * 50')
>>> plt.tight_layout()
>>> plt.show()

As we see in the following plot, the graph of the cost function looks very noisy. This is due to the fact that we trained our neural network with mini-batch learning, a variant of stochastic gradient descent.

[Figure: Cost per mini-batch over the course of training; the curve is noisy due to mini-batch learning]

Although we can already see in the plot that the optimization algorithm converged after approximately 800 epochs (40,000/50 = 800), let's plot a smoother version of the cost function against the number of epochs by averaging over the mini-batch intervals. The code is as follows:

>>> batches = np.array_split(range(len(nn.cost_)), 1000)
>>> cost_ary = np.array(nn.cost_)
>>> cost_avgs = [np.mean(cost_ary[i]) for i in batches]

>>> plt.plot(range(len(cost_avgs)),
...          cost_avgs, 
...          color='red')
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs')
>>> plt.tight_layout()
>>> plt.show()

The following plot gives us a clearer picture indicating that the training algorithm converged shortly after the 800th epoch:

[Figure: Cost averaged over the mini-batches of each epoch; the curve flattens shortly after the 800th epoch]

Now, let's evaluate the performance of the model by calculating the prediction accuracy:

>>> y_train_pred = nn.predict(X_train)
>>> acc = np.sum(y_train == y_train_pred, axis=0) / X_train.shape[0]
>>> print('Training accuracy: %.2f%%' % (acc * 100))
Training accuracy: 97.74%

As we can see, the model classifies most of the training digits correctly, but how does it generalize to data that it has not seen before? Let's calculate the accuracy on 10,000 images in the test dataset:

>>> y_test_pred = nn.predict(X_test)
>>> acc = np.sum(y_test == y_test_pred, axis=0) / X_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (acc * 100))
Test accuracy: 96.18%

Based on the small discrepancy between training and test accuracy, we can conclude that the model only slightly overfits the training data. To further fine-tune the model, we could change the number of hidden units, values of the regularization parameters, learning rate, values of the decrease constant, or the adaptive learning using the techniques that we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning (this is left as an exercise for the reader).
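For example, a possible starting point for such an experiment (the hyperparameter values below are illustrative guesses rather than tuned recommendations) is to re-instantiate the network with more hidden units and stronger L2 regularization, and evaluate it the same way as before:

>>> nn_tuned = NeuralNetMLP(n_output=10,
...                         n_features=X_train.shape[1],
...                         n_hidden=100,          # more hidden units
...                         l2=0.5,                # stronger L2 regularization
...                         l1=0.0,
...                         epochs=1000,
...                         eta=0.001,
...                         alpha=0.001,
...                         decrease_const=0.00001,
...                         shuffle=True,
...                         minibatches=50,
...                         random_state=1)
>>> nn_tuned.fit(X_train, y_train, print_progress=True)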

Now, let's take a look at some of the images that our MLP struggles with:

>>> miscl_img = X_test[y_test != y_test_pred][:25]
>>> correct_lab = y_test[y_test != y_test_pred][:25]
>>> miscl_lab = y_test_pred[y_test != y_test_pred][:25]

>>> fig, ax = plt.subplots(nrows=5, 
...                        ncols=5, 
...                        sharex=True, 
...                        sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = miscl_img[i].reshape(28, 28)
...     ax[i].imshow(img, 
...                  cmap='Greys', 
...                  interpolation='nearest')
...     ax[i].set_title('%d) t: %d p: %d' 
...                     % (i+1, correct_lab[i], miscl_lab[i]))
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()

We should now see a $5 \times 5$ subplot matrix where the first number in the subtitles indicates the plot index, the second number indicates the true class label (t), and the third number stands for the predicted class label (p).

[Figure: A 5 x 5 grid of misclassified test images annotated with their true (t) and predicted (p) labels]

As we can see in the preceding figure, some of those images are even challenging for us humans to classify correctly. For example, we can see that the digit 9 is classified as a 3 or 8 if the lower part of the digit has a hook-like curvature (subplots 3, 16, and 17).

Obtaining the MNIST dataset

The MNIST dataset is publicly available at http://yann.lecun.com/exdb/mnist/ and consists of the following four parts:

  • Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, and 60,000 samples)
  • Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, and 60,000 labels)
  • Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB, unzipped and 10,000 samples)
  • Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, and 10,000 labels)

The MNIST dataset was constructed from two datasets of the US National Institute of Standards and Technology (NIST). The training set consists of handwritten digits from 250 different people, 50 percent high school students, and 50 percent employees from the Census Bureau. Note that the test set contains handwritten digits from different people following the same split.

After downloading the files, I recommend unzipping the files using the Unix/Linux gzip tool from the command line terminal for efficiency using the following command in your local MNIST download directory:

gzip *ubyte.gz -d

Alternatively, you could use your favorite unzipping tool if you are working with a machine running on Microsoft Windows. The images are stored in byte format, and we will read them into NumPy arrays that we will use to train and test our MLP implementation:

import os
import struct
import numpy as np

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, 
                               '%s-labels-idx1-ubyte' 
                                % kind)
    images_path = os.path.join(path, 
                               '%s-images-idx3-ubyte' 
                               % kind)
        
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', 
                                 lbpath.read(8))
        labels = np.fromfile(lbpath, 
                             dtype=np.uint8)

    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", 
                                               imgpath.read(16))
        images = np.fromfile(imgpath, 
                    dtype=np.uint8).reshape(len(labels), 784)
 
    return images, labels

The load_mnist function returns two arrays, the first being an Obtaining the MNIST dataset dimensional NumPy array (images), where Obtaining the MNIST dataset is the number of samples and Obtaining the MNIST dataset is the number of features. The training dataset consists of 60,000 training digits and the test set contains 10,000 samples, respectively. The images in the MNIST dataset consist of Obtaining the MNIST dataset pixels, and each pixel is represented by a gray scale intensity value. Here, we unroll the Obtaining the MNIST dataset pixels into 1D row vectors, which represent the rows in our image array (784 per row or image). The second array (labels) returned by the load_mnist function contains the corresponding target variable, the class labels (integers 0-9) of the handwritten digits.

The way we read in the image might seem a little bit strange at first:

magic, n = struct.unpack('>II', lbpath.read(8))
labels = np.fromfile(lbpath, dtype=np.int8)

To understand how these two lines of code work, let's take a look at the dataset description from the MNIST website:

Obtaining the MNIST dataset

Using the two lines of the preceding code, we first read in the magic number, which is a description of the file protocol as well as the number of items (n) from the file buffer before we read the following bytes into a NumPy array using the fromfile method. The fmt parameter value >II that we passed as an argument to struct.unpack has two parts:

  • >: This is the big-endian (defines the order in which a sequence of bytes is stored); if you are unfamiliar with the terms big-endian and small-endian, you can find an excellent article about Endianness on Wikipedia (https://en.wikipedia.org/wiki/Endianness).
  • I: This is an unsigned integer.

By executing the following code, we will now load the 60,000 training instances as well as the 10,000 test samples from the mnist directory where we unzipped the MNIST dataset:

>>> X_train, y_train = load_mnist('mnist', kind='train')
>>> print('Rows: %d, columns: %d' 
...        % (X_train.shape[0], X_train.shape[1]))
Rows: 60000, columns: 784

>>> X_test, y_test = load_mnist('mnist', kind='t10k')
>>> print('Rows: %d, columns: %d'
...        % (X_test.shape[0], X_test.shape[1]))
Rows: 10000, columns: 784

To get a idea what the images in MNIST look like, let's visualize examples of the digits 0-9 after reshaping the 784-pixel vectors from our feature matrix into the original 28 × 28 image that we can plot via matplotlib's imshow function:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(10):
...    img = X_train[y_train == i][0].reshape(28, 28)
...    ax[i].imshow(img, cmap='Greys', interpolation='nearest')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()

We should now see a plot of the Obtaining the MNIST dataset subfigures showing a representative image of each unique digit:

Obtaining the MNIST dataset

In addition, let's also plot multiple examples of the same digit to see how different those handwriting examples really are:

>>> fig, ax = plt.subplots(nrows=5, 
...                        ncols=5, 
...                        sharex=True, 
...                        sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = X_train[y_train == 7][i].reshape(28, 28)
...     ax[i].imshow(img, cmap='Greys', interpolation='nearest')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()

After executing the code, we should now see the first 25 variants of the digit 7.

Obtaining the MNIST dataset

Optionally, we can save the MNIST image data and labels as CSV files to open them in programs that do not support their special byte format. However, we should be aware that the CSV file format will take up substantially more space on your local drive, as listed here:

  • train_img.csv: 109.5 MB
  • train_labels.csv: 120 KB
  • test_img.csv: 18.3 MB
  • test_labels: 20 KB

If we decide to save those CSV files, we can execute the following code in our Python session after loading the MNIST data into NumPy arrays:

>>> np.savetxt('train_img.csv', X_train, 
...            fmt='%i', delimiter=',')
>>> np.savetxt('train_labels.csv', y_train,
...            fmt='%i', delimiter=',')
>>> np.savetxt('test_img.csv', X_test,
...            fmt='%i', delimiter=',')
>>> np.savetxt('test_labels.csv', y_test, 
...            fmt='%i', delimiter=',')

Once we have saved the CSV files, we can load them back into Python using NumPy's genfromtxt function:

>>> X_train = np.genfromtxt('train_img.csv', 
...                         dtype=int, delimiter=',')
>>> y_train = np.genfromtxt('train_labels.csv',
...                         dtype=int, delimiter=',')
>>> X_test = np.genfromtxt('test_img.csv',
...                        dtype=int, delimiter=',')
>>> y_test = np.genfromtxt('test_labels.csv',
...                        dtype=int, delimiter=',')

However, it will take substantially longer to load the MNIST data from the CSV files, thus I recommend you stick to the original byte format if possible.

Implementing a multi-layer perceptron

In this subsection, we will now implement the code of an MLP with one input, one hidden, and one output layer to classify the images in the MNIST dataset. I have tried to keep the code as simple as possible. However, it may seem a little bit complicated at first, and I encourage you to download the sample code for this chapter from the Packt Publishing website, where you can find this MLP implementation annotated with comments and syntax highlighting for better readability. If you are not running the code from the accompanying IPython notebook, I recommend you copy it into a Python script file in your current working directory, for example, neuralnet.py, which you can then import into your current Python session via the following command:

from neuralnet import NeuralNetMLP

The code will contain parts that we have not talked about yet, such as the backpropagation algorithm, but most of the code should look familiar to you based on the Adaline implementation in Chapter 2, Training Machine Learning Algorithms for Classification, and the discussion of forward propagation in earlier sections. Do not worry if not all of the code makes immediate sense to you; we will follow up on certain parts later in this chapter. However, going over the code at this stage can make it easier to follow the theory later.

import numpy as np
from scipy.special import expit
import sys

class NeuralNetMLP(object):
    def __init__(self, n_output, n_features, n_hidden=30,
                 l1=0.0, l2=0.0, epochs=500, eta=0.001, 
                 alpha=0.0, decrease_const=0.0, shuffle=True, 
                 minibatches=1, random_state=None):
        np.random.seed(random_state)
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w1, self.w2 = self._initialize_weights()
        self.l1 = l1
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _encode_labels(self, y, k):
        onehot = np.zeros((k, y.shape[0]))
        for idx, val in enumerate(y):
            onehot[val, idx] = 1.0
        return onehot

    def _initialize_weights(self):
        w1 = np.random.uniform(-1.0, 1.0,       
                     size=self.n_hidden*(self.n_features + 1))
        w1 = w1.reshape(self.n_hidden, self.n_features + 1)
        w2 = np.random.uniform(-1.0, 1.0,
                     size=self.n_output*(self.n_hidden + 1))
        w2 = w2.reshape(self.n_output, self.n_hidden + 1)
        return w1, w2

    def _sigmoid(self, z):
        # expit is equivalent to 1.0/(1.0 + np.exp(-z))
        return expit(z)

    def _sigmoid_gradient(self, z):
        sg = self._sigmoid(z)
        return sg * (1 - sg)

    def _add_bias_unit(self, X, how='column'):
        if how == 'column':
            X_new = np.ones((X.shape[0], X.shape[1]+1))
            X_new[:, 1:] = X
        elif how == 'row':
            X_new = np.ones((X.shape[0]+1, X.shape[1]))
            X_new[1:, :] = X
        else:
            raise AttributeError('`how` must be `column` or `row`')
        return X_new

    def _feedforward(self, X, w1, w2):
        a1 = self._add_bias_unit(X, how='column')
        z2 = w1.dot(a1.T)
        a2 = self._sigmoid(z2)
        a2 = self._add_bias_unit(a2, how='row')
        z3 = w2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, z2, a2, z3, a3

    def _L2_reg(self, lambda_, w1, w2):
        return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2)\
                + np.sum(w2[:, 1:] ** 2))

    def _L1_reg(self, lambda_, w1, w2):
        return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum()\
                + np.abs(w2[:, 1:]).sum())

    def _get_cost(self, y_enc, output, w1, w2):
        term1 = -y_enc * (np.log(output))
        term2 = (1 - y_enc) * np.log(1 - output)
        cost = np.sum(term1 - term2)
        L1_term = self._L1_reg(self.l1, w1, w2)
        L2_term = self._L2_reg(self.l2, w1, w2)
        cost = cost + L1_term + L2_term
        return cost

    def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2):
        # backpropagation
        sigma3 = a3 - y_enc
        z2 = self._add_bias_unit(z2, how='row')
        sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2)
        sigma2 = sigma2[1:, :]
        grad1 = sigma2.dot(a1)
        grad2 = sigma3.dot(a2.T)

        # regularize
        grad1[:, 1:] += (w1[:, 1:] * (self.l1 + self.l2))
        grad2[:, 1:] += (w2[:, 1:] * (self.l1 + self.l2))

        return grad1, grad2

    def predict(self, X):
        a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2)
        y_pred = np.argmax(z3, axis=0)
        return y_pred

    def fit(self, X, y, print_progress=False):
        self.cost_ = []
        X_data, y_data = X.copy(), y.copy()
        y_enc = self._encode_labels(y, self.n_output)

        delta_w1_prev = np.zeros(self.w1.shape)
        delta_w2_prev = np.zeros(self.w2.shape)

        for i in range(self.epochs):

            # adaptive learning rate
            self.eta /= (1 + self.decrease_const*i)

            if print_progress:
                sys.stderr.write(
                        '\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()

            if self.shuffle:
                idx = np.random.permutation(y_data.shape[0])
                X_data, y_data = X_data[idx], y_data[idx]

            mini = np.array_split(range(
                         y_data.shape[0]), self.minibatches)
            for idx in mini:

                # feedforward
                a1, z2, a2, z3, a3 = self._feedforward(
                                     X[idx], self.w1, self.w2)
                cost = self._get_cost(y_enc=y_enc[:, idx],
                                      output=a3,
                                      w1=self.w1,
                                      w2=self.w2)
                self.cost_.append(cost)

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(a1=a1, a2=a2,
                                            a3=a3, z2=z2,
                                            y_enc=y_enc[:, idx],
                                            w1=self.w1,
                                            w2=self.w2)

                # update weights
                delta_w1, delta_w2 = self.eta * grad1,\
                                     self.eta * grad2
                self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev))
                delta_w1_prev, delta_w2_prev = delta_w1, delta_w2

        return self

Now, let's initialize a new 784-50-10 MLP, a neural network with 784 input units (n_features), 50 hidden units (n_hidden), and 10 output units (n_output):

>>> nn = NeuralNetMLP(n_output=10, 
...                   n_features=X_train.shape[1], 
...                   n_hidden=50, 
...                   l2=0.1, 
...                   l1=0.0, 
...                   epochs=1000, 
...                   eta=0.001,
...                   alpha=0.001,
...                   decrease_const=0.00001,
...                   shuffle=True,
...                   minibatches=50, 
...                   random_state=1)

As you may have noticed, by going over our preceding MLP implementation, we also implemented some additional features, which are summarized here:

  • l2: The $\lambda$ parameter for L2 regularization to decrease the degree of overfitting; equivalently, l1 is the $\lambda$ parameter for L1 regularization.
  • epochs: The number of passes over the training set.
  • eta: The learning rate $\eta$.
  • alpha: A parameter for momentum learning to add a factor of the previous gradient to the weight update for faster learning, $\Delta \mathbf{w}_t = \eta \nabla J(\mathbf{w}_t) + \alpha \Delta \mathbf{w}_{t-1}$ (where $t$ is the current time step or epoch).
  • decrease_const: The decrease constant $d$ for an adaptive learning rate $\eta / (1 + t \times d)$ that decreases over time for better convergence (both momentum and the adaptive learning rate are illustrated in the short sketch after this list).
  • shuffle: Shuffling the training set prior to every epoch to prevent the algorithm from getting stuck in cycles.
  • minibatches: Splitting of the training data into k mini-batches in each epoch. The gradient is computed for each mini-batch separately instead of the entire training data for faster learning.
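
The following minimal, standalone sketch (with a made-up gradient value, not part of the NeuralNetMLP class) shows how the decrease constant shrinks the learning rate over the epochs and how the momentum term adds a fraction of the previous weight update:

eta, decrease_const, alpha = 0.001, 0.00001, 0.001

delta_w_prev = 0.0                        # previous weight update (momentum memory)
for epoch in range(3):
    eta /= (1 + decrease_const * epoch)   # adaptive learning rate decays over time
    grad = 0.5                            # hypothetical gradient value
    delta_w = eta * grad                  # plain gradient step
    update = delta_w + alpha * delta_w_prev   # add momentum from the previous step
    delta_w_prev = delta_w
    print('epoch %d: eta=%.8f, update=%.8f' % (epoch + 1, eta, update))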

Next, we train the MLP using 60,000 samples from the already shuffled MNIST training dataset. Before you execute the following code, please note that training the neural network may take 10-30 minutes on standard desktop computer hardware:

>>> nn.fit(X_train, y_train, print_progress=True)
Epoch: 1000/1000

Similar to our previous Adaline implementation, we save the cost in a cost_ list that we can now visualize to make sure that the optimization algorithm reached convergence. Note that the cost is recorded once per mini-batch rather than once per epoch, so the list contains 50 mini-batches × 1,000 epochs = 50,000 values; this is why the x axis of the following plot is labeled Epochs * 50. The code is as follows:

>>> plt.plot(range(len(nn.cost_)), nn.cost_)
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs * 50')
>>> plt.tight_layout()
>>> plt.show()

As we see in the following plot, the graph of the cost function looks very noisy. This is due to the fact that we trained our neural network with mini-batch learning, a variant of stochastic gradient descent.

[Figure: plot of the cost_ values over all 50,000 mini-batch updates]

Although we can already see in the plot that the optimization algorithm converged after approximately 800 epochs (40,000/50 = 800), let's plot a smoother version of the cost function against the number of epochs by averaging over the mini-batch intervals. The code is as follows:

>>> batches = np.array_split(range(len(nn.cost_)), 1000)
>>> cost_ary = np.array(nn.cost_)
>>> cost_avgs = [np.mean(cost_ary[i]) for i in batches]

>>> plt.plot(range(len(cost_avgs)),
...          cost_avgs, 
...          color='red')
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs')
>>> plt.tight_layout()
>>> plt.show()

The following plot gives us a clearer picture indicating that the training algorithm converged shortly after the 800th epoch:

[Figure: cost averaged over the 50 mini-batches of each epoch, plotted against the 1,000 epochs]

Now, let's evaluate the performance of the model by calculating the prediction accuracy:

>>> y_train_pred = nn.predict(X_train)
>>> acc = np.sum(y_train == y_train_pred, axis=0) / X_train.shape[0]
>>> print('Training accuracy: %.2f%%' % (acc * 100))
Training accuracy: 97.74%

As we can see, the model classifies most of the training digits correctly, but how does it generalize to data that it has not seen before? Let's calculate the accuracy on 10,000 images in the test dataset:

>>> y_test_pred = nn.predict(X_test)
>>> acc = np.sum(y_test == y_test_pred, axis=0) / X_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (acc * 100))
Test accuracy: 96.18%

Based on the small discrepancy between training and test accuracy, we can conclude that the model only slightly overfits the training data. To further fine-tune the model, we could change the number of hidden units, the values of the regularization parameters, the learning rate, or the decrease constant for the adaptive learning rate, using the techniques that we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning (a minimal sketch of such a search is shown below; a thorough tuning is left as an exercise for the reader).
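
The following sketch shows one possible manual search over n_hidden and l2 using a simple hold-out validation split; the candidate values and the 55,000/5,000 split are illustrative assumptions, and each candidate model takes a while to train:

import numpy as np

X_tr, y_tr = X_train[:55000], y_train[:55000]
X_val, y_val = X_train[55000:], y_train[55000:]

best_acc, best_params = 0.0, None
for n_hidden in (30, 50):
    for l2 in (0.01, 0.1):
        model = NeuralNetMLP(n_output=10, n_features=X_tr.shape[1],
                             n_hidden=n_hidden, l2=l2, l1=0.0,
                             epochs=200, eta=0.001, alpha=0.001,
                             decrease_const=0.00001, shuffle=True,
                             minibatches=50, random_state=1)
        model.fit(X_tr, y_tr)
        acc = np.mean(y_val == model.predict(X_val))       # hold-out accuracy
        if acc > best_acc:
            best_acc, best_params = acc, (n_hidden, l2)
print('Best validation accuracy: %.2f%%' % (best_acc * 100),
      'with (n_hidden, l2) =', best_params)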

Now, let's take a look at some of the images that our MLP struggles with:

>>> miscl_img = X_test[y_test != y_test_pred][:25]
>>> correct_lab = y_test[y_test != y_test_pred][:25]
>>> miscl_lab= y_test_pred[y_test != y_test_pred][:25]

>>> fig, ax = plt.subplots(nrows=5, 
...                        ncols=5, 
...                        sharex=True, 
...                        sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = miscl_img[i].reshape(28, 28)
...     ax[i].imshow(img, 
...                  cmap='Greys', 
...                  interpolation='nearest')
...     ax[i].set_title('%d) t: %d p: %d' 
...                     % (i+1, correct_lab[i], miscl_lab[i]))
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()

We should now see a 5×5 subplot matrix where the first number in the subtitles indicates the plot index, the second number indicates the true class label (t), and the third number stands for the predicted class label (p).

[Figure: a 5×5 matrix of misclassified test images with their true (t) and predicted (p) labels]

As we can see in the preceding figure, some of those images are even challenging for us humans to classify correctly. For example, we can see that the digit 9 is classified as a 3 or 8 if the lower part of the digit has a hook-like curvature (subplots 3, 16, and 17).

Training an artificial neural network

Now that we have seen a neural network in action and have gained a basic understanding of how it works by looking over the code, let's dig a little bit deeper into some of the concepts, such as the logistic cost function and the backpropagation algorithm that we implemented to learn the weights.

Computing the logistic cost function

The logistic cost function that we implemented as the _get_cost method is actually pretty simple to follow since it is the same cost function that we described in the logistic regression section in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn.

$$J(\mathbf{w}) = -\sum_{i=1}^{n} y^{(i)}\log\big(a^{(i)}\big) + \big(1 - y^{(i)}\big)\log\big(1 - a^{(i)}\big)$$

Here, $a^{(i)}$ is the sigmoid activation of the $i$th unit in one of the layers, which we compute in the forward propagation step:

$$a^{(i)} = \phi\big(z^{(i)}\big)$$

Now, let's add a regularization term, which allows us to reduce the degree of overfitting. As you will recall from earlier chapters, the L2 and L1 regularization terms are defined as follows (remember that we don't regularize the bias units):

$$L2: \quad \frac{\lambda}{2}\lVert\mathbf{w}\rVert_2^2 = \frac{\lambda}{2}\sum_{j=1}^{m} w_j^2, \qquad L1: \quad \frac{\lambda}{2}\lVert\mathbf{w}\rVert_1 = \frac{\lambda}{2}\sum_{j=1}^{m} \lvert w_j\rvert$$

Although our MLP implementation supports both L1 and L2 regularization, we will now only focus on the L2 regularization term for simplicity. However, the same concepts apply to the L1 regularization term. By adding the L2 regularization term to our logistic cost function, we obtain the following equation:

$$J(\mathbf{w}) = -\left[\sum_{i=1}^{n} y^{(i)}\log\big(a^{(i)}\big) + \big(1 - y^{(i)}\big)\log\big(1 - a^{(i)}\big)\right] + \frac{\lambda}{2}\lVert\mathbf{w}\rVert_2^2$$

Since we implemented an MLP for multi-class classification, this returns an output vector of $t$ elements, which we need to compare with the $t \times 1$ dimensional target vector in the one-hot encoding representation. For example, the activation of the third layer and the target class (here: class 2) for a particular sample may look like this:

$$a^{(3)} = \begin{bmatrix} a_1^{(3)} \\ a_2^{(3)} \\ \vdots \\ a_t^{(3)} \end{bmatrix}, \qquad y = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}$$

Thus, we need to generalize the logistic cost function to all $t$ activation units in our network. So our cost function (without the regularization term) becomes:

$$J(\mathbf{w}) = -\sum_{i=1}^{n}\sum_{j=1}^{t} y_j^{(i)}\log\big(a_j^{(i)}\big) + \big(1 - y_j^{(i)}\big)\log\big(1 - a_j^{(i)}\big)$$

Here, the superscript $i$ is the index of a particular sample in our training set.
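
As a quick numeric illustration (toy activation values, $t = 3$ output units, a single sample), the following snippet evaluates the double sum above the same way the _get_cost method does, without the regularization terms:

import numpy as np

a3 = np.array([0.1, 0.9, 0.3])    # hypothetical output activations for one sample
y = np.array([0.0, 1.0, 0.0])     # one-hot encoded target (class 2)

cost = np.sum(-y * np.log(a3) - (1 - y) * np.log(1 - a3))
print(cost)    # ~0.567, that is -log(0.9) - log(0.9) - log(0.7)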

The following generalized regularization term may look a little bit complicated at first, but here we are just calculating the sum of all weights of a layer $l$, excluding the bias term (which we added to the first column):

$$J(\mathbf{W}) = -\left[\sum_{i=1}^{n}\sum_{j=1}^{t} y_j^{(i)}\log\big(a_j^{(i)}\big) + \big(1 - y_j^{(i)}\big)\log\big(1 - a_j^{(i)}\big)\right] + \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{u_l}\sum_{j=1}^{u_{l+1}}\big(w_{j,i}^{(l)}\big)^2$$

Here, $u_l$ refers to the number of units in a given layer $l$, and the following expression represents the L2-penalty term:

$$\frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{u_l}\sum_{j=1}^{u_{l+1}}\big(w_{j,i}^{(l)}\big)^2$$

Remember that our goal is to minimize the cost function $J(\mathbf{W})$. Thus, we need to calculate the partial derivatives of $J(\mathbf{W})$ with respect to each weight in every layer of the network:

$$\frac{\partial}{\partial w_{j,i}^{(l)}} J(\mathbf{W})$$

In the next section, we will talk about the backpropagation algorithm, which allows us to calculate these partial derivatives to minimize the cost function.

Note that $\mathbf{W}$ consists of multiple matrices. In a multi-layer perceptron with one hidden layer, we have the weight matrix $\mathbf{W}^{(1)}$, which connects the input to the hidden layer, and $\mathbf{W}^{(2)}$, which connects the hidden layer to the output layer. An intuitive visualization of the matrix $\mathbf{W}$ is provided in the following figure:

[Figure: an intuitive visualization of the matrix $\mathbf{W}$, composed of $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$]

In this simplified figure, it may seem that both $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ have the same number of rows and columns, which is typically not the case unless we initialize an MLP with the same number of hidden units, output units, and input features.

If this sounds confusing, stay tuned for the next section, where we will discuss the dimensionality of $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ in more detail in the context of the backpropagation algorithm.
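
For a concrete look at these dimensions, we can inspect the weight matrices of the nn object that we initialized earlier in this chapter (assuming it is still in our Python session); the extra column in each matrix accounts for the bias unit:

>>> print(nn.w1.shape)    # n_hidden x (n_features + 1)
(50, 785)
>>> print(nn.w2.shape)    # n_output x (n_hidden + 1)
(10, 51)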

Training neural networks via backpropagation

In this section, we will go through the math of backpropagation to understand how you can learn the weights in a neural network very efficiently. Depending on how comfortable you are with mathematical representations, the following equations may seem relatively complicated at first. Many people prefer a bottom-up approach and like to go over the equations step by step to develop an intuition for algorithms. However, if you prefer a top-down approach and want to learn about backpropagation without all the mathematical notation, I recommend that you read the next section, Developing your intuition for backpropagation, first and revisit this section later.

In the previous section, we saw how to calculate the cost as the difference between the activation of the last layer and the target class label. Now, we will see how the backpropagation algorithm works to update the weights in our MLP model, which we implemented in the _get_gradient method. As we recall from the beginning of this chapter, we first need to apply forward propagation in order to obtain the activation of the output layer, which we formulated as follows:

$$\mathbf{Z}^{(2)} = \mathbf{W}^{(1)}\big[\mathbf{A}^{(1)}\big]^T$$
$$\mathbf{A}^{(2)} = \phi\big(\mathbf{Z}^{(2)}\big)$$
$$\mathbf{Z}^{(3)} = \mathbf{W}^{(2)}\mathbf{A}^{(2)}$$
$$\mathbf{A}^{(3)} = \phi\big(\mathbf{Z}^{(3)}\big)$$

Concisely, we just forward propagate the input features through the connections in the network as shown here:

[Figure: forward propagation of the input features through the network from the input layer to the output layer]

In backpropagation, we propagate the error from right to left. We start by calculating the error vector of the output layer:

$$\boldsymbol{\delta}^{(3)} = \mathbf{a}^{(3)} - \mathbf{y}$$

Here, $\mathbf{y}$ is the vector of the true class labels.

Next, we calculate the error term of the hidden layer:

$$\boldsymbol{\delta}^{(2)} = \big[\mathbf{W}^{(2)}\big]^T\boldsymbol{\delta}^{(3)} * \frac{\partial\phi\big(z^{(2)}\big)}{\partial z^{(2)}}$$

Here, $\frac{\partial\phi(z^{(2)})}{\partial z^{(2)}}$ is simply the derivative of the sigmoid activation function, which we implemented as _sigmoid_gradient:

$$\frac{\partial\phi(z)}{\partial z} = \phi(z)\big(1 - \phi(z)\big) = a\,(1 - a)$$

Note that the asterisk symbol ($*$) means element-wise multiplication in this context.

Note

Although it is not important to follow the next equations, you may be curious as to how I obtained the derivative of the activation function. I have summarized the derivation step by step here:

$$\phi'(z) = \frac{\partial}{\partial z}\left(\frac{1}{1 + e^{-z}}\right)$$
$$= \frac{e^{-z}}{\big(1 + e^{-z}\big)^2}$$
$$= \frac{1 + e^{-z}}{\big(1 + e^{-z}\big)^2} - \left(\frac{1}{1 + e^{-z}}\right)^2$$
$$= \frac{1}{1 + e^{-z}} - \left(\frac{1}{1 + e^{-z}}\right)^2$$
$$= \phi(z) - \big(\phi(z)\big)^2$$
$$= \phi(z)\big(1 - \phi(z)\big)$$
$$= a\,(1 - a)$$
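
A quick numerical sanity check of this result (a toy value of $z$, not from the book) compares a symmetric finite difference of the sigmoid with the closed-form derivative:

from scipy.special import expit

z, eps = 0.5, 1e-6
numeric = (expit(z + eps) - expit(z - eps)) / (2 * eps)   # finite difference
analytic = expit(z) * (1 - expit(z))                      # phi(z) * (1 - phi(z))
print(numeric, analytic)    # both ~0.2350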

To better understand how we compute the $\boldsymbol{\delta}^{(2)}$ term, let's walk through it in more detail. In the preceding equation, we multiplied the transpose $[\mathbf{W}^{(2)}]^T$ of the $t \times h$ dimensional matrix $\mathbf{W}^{(2)}$ ($t$ is the number of output class labels and $h$ is the number of hidden units) by the $t \times 1$ dimensional error vector $\boldsymbol{\delta}^{(3)}$, which yields an $h \times 1$ dimensional vector. We then performed a pair-wise multiplication between this vector and the sigmoid gradient $\frac{\partial\phi(z^{(2)})}{\partial z^{(2)}}$, which is also an $h \times 1$ dimensional vector. Eventually, after obtaining the $\boldsymbol{\delta}$ terms, we can write the derivative of the cost function as follows:

$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(\mathbf{W}) = a_j^{(l)}\,\delta_i^{(l+1)}$$

Next, we need to accumulate the partial derivative of every $j$th node in layer $l$ and the $i$th error of the node in layer $l+1$:

$$\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)}\,\delta_i^{(l+1)}$$

Remember that we need to compute $\Delta_{i,j}^{(l)}$ for every sample in the training set. Thus, it is easier to implement it as a vectorized version, as in our preceding MLP code implementation:

$$\boldsymbol{\Delta}^{(1)} = \boldsymbol{\delta}^{(2)}\,\mathbf{A}^{(1)}$$
$$\boldsymbol{\Delta}^{(2)} = \boldsymbol{\delta}^{(3)}\,\big[\mathbf{A}^{(2)}\big]^T$$
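
The following standalone shape check (hypothetical sizes; random values stand in for real activations, and the bias rows and columns are included as in the NeuralNetMLP code) confirms that the vectorized gradients have the same dimensions as the weight matrices:

import numpy as np

n, m, h, t = 4, 6, 5, 3                   # samples, features, hidden units, output units
a1 = np.random.rand(n, m + 1)             # input activations incl. bias column
z2 = np.random.rand(h + 1, n)             # hidden net input incl. bias row
a2 = np.random.rand(h + 1, n)             # hidden activations incl. bias row
a3 = np.random.rand(t, n)                 # output activations
y_enc = np.eye(t)[:, np.random.randint(t, size=n)]   # one-hot targets, shape (t, n)
w2 = np.random.rand(t, h + 1)

sigma3 = a3 - y_enc                                    # (t x n) output-layer error
sig = 1.0 / (1.0 + np.exp(-z2))
sigma2 = (w2.T.dot(sigma3) * sig * (1 - sig))[1:, :]   # drop bias row -> (h x n)
grad1 = sigma2.dot(a1)                                 # (h x (m + 1)), same shape as w1
grad2 = sigma3.dot(a2.T)                               # (t x (h + 1)), same shape as w2
print(grad1.shape, grad2.shape)                        # (5, 7) (3, 6)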

After we have accumulated the partial derivatives, we can add the regularization term as follows:

$$\boldsymbol{\Delta}^{(l)} := \boldsymbol{\Delta}^{(l)} + \lambda\,\mathbf{W}^{(l)} \quad \text{(except for the bias column)}$$

Lastly, after we have computed the gradients, we can update the weights by taking a step in the direction opposite to the gradient:

$$\mathbf{W}^{(l)} := \mathbf{W}^{(l)} - \eta\,\boldsymbol{\Delta}^{(l)}$$

To bring everything together, let's summarize backpropagation in the following figure:

[Figure: summary of backpropagation, propagating the error from the output layer back through the network to update the weights]

Developing your intuition for backpropagation

Although backpropagation was rediscovered and popularized almost 30 years ago, it still remains one of the most widely used algorithms to train artificial neural networks very efficiently. In this section, we'll see a more intuitive summary and the bigger picture of how this fascinating algorithm works.

In essence, backpropagation is just a very computationally efficient approach to compute the derivatives of a complex cost function. Our goal is to use those derivatives to learn the weight coefficients for parameterizing a multi-layer artificial neural network. The challenge in the parameterization of neural networks is that we are typically dealing with a very large number of weight coefficients in a high-dimensional feature space. In contrast to other cost functions that we have seen in previous chapters, the error surface of a neural network cost function is not convex or smooth. There are many bumps in this high-dimensional cost surface (local minima) that we have to overcome in order to find the global minimum of the cost function.

You may recall the concept of the chain rule from your introductory calculus classes. The chain rule is an approach to computing the derivative of a complex, nested function, for example, $f(g(x))$, by breaking it down into basic components:

$$\frac{d}{dx}\,f\big(g(x)\big) = \frac{df}{dg}\cdot\frac{dg}{dx}$$

In the context of computer algebra, a set of techniques has been developed to solve such problems very efficiently, which is also known as automatic differentiation. If you are interested in learning more about automatic differentiation in machine learning applications, I recommend you to refer to the following resource: A. G. Baydin and B. A. Pearlmutter. Automatic Differentiation of Algorithms for Machine Learning. arXiv preprint arXiv:1404.7456, 2014, which is freely available on arXiv at http://arxiv.org/pdf/1404.7456.pdf.

Automatic differentiation comes in two modes, the forward and the reverse mode. Backpropagation is simply a special case of reverse-mode automatic differentiation. The key point is that applying the chain rule in the forward mode can be quite expensive since we would have to multiply large matrices for each layer (Jacobians) that we eventually multiply by a vector to obtain the output. The trick of the reverse mode is that we start from right to left: we multiply a matrix by a vector, which yields another vector that is multiplied by the next matrix, and so on. Matrix-vector multiplication is computationally much cheaper than matrix-matrix multiplication, which is why backpropagation is one of the most popular algorithms used in neural network training.
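
The following toy example (arbitrary matrix sizes) illustrates the point: chaining matrix-vector products from the right gives the same result as forming the full matrix-matrix product first, but with far less work:

import numpy as np

A = np.random.rand(500, 500)    # stands in for the Jacobian of one layer
B = np.random.rand(500, 500)    # stands in for the Jacobian of the next layer
v = np.random.rand(500)         # vector coming from the output side

forward_style = (A.dot(B)).dot(v)   # matrix-matrix product first: O(n^3) work
reverse_style = A.dot(B.dot(v))     # matrix-vector products only: O(n^2) work
print(np.allclose(forward_style, reverse_style))    # True: same result, less work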

Debugging neural networks with gradient checking

Implementations of artificial neural networks can be quite complex, and it is always a good idea to manually check that we have implemented backpropagation correctly. In this section, we will talk about a simple procedure called gradient checking, which is essentially a comparison between our analytical gradients in the network and numerical gradients. Gradient checking is not specific to feedforward neural networks but can be applied to any other neural network architecture that uses gradient-based optimization. Even if you are planning to implement simpler algorithms trained with gradient-based optimization, such as linear regression, logistic regression, and support vector machines, it is generally not a bad idea to check whether the gradients are computed correctly.

In the previous sections, we defined a cost function $J(\mathbf{W})$, where $\mathbf{W}$ is the matrix of the weight coefficients of an artificial network. Note that $\mathbf{W}$ is, roughly speaking, a "stacked" matrix consisting of the matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ in a multi-layer perceptron with one hidden layer. We defined $\mathbf{W}^{(1)}$ as the $h \times (m+1)$ dimensional matrix that connects the input layer to the hidden layer, where $h$ is the number of hidden units and $m$ is the number of features (input units); the additional column accounts for the bias unit. The matrix $\mathbf{W}^{(2)}$ that connects the hidden layer to the output layer has the dimensions $t \times (h+1)$, where $t$ is the number of output units. We then calculated the derivative of the cost function with respect to a weight $w_{i,j}^{(l)}$ as follows:

$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(\mathbf{W})$$

Remember that we are updating the weights by taking a step in the direction opposite to the gradient. In gradient checking, we compare this analytical solution to a numerically approximated gradient:

$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(\mathbf{W}) \approx \frac{J\big(w_{i,j}^{(l)} + \epsilon\big) - J\big(w_{i,j}^{(l)}\big)}{\epsilon}$$

Here, $\epsilon$ is typically a very small number, for example 1e-5 (note that 1e-5 is just a more convenient notation for 0.00001). Intuitively, we can think of this finite difference approximation as the slope of the secant line connecting the points of the cost function for the two weights $w$ and $w + \epsilon$ (both are scalar values), as shown in the following figure. We are omitting the superscripts and subscripts for simplicity.

[Figure: the secant line through $J(w)$ and $J(w + \epsilon)$ approximating the slope of the cost function at $w$]

An even better approach that yields a more accurate approximation of the gradient is to compute the symmetric (or centered) difference quotient given by the two-point formula:

$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(\mathbf{W}) \approx \frac{J\big(w_{i,j}^{(l)} + \epsilon\big) - J\big(w_{i,j}^{(l)} - \epsilon\big)}{2\epsilon}$$
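
To see why the symmetric difference quotient is preferable, here is a toy comparison on the function $f(w) = w^3$ at $w = 2.0$ (exact derivative $3w^2 = 12$); the values are chosen purely for illustration:

def f(w):
    return w ** 3

w, eps = 2.0, 1e-5
one_sided = (f(w + eps) - f(w)) / eps              # forward difference
symmetric = (f(w + eps) - f(w - eps)) / (2 * eps)  # symmetric difference
print(one_sided, symmetric)    # the symmetric quotient is much closer to 12.0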

Typically, the difference between the numerical gradient $J'_{num}$ and the analytical gradient $J'_{ana}$ is then calculated as an L2 vector norm. For practical reasons, we unroll the computed gradient matrices into flat vectors so that we can calculate the error (the difference between the gradient vectors) more conveniently:

$$error = \big\lVert J'_{num} - J'_{ana} \big\rVert_2$$

One problem is that the error is not scale invariant (small errors are more significant if the weight vector norms are small too). Thus, it is recommended to calculate a normalized difference:

$$relative\ error = \frac{\big\lVert J'_{num} - J'_{ana}\big\rVert_2}{\big\lVert J'_{num}\big\rVert_2 + \big\lVert J'_{ana}\big\rVert_2}$$

Now, we want the relative error between the numerical gradient and the analytical gradient to be as small as possible. Before we implement gradient checking, we need to discuss one more detail: what is the acceptable error threshold to pass the gradient check? The relative error threshold for passing the gradient check depends on the complexity of the network architecture. As a rule of thumb, the more hidden layers we add, the larger the difference between the numerical and analytical gradient can become if backpropagation is implemented correctly. Since we have implemented a relatively simple neural network architecture in this chapter, we want to be rather strict about the threshold and define the following rules:

  • Relative error <= 1e-7 means everything is okay!
  • Relative error <= 1e-4 means the condition is problematic, and we should look into it.
  • Relative error > 1e-4 means there is probably something wrong in our code.

Now that we have established these ground rules, let's implement gradient checking. To do so, we can simply take the NeuralNetMLP class that we implemented previously and add the following method to the class body:

def _gradient_checking(self, X, y_enc, w1, 
                       w2, epsilon, grad1, grad2):
    """ Apply gradient checking (for debugging only)

    Returns
    ---------
    relative_error : float
      Relative error between the numerically
      approximated gradients and the backpropagated gradients.

    """
    num_grad1 = np.zeros(np.shape(w1))
    epsilon_ary1 = np.zeros(np.shape(w1))
    for i in range(w1.shape[0]):
        for j in range(w1.shape[1]):
            epsilon_ary1[i, j] = epsilon
            a1, z2, a2, z3, a3 = self._feedforward(
                                           X, 
                                           w1 - epsilon_ary1, 
                                           w2)
            cost1 = self._get_cost(y_enc, 
                                   a3, 
                                   w1-epsilon_ary1, 
                                   w2)
            a1, z2, a2, z3, a3 = self._feedforward(
                                         X, 
                                         w1 + epsilon_ary1, 
                                         w2)
            cost2 = self._get_cost(y_enc, 
                                   a3, 
                                   w1 + epsilon_ary1, 
                                   w2)
            num_grad1[i, j] = (cost2 - cost1) / (2 * epsilon)
            epsilon_ary1[i, j] = 0

    num_grad2 = np.zeros(np.shape(w2))
    epsilon_ary2 = np.zeros(np.shape(w2))
    for i in range(w2.shape[0]):
        for j in range(w2.shape[1]):
            epsilon_ary2[i, j] = epsilon
            a1, z2, a2, z3, a3 = self._feedforward(
                                            X, 
                                            w1, 
                                            w2 - epsilon_ary2)
            cost1 = self._get_cost(y_enc, 
                                   a3, 
                                   w1, 
                                   w2 - epsilon_ary2)
            a1, z2, a2, z3, a3 = self._feedforward(
                                          X, 
                                          w1, 
                                          w2 + epsilon_ary2)
            cost2 = self._get_cost(y_enc, 
                                   a3, 
                                   w1, 
                                   w2 + epsilon_ary2)
            num_grad2[i, j] = (cost2 - cost1) / (2 * epsilon)
            epsilon_ary2[i, j] = 0

    num_grad = np.hstack((num_grad1.flatten(),
                          num_grad2.flatten()))
    grad = np.hstack((grad1.flatten(), grad2.flatten()))
    norm1 = np.linalg.norm(num_grad - grad)
    norm2 = np.linalg.norm(num_grad)
    norm3 = np.linalg.norm(grad)
    relative_error = norm1 / (norm2 + norm3)
    return relative_error 

The _gradient_checking code seems rather simple. However, my personal recommendation is to keep it as simple as possible. Our goal is to double-check the gradient computation, so we want to make sure that we do not introduce any additional mistakes in gradient checking by writing efficient but complex code. Next, we only need to make a small modification to the fit method. In the following code, I omitted the code at the beginning of the fit function for clarity, and the only lines that we need to add to the method are implemented between the comments ## start gradient checking and ## end gradient checking:

class MLPGradientCheck(object):
    [...]
    def fit(self, X, y, print_progress=False):
        [...] 
                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(
                                       a1=a1, 
                                       a2=a2,
                                       a3=a3, 
                                       z2=z2,
                                       y_enc=y_enc[:, idx],
                                       w1=self.w1,
                                       w2=self.w2)
                
                ## start gradient checking

                grad_diff = self._gradient_checking(
                                     X=X[idx],
                                     y_enc=y_enc[:, idx],
                                     w1=self.w1, 
                                     w2=self.w2,
                                     epsilon=1e-5,
                                     grad1=grad1, 
                                     grad2=grad2)
                if grad_diff <= 1e-7:
                    print('Ok: %s' % grad_diff)
                elif grad_diff <= 1e-4:
                    print('Warning: %s' % grad_diff)
                else:
                    print('PROBLEM: %s' % grad_diff)
              
                ## end gradient checking
                
                # update weights; [alpha * delta_w_prev] 
                # for momentum learning
                delta_w1 = self.eta * grad1
                delta_w2 = self.eta * grad2
                self.w1 -= (delta_w1 +\
                           (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 +\
                           (self.alpha * delta_w2_prev))
                delta_w1_prev = delta_w1
                delta_w2_prev = delta_w2

        return self

Assuming that we named our modified multi-layer perceptron class MLPGradientCheck, we can now initialize a new MLP with 10 hidden units. Also, we disable regularization, adaptive learning, and momentum learning. In addition, we use regular (batch) gradient descent by setting minibatches to 1. The code is as follows:

>>> nn_check = MLPGradientCheck(n_output=10, 
                                n_features=X_train.shape[1], 
                                n_hidden=10, 
                                l2=0.0, 
                                l1=0.0, 
                                epochs=10, 
                                eta=0.001,
                                alpha=0.0,
                                decrease_const=0.0,
                                minibatches=1, 
                                random_state=1)

One downside of gradient checking is that it is computationally very, very expensive. Training a neural network with gradient checking enabled is so slow that we really only want to use it for debugging purposes. For this reason, it is not uncommon to run gradient checking only on a handful of training samples (here, we choose 5). The code is as follows:

>>> nn_check.fit(X_train[:5], y_train[:5], print_progress=False)
Ok: 2.56712936241e-10
Ok: 2.94603251069e-10
Ok: 2.37615620231e-10
Ok: 2.43469423226e-10
Ok: 3.37872073158e-10
Ok: 3.63466384861e-10
Ok: 2.22472120785e-10
Ok: 2.33163708438e-10
Ok: 3.44653686551e-10
Ok: 2.17161707211e-10 

As we can see from the code output, our multi-layer perceptron passes this test with excellent results.

Convergence in neural networks

You might be wondering why we did not use regular gradient descent but mini-batch learning to train our neural network for the handwritten digit classification. You may recall our discussion on stochastic gradient descent, which we used to implement online learning. In online learning, we compute the gradient based on a single training example at a time to perform the weight update. Although this is a stochastic approach, it often leads to very accurate solutions with a much faster convergence than regular gradient descent. Mini-batch learning is a special form of stochastic gradient descent in which we compute the gradient based on a subset of $k$ training samples out of the $n$ training samples, where $1 < k < n$. Mini-batch learning has the advantage over online learning that we can make use of our vectorized implementations to improve computational efficiency, while we can still update the weights much faster than in regular gradient descent. Intuitively, you can think of mini-batch learning as predicting the vote turnout of a presidential election from a poll by asking only a representative subset of the population rather than asking the entire population.
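
As a minimal illustration (toy numbers, mirroring the np.array_split call in the fit method), here is how n = 10 sample indices are split into k = 3 mini-batches:

import numpy as np

indices = np.arange(10)
for batch in np.array_split(indices, 3):
    print(batch)    # [0 1 2 3], then [4 5 6], then [7 8 9]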

In addition, we added more tuning parameters such as the decrease constant and a parameter for an adaptive learning rate. The reason is that neural networks are much harder to train than simpler algorithms such as Adaline, logistic regression, or support vector machines. In multi-layer neural networks, we typically have hundreds, thousands, or even billions of weights that we need to optimize. Unfortunately, the cost function has a rough surface, and the optimization algorithm can easily become trapped in local minima, as shown in the following figure:

[Figure: simplified cost surface over a single weight, illustrating a local minimum and the global cost minimum]

Note that this representation is extremely simplified, since our neural network has many dimensions, which makes it impossible to visualize the actual cost surface. Here, we only show the cost surface for a single weight on the x axis. The main message, however, is that we do not want our algorithm to get trapped in local minima. By increasing the learning rate, we can more readily escape such local minima. On the other hand, we also increase the chance of overshooting the global optimum if the learning rate is too large. Since we initialize the weights randomly, we start with a solution to the optimization problem that is typically hopelessly wrong. The decrease constant, which we defined earlier, helps us to climb down the cost surface faster in the beginning, and the adaptive learning rate allows us to better anneal to the cost minimum.
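To illustrate how a decrease constant anneals the learning rate over time, the following sketch uses one common schedule, dividing the initial learning rate by one plus the decrease constant times the epoch count; the exact schedule inside the MLP class may differ, so treat this as an example rather than the implementation:

eta0 = 0.001           # initial learning rate
decrease_const = 1e-5  # hypothetical decrease constant

for epoch in [0, 1000, 10000, 100000]:
    eta = eta0 / (1.0 + decrease_const * epoch)
    print('epoch %6d: eta = %.6f' % (epoch, eta))

Early in training, the effective learning rate stays close to eta0, allowing large steps down the cost surface; later, the smaller steps help the algorithm settle into a minimum.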

Other neural network architectures

In this chapter, we discussed one of the most popular feedforward neural network representations, the multi-layer perceptron. Neural networks are currently one of the most active research topics in the machine learning field, and there are many other neural network architectures that are well beyond the scope of this book. If you are interested in learning more about neural networks and algorithms for deep learning, I recommend reading this introduction and overview: Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. It is currently freely available at http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf.

Although neural networks really are a topic for another book, let's take at least a brief look at two other popular architectures, convolutional neural networks and recurrent neural networks.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs or ConvNets) gained popularity in computer vision due to their extraordinarily good performance on image classification tasks. As of today, CNNs are one of the most popular neural network architectures in deep learning. The key idea behind convolutional neural networks is to build many layers of feature detectors that take the spatial arrangement of pixels in an input image into account. Note that there exist many different variants of CNNs. In this section, we will discuss only the general idea behind this architecture. If you are interested in learning more about CNNs, I recommend taking a look at the publications of Yann LeCun (http://yann.lecun.com), one of the co-inventors of CNNs. In particular, I can recommend the following literature for getting started with CNNs:

  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • P. Y. Simard, D. Steinkraus, and J. C. Platt. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. IEEE, 2003, p.958.

As you will recall from our multi-layer perceptron implementation, we unrolled the images into feature vectors, and these inputs were fully connected to the hidden layer; spatial information was not encoded in this network architecture. In CNNs, we use receptive fields to connect the input layer to a feature map. These receptive fields can be understood as overlapping windows that we slide over the pixels of an input image to create a feature map. The stride length of the sliding window as well as the window size are additional hyperparameters of the model that we need to define a priori. The process of creating the feature map is also called convolution. An example of such a convolutional layer, the layer that connects the input pixels to each unit in the feature map, is shown in the following figure:

[Figure: a convolutional layer connecting receptive fields of the input image to units in a feature map]
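To make the idea of sliding a receptive field over the input concrete, here is a minimal NumPy sketch of a single-channel convolution in 'valid' mode with a configurable stride; the 3x3 filter values are made up for illustration, and the same weights are reused at every position, which is exactly the weight sharing discussed next:

import numpy as np

def convolve2d(image, kernel, stride=1):
    # slide a shared-weight receptive field over the image ('valid' mode)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.random.RandomState(1).rand(28, 28)     # e.g., one MNIST-sized image
kernel = np.random.RandomState(2).randn(3, 3)     # 3x3 shared weights (one filter)
print(convolve2d(image, kernel, stride=1).shape)  # (26, 26)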

It is important to note that the feature detectors are replicated, which means that the receptive fields that map the features to the units in the next layer share the same weights. Here, the key idea is that if a feature detector is useful in one part of the image, it might be useful in another part as well. A nice side effect of this approach is that it greatly reduces the number of parameters that need to be learned. Since we allow different patches of the image to be represented in different ways, CNNs are particularly good at recognizing objects of different sizes and at different positions in an image. We do not need to worry so much about rescaling and centering the images as was done for MNIST.

In CNNs, a convolutional layer is followed by a pooling layer (sometimes also called sub-sampling). In pooling, we summarize neighboring feature detectors to reduce the number of features for the next layer. Pooling can be understood as a simple method of feature extraction where we take the average or maximum value of a patch of neighboring features and pass it on to the next layer. To create a deep convolutional neural network, we stack multiple layers—alternating between convolutional and pooling layers—before we connect it to a multi-layer perceptron for classification. This is shown in the following figure:

[Figure: a deep CNN with alternating convolutional and pooling layers, followed by a multi-layer perceptron for classification]
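The pooling step can likewise be sketched in a few lines of NumPy; here, non-overlapping 2x2 max-pooling halves both spatial dimensions of a feature map (the function name and window size are chosen purely for illustration):

import numpy as np

def max_pool(feature_map, size=2):
    # keep only the largest value in each non-overlapping size x size patch
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size            # drop rows/columns that do not fit
    patches = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3))

feature_map = np.random.RandomState(1).rand(26, 26)
print(max_pool(feature_map).shape)   # (13, 13)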

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) can be thought of as feedforward neural networks with feedback loops, trained via backpropagation through time. In RNNs, the neurons only fire for a limited amount of time before they are (temporarily) deactivated. In turn, these neurons activate other neurons that fire at a later point in time. Basically, we can think of recurrent neural networks as MLPs with an additional time variable. The time component and dynamic structure allow the network to use not only the current inputs but also the inputs that it encountered earlier.

[Figure: a recurrent neural network with feedback connections across time steps]
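As a minimal illustration of this extra time dimension, the following sketch runs a tiny recurrent layer forward over a short sequence, feeding each hidden state back in at the next time step; the weight shapes, the tanh activation, and all variable names are illustrative choices rather than a specific published architecture:

import numpy as np

rng = np.random.RandomState(1)
n_features, n_hidden, n_steps = 4, 3, 5

W_xh = rng.randn(n_hidden, n_features) * 0.1   # input-to-hidden weights
W_hh = rng.randn(n_hidden, n_hidden) * 0.1     # hidden-to-hidden (recurrent) weights
b_h = np.zeros(n_hidden)

X_seq = rng.randn(n_steps, n_features)         # one input vector per time step
h = np.zeros(n_hidden)                         # initial hidden state

for t in range(n_steps):
    # the new hidden state depends on the current input and the previous state
    h = np.tanh(W_xh.dot(X_seq[t]) + W_hh.dot(h) + b_h)
    print('t=%d, h=%s' % (t, np.round(h, 3)))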

Although RNNs have achieved remarkable results in speech recognition, language translation, and connected handwriting recognition, these network architectures are typically much harder to train. This is because we cannot simply backpropagate the error layer by layer; we have to take the additional time component into account, which amplifies the vanishing and exploding gradient problems. In 1997, Sepp Hochreiter and Juergen Schmidhuber introduced the so-called Long Short-Term Memory (LSTM) units to overcome this problem (S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997).

However, we should note that there are many different variants of RNNs, and a detailed discussion is beyond the scope of this book.

A few last words about neural network implementation

You might be wondering why we went through all of this theory just to implement a simple multi-layer artificial neural network that can classify handwritten digits instead of using an open source Python machine learning library. One reason is that, at the time of writing this book, scikit-learn does not have an MLP implementation. More importantly, we (machine learning practitioners) should have at least a basic understanding of the algorithms that we are using in order to apply machine learning techniques appropriately and successfully.

Now that we know how feedforward neural networks work, we are ready to explore more sophisticated Python libraries built on top of NumPy, such as Theano (http://deeplearning.net/software/theano/), which allows us to construct neural networks more efficiently. We will see this in Chapter 13, Parallelizing Neural Network Training with Theano. Over the last couple of years, Theano has gained a lot of popularity among machine learning researchers, who use it to construct deep neural networks because of its ability to optimize mathematical expressions for computations on multi-dimensional arrays utilizing Graphics Processing Units (GPUs).

A great collection of Theano tutorials can be found at http://deeplearning.net/software/theano/tutorial/index.html#tutorial.
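As a small preview of the workflow that Chapter 13 covers in detail, the following sketch defines a symbolic expression in Theano and compiles it into a callable function; it assumes that Theano is installed, and the expression (a simple net input w^T x + b) is chosen only for illustration:

import numpy as np
import theano
import theano.tensor as T

# define symbolic variables and a symbolic expression
x = T.dvector('x')
w = T.dvector('w')
b = T.dscalar('b')
net_input = T.dot(w, x) + b

# compile the expression into a function that Theano can optimize for CPU or GPU
net_input_fn = theano.function(inputs=[x, w, b], outputs=net_input)
print(net_input_fn(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1))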

There are also a number of interesting libraries under active development for training neural networks on top of Theano, which you should keep on your radar.

Summary

In this chapter, you have learned about the most important concepts behind multi-layer artificial neural networks, which are currently one of the hottest topics in machine learning research. In Chapter 2, Training Machine Learning Algorithms for Classification, we started our journey with simple single-layer neural network structures, and now we have connected multiple neurons to a powerful neural network architecture in order to solve complex problems such as handwritten digit recognition. We demystified the popular backpropagation algorithm, which is one of the building blocks of many neural network models used in deep learning. After learning about the backpropagation algorithm, we were able to update the weights of such a complex neural network. We also added useful modifications, such as mini-batch learning and an adaptive learning rate, that allow us to train a neural network more efficiently.