Deep Learning Quick Reference

By: Mike Bernico

Overview of this book

Deep learning has become essential for anyone entering the field of artificial intelligence. This book makes deep learning techniques more accessible, practical, and relevant to practicing data scientists, and moves deep learning from academia to the real world through practical examples. You will learn how TensorBoard is used to monitor the training of deep neural networks and how to solve binary classification problems using deep learning. You will then learn to optimize the hyperparameters of your deep learning models. The book then takes you through the practical implementation of training CNNs, RNNs, and LSTMs with word embeddings and seq2seq models from scratch. Later, the book explores advanced topics such as using a Deep Q-Network to solve an autonomous agent problem and using two adversarial networks to generate artificial images that appear real. For implementation purposes, we look at popular Python-based deep learning frameworks such as Keras and TensorFlow. Each chapter provides best practices and safe choices to help readers make the right decisions while training deep neural networks. By the end of this book, you will be able to solve real-world problems quickly with deep neural networks.

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

The Building Blocks of Deep Learning

The deep neural network architectures

Optimization algorithms for deep learning

Deep learning frameworks

Building datasets for deep learning

Summary

Using Deep Learning to Solve Regression Problems

Regression analysis and deep neural networks

Using deep neural networks for regression

Building an MLP in Keras

Building a deep neural network in Keras

Saving and loading a trained Keras model

Summary

Monitoring Network Training Using TensorBoard

A brief overview of TensorBoard

Setting up TensorBoard

Connecting Keras to TensorBoard

Using TensorBoard

Summary

Using Deep Learning to Solve Binary Classification Problems

Binary classification and deep neural networks

Case study – epileptic seizure recognition

Building a binary classifier in Keras

Using the checkpoint callback in Keras

Measuring ROC AUC in a custom callback

Measuring precision, recall, and f1-score

Summary

Using Keras to Solve Multiclass Classification Problems

Multiclass classification and deep neural networks

Case study - handwritten digit classification

Building a multiclass classifier in Keras

Controlling variance with dropout

Controlling variance with regularization

Summary

Hyperparameter Optimization

Should network architecture be considered a hyperparameter?

Which hyperparameters should we optimize?

Hyperparameter optimization strategies

Summary

Training a CNN from Scratch

Introducing convolutions

Training a convolutional neural network in Keras

Using data augmentation

Summary

Transfer Learning with Pretrained CNNs

Overview of transfer learning

When transfer learning should be used

The impact of source/target volume and similarity

Transfer learning in Keras

Summary

Training an RNN from Scratch

Introducing recurrent neural networks

A refresher on time series problems

Using an LSTM for time series prediction

Summary

Training LSTMs with Word Embeddings from Scratch

An introduction to natural language processing

Vectorizing text

Word embedding

Keras embedding layer

1D CNNs for natural language processing

Case studies for document classifications

Summary

Training Seq2Seq Models

Sequence-to-sequence models

Machine translation

Summary

Using Deep Reinforcement Learning

Reinforcement learning overview

The Keras reinforcement learning framework

Building a reinforcement learning agent in Keras

Summary

Generative Adversarial Networks

An overview of the GAN

Deep Convolutional GAN architecture

How GANs can fail

Safe choices for GAN

Generating MNIST images using a Keras GAN

Generating CIFAR-10 images using a Keras GAN

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

The deep neural network architectures

Deep neural network architectures can vary greatly in structure depending on the network's application, but they all share some basic components. In this section, we will talk briefly about those components.

In this book, I'll define a deep neural network as a network with more than a single hidden layer. Beyond that, we won't attempt to limit membership in the Deep Learning Club. As such, our networks might have fewer than 100 neurons, or possibly millions. We might use special layers of neurons, including convolutional and recurrent layers, but we will refer to all of these as neurons nonetheless.

Neurons

A neuron is the atomic unit of a neural network. The idea is sometimes described as inspired by biology; however, that's a topic for a different book. Neurons are typically arranged into layers. In this book, if I'm referring to a specific neuron, I'll use the notation $n_k^l$, where $l$ is the layer the neuron is in and $k$ is the neuron number. As we will be using programming languages that use zero-based indexing, my notation will also be zero-based.

At their core, most neurons are composed of two functions that work together: a linear function and an activation function. Let us take a high-level look at those two components.

The neuron linear function

The first component of the neuron is a linear function whose output is the sum of the inputs, each multiplied by a coefficient. This function is really more or less a linear regression. These coefficients are typically referred to as weights in neural network speak. For example, given some neuron with the input features of x1, x2, and x3, and output z, this linear component or the neuron linear function would simply be:

$$z = x_1\theta_1 + x_2\theta_2 + x_3\theta_3 + b$$

Where $\theta_1$, $\theta_2$, and $\theta_3$ are weights or coefficients that we will need to learn given the data, and $b$ is a bias term.
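As a minimal sketch of the linear component, the weighted sum can be written in a few lines of plain Python. The function name `neuron_linear`, the variable name `theta`, and the numeric values are illustrative assumptions; in a real network the weights and bias would be learned from data.

```python
def neuron_linear(inputs, weights, bias):
    """Return z: the sum of each input times its weight, plus the bias."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Hypothetical values chosen only for illustration.
x = [1.0, 2.0, 3.0]        # input features x1, x2, x3
theta = [0.5, -1.0, 2.0]   # one weight per input
b = 0.25                   # bias term

z = neuron_linear(x, theta, b)
print(z)  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75
```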

Neuron activation functions

The second function of the neuron is the activation function, which is tasked with introducing a nonlinearity between neurons. A commonly used activation is the sigmoid activation, which you may be familiar with from logistic regression. It squeezes the output of the neuron into an output space where very large positive values of z are driven to 1 and very large negative values of z are driven to 0.

The sigmoid function looks like this:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

It turns out that the activation function is very important for intermediate neurons. Without it, one could prove that a stack of neurons with linear activations (which is really no activation, or more formally an activation function where $f(z) = z$) is really just a single linear function.

A single linear function is undesirable in this case because there are many scenarios where our network may be underspecified for the problem at hand. That is to say, the network can't model the data well because of nonlinear relationships between the input features and the target variable (what we're predicting).
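The collapse of stacked linear neurons can be checked with a small sketch. To keep the arithmetic visible, it assumes two single-input neurons with hand-picked, hypothetical weights; the same algebra carries over to full layers, where the products become matrix products.

```python
def linear_neuron(x, w, b):
    """A neuron with a linear (identity) activation: output = w * x + b."""
    return w * x + b


# Hypothetical weights and biases for two stacked single-input neurons.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

# Algebraically: w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2

for x in [-2.0, 0.0, 1.5, 10.0]:
    stacked = linear_neuron(linear_neuron(x, w1, b1), w2, b2)
    single = linear_neuron(x, w_combined, b_combined)
    # The two-neuron stack and the single neuron agree on every input.
    assert abs(stacked - single) < 1e-9
```

However many linear neurons are stacked this way, the result can always be rewritten as one weight and one bias, so depth adds no modeling power without a nonlinear activation.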

The canonical example of a function that cannot be modeled with a linear function is the exclusive OR function, which is shown in the following figure:
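One rough way to see this is a brute-force search over linear threshold units. The sketch below is not a proof: it only tries weights on a coarse, arbitrarily chosen grid. It is still illustrative, because a linear unit can reproduce OR on all four input pairs but never gets more than three of the four XOR cases right. The function name `best_linear_accuracy` and the grid are assumptions made for this example.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR = [0, 1, 1, 0]
OR = [0, 1, 1, 1]


def best_linear_accuracy(targets, grid):
    """Return the most points any linear threshold unit on the grid gets right."""
    best = 0
    for w1, w2, b in product(grid, repeat=3):
        correct = sum(
            int((w1 * x1 + w2 * x2 + b > 0) == bool(t))
            for (x1, x2), t in zip(INPUTS, targets)
        )
        best = max(best, correct)
    return best


grid = [i * 0.5 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0

best_or = best_linear_accuracy(OR, grid)    # OR is linearly separable
best_xor = best_linear_accuracy(XOR, grid)  # XOR is not

print(best_or, best_xor)
```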

Other common activation functions are the tanh function and ReLU, the rectified linear unit.

The hyperbolic tangent, or tanh function, looks like this:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

Tanh usually works better than sigmoid for intermediate layers. As you can probably see, the output of tanh lies between -1 and 1, whereas the output of sigmoid lies between 0 and 1. This additional width provides some resilience against a phenomenon known as the vanishing/exploding gradient problem, which we will cover in more detail later. For now, it's enough to know that the vanishing gradient problem can cause the early layers of a network to converge very slowly, if at all. Because of that, networks using tanh will tend to converge somewhat faster than networks that use sigmoid activation. That said, they are still not as fast as ReLU.

ReLU, or the rectified linear unit, is defined simply as:

$$f(z) = \max(0, z)$$

It's a safe bet and we will use it most of the time throughout this book. Not only is ReLU easy to compute and differentiate, it's also resilient against the vanishing gradient problem. One drawback of ReLU is that its first derivative is undefined at exactly 0, and its gradient is 0 for every negative input. Variants such as leaky ReLU are slightly more expensive to compute, but more robust against this issue.

For completeness, here's a somewhat obvious graph of ReLU:
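The activation functions discussed in this section can be sketched in plain Python as follows. These are scalar versions for illustration only; frameworks such as Keras ship their own vectorized implementations. The leaky ReLU slope of 0.01 is a common default but is an assumption here, not a value taken from this book.

```python
import math


def sigmoid(z):
    """Squash z into (0, 1); branch on the sign of z to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    """Squash z into (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Pass positive values through unchanged and clamp negatives to 0."""
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    """Like ReLU, but keep a small slope for negative inputs."""
    return z if z > 0 else slope * z


for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```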

The loss and cost functions in deep learning

Every machine learning model really starts with a cost function. Simply put, a cost function allows you to measure how well your model is fitting the training data. In this book, we will define the loss function as the measure of how well the model fits a single observation within the training set. The cost function will then most often be an average of the loss across the training set. We will revisit loss functions later when we introduce each type of neural network; however, quickly consider the cost function for linear regression as an example:

$$J = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2$$

In this case, the loss function would be $(\hat{y}_i - y_i)^2$, which is really the squared error. So then $J$, our cost function, is really just the mean squared error, or an average of the squared error across the entire dataset of $m$ observations. The factor of 1/2 is included by convention because it makes some of the calculus cleaner.
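The relationship between loss and cost can be sketched in code. This example assumes the linear regression cost is the sum of squared errors divided by 2m, where m is the number of observations; the function names and the sample values are hypothetical.

```python
def squared_error(y_hat, y):
    """Loss for a single observation."""
    return (y_hat - y) ** 2


def cost(predictions, targets):
    """Cost J: the average loss over the dataset, scaled by the conventional 1/2."""
    m = len(targets)
    total = sum(squared_error(p, t) for p, t in zip(predictions, targets))
    return total / (2 * m)


# Hypothetical predictions and true values for three observations.
y_hat = [2.5, 0.0, 2.0]
y = [3.0, -0.5, 2.0]

J = cost(y_hat, y)
print(J)
```

The individual losses here are 0.25, 0.25, and 0, so the cost is 0.5 divided by 6.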

The forward propagation process

Forward propagation is the process by which we attempt to predict our target variable using the features present in a single observation. Imagine we had a two-layer neural network. In the forward propagation process, we would start with the features present within that observation and then multiply those features by their associated coefficients within layer 1 and add a bias term for each neuron. After that, we would send that output to the activation for the neuron. Following that, the output would be sent to the next layer, and so on, until we reach the end of the network where we are left with our network's prediction:
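As a minimal sketch of this process, the following code pushes one observation through a tiny two-layer network: a hidden layer of two ReLU neurons followed by a single sigmoid output neuron. The layer sizes, the weights, and the helper names `dense` and `forward` are all assumptions chosen for illustration; real networks learn their weights and use vectorized framework code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One layer: each neuron applies its linear function, then the activation."""
    return [
        activation(sum(x * w for x, w in zip(inputs, neuron_weights)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical, hand-picked parameters.
W1 = [[0.5, -0.2], [0.1, 0.4]]  # layer 1: two neurons, two inputs each
b1 = [0.0, -0.1]
W2 = [[1.0, -1.0]]              # layer 2: one output neuron
b2 = [0.2]


def forward(features):
    """Send one observation through layer 1, then layer 2."""
    hidden = dense(features, W1, b1, relu)
    return dense(hidden, W2, b2, sigmoid)[0]


y_hat = forward([1.0, 2.0])
print(y_hat)
```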

The back propagation function

Once forward propagation is complete, we have the network's prediction for each data point. We also know that data point's actual value. Typically, the prediction is denoted $\hat{y}$, while the actual value of the target variable is denoted $y$.

Once both $y$ and $\hat{y}$ are known, the network's error can be computed using the cost function. Recall that the cost function is the average of the loss function.

In order for learning to occur within the network, the network's error signal must be propagated backwards through the network layers from the last layer to the first. Our goal in back propagation is to propagate this error signal backwards through the network while using it to update the network weights as the signal travels. Mathematically, to do so we need to minimize the cost function by nudging the weights towards values that make the cost function the smallest. This process is called gradient descent.

The gradient is the vector of partial derivatives of the error function with respect to each weight within the network. The gradient for each weight can be calculated, layer by layer, using the chain rule and the gradients of the layers above.
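To make the chain rule concrete, here is a small sketch for a single sigmoid neuron with a halved squared-error loss. It computes the gradient with respect to the weight analytically and compares it with a finite-difference estimate. The setup (one input, one weight, this particular loss) and every value are assumptions made for illustration; a full network repeats the same multiplication of local derivatives layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Halved squared error of a single sigmoid neuron on one observation."""
    y_hat = sigmoid(w * x + b)
    return 0.5 * (y_hat - y) ** 2


def grad_w(w, b, x, y):
    """dL/dw = dL/dy_hat * dy_hat/dz * dz/dw, by the chain rule."""
    y_hat = sigmoid(w * x + b)
    dL_dyhat = y_hat - y
    dyhat_dz = y_hat * (1.0 - y_hat)  # derivative of the sigmoid
    dz_dw = x
    return dL_dyhat * dyhat_dz * dz_dw


# Hypothetical weight, bias, input, and target.
w, b, x, y = 0.3, -0.1, 2.0, 1.0

analytic = grad_w(w, b, x, y)

# Central finite difference as an independent check on the analytic gradient.
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)
```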

Once the gradients of each layer are known, we can use the gradient descent algorithm to minimize the cost function.

Gradient descent repeats the following update for each weight $\theta$ until the network's error is minimized and the process has converged:

$$\theta := \theta - \alpha \frac{\partial J}{\partial \theta}$$

The gradient descent algorithm multiplies the gradient by a learning rate, called alpha ($\alpha$), and subtracts that value from the current value of each weight. The learning rate is a hyperparameter.
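The update rule can be sketched on a problem small enough to follow by hand: fitting a line to five points that lie exactly on y = 2x + 1. The toy data, the learning rate of 0.05, and the fixed number of iterations are assumptions for this example; it uses the halved mean squared error as the cost and recomputes the gradient over the whole dataset at every step.

```python
# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def gradients(w, b):
    """Partial derivatives of J = 1/(2m) * sum((w*x + b - y)^2) w.r.t. w and b."""
    m = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    dw = sum(e * x for e, x in zip(errors, xs)) / m
    db = sum(errors) / m
    return dw, db


alpha = 0.05      # learning rate, a hyperparameter
w, b = 0.0, 0.0   # start from arbitrary initial weights

for _ in range(5000):
    dw, db = gradients(w, b)
    w -= alpha * dw   # step against the gradient
    b -= alpha * db

print(w, b)  # approaches the true values 2 and 1
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence needlessly slow, which is why it is treated as a hyperparameter to tune.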

Stochastic and minibatch gradient descents

The algorithm described in the previous section assumes a forward and corresponding backward pass over the entire dataset, and as such it's called batch gradient descent.

Another possible way to do gradient descent would be to use a single data point at a time, updating the network weights as we go. This is known as stochastic gradient descent. This method might help speed up convergence around saddle points, where the network might otherwise stop converging. Of course, the error estimated from only a single point may not be a very good approximation of the error of the entire dataset.

The best solution to this problem is minibatch gradient descent, in which we take a random subset of the data, called a minibatch, to compute our error and update our network weights. This is almost always the best option. It has the additional benefit of naturally splitting a very large dataset into chunks that are more easily managed in the memory of a machine, or even across machines.
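A minimal sketch of the minibatch variant follows. Each epoch shuffles the data, slices it into small batches, and applies one weight update per batch. The toy dataset (ten points on y = 2x + 1), the batch size of 4, the learning rate, and the helper name `minibatches` are all assumptions for illustration. Setting the batch size to 1 would give the single-point variant, and setting it to the dataset size would give batch gradient descent.

```python
import random


def minibatches(data, batch_size, rng):
    """Yield the data in shuffled chunks of at most batch_size items."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]  # points on y = 2x + 1

alpha = 0.02
w, b = 0.0, 0.0

for epoch in range(5000):
    for batch in minibatches(data, batch_size=4, rng=rng):
        m = len(batch)
        # Error of the current weights on this minibatch only.
        errors = [(w * x + b) - y for x, y in batch]
        dw = sum(e * x for e, (x, _) in zip(errors, batch)) / m
        db = sum(errors) / m
        w -= alpha * dw
        b -= alpha * db

print(w, b)  # close to the true values 2 and 1
```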

This is an extremely high-level description of one of the most important parts of a neural network, which we believe fits with the practical nature of this book. In practice, most modern frameworks handle these steps for us; however, they are most certainly worth knowing at least theoretically. We encourage the reader to go deeper into forward and backward propagation as time permits.
