Understanding backpropagation 

In this section, we will develop an intuition for backpropagation, which is a way of computing gradients using the chain rule. Understanding this process and its subtleties is critical for you to be able to effectively develop, design, and debug neural networks.

In general, given a function f(x), where x is a vector of inputs, we want to compute the gradient of f at x, denoted by ∇f(x). This is because, in the case of neural networks, the function f is basically a loss function (L) and the input x is the combination of the weights and the training data. The symbol ∇ is pronounced nabla. The training data is given as:

(xi, yi), i = 1, ..., N

Why do we take the gradient with respect to the weight parameters?

The training data is usually fixed, while the parameters are the variables we have control over. We therefore compute the gradient with respect to the parameters so that we can use it for parameter updates. For example, for the function f(x, y) = xy, the gradient ∇f is the vector of partial derivatives, that is:

∇f = [∂f/∂x, ∂f/∂y] = [y, x]
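As a quick sanity check, this analytical gradient can be compared against a numerical (finite-difference) approximation. The following is a minimal sketch; the function and the evaluation point are chosen arbitrarily here for illustration:

import numpy as np

def f(x, y):
    return x * y                      # the example function f(x, y) = xy

x, y = 3.0, -2.0                      # arbitrary evaluation point
analytical = np.array([y, x])         # ∇f = [∂f/∂x, ∂f/∂y] = [y, x]

h = 1e-5                              # numerical gradient via central differences
numerical = np.array([(f(x + h, y) - f(x - h, y)) / (2 * h),
                      (f(x, y + h) - f(x, y - h)) / (2 * h)])

print(analytical, numerical)          # both should be close to [-2.  3.]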

In a nutshell, backpropagation will consist of:

  • Doing a feed-forward operation
  • Comparing the output of the model with the desired output
  • Calculating the error
  • Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights
  • Using this to update the weights, and get a better model
  • Continuing this until we have a model that is good enough (a minimal sketch of this loop in code follows this list)
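The following is a minimal sketch of this loop for a single sigmoid neuron trained with plain gradient descent; the toy data, learning rate, and number of iterations are arbitrary choices made here for illustration, not values from the book:

import numpy as np

np.random.seed(0)
X = np.random.randn(4, 3)                    # 4 toy samples with 3 features each
t = np.array([0., 1., 1., 0.])               # desired outputs
W = np.random.randn(3) * 0.1                 # small random weight initialization
learning_rate = 0.5

for step in range(1000):
    z = 1.0 / (1.0 + np.exp(-X.dot(W)))      # 1. feed-forward operation (sigmoid)
    error = z - t                            # 2. compare with the desired output
    loss = np.mean(error ** 2)               # 3. calculate the error (mean squared error)
    dz = 2 * error / len(t)                  # 4. backpropagate through the loss...
    dW = X.T.dot(dz * z * (1 - z))           #    ...and through the sigmoid and dot product
    W -= learning_rate * dW                  # 5. update the weights

print(loss)                                  # the loss should have dropped after training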

We will build a neural network that recognizes digits from 0 to 9. This kind of network is used for sorting postal mail by zip code, recognizing phone numbers and house numbers from images, extracting package quantities from images of packages, and so on.

In most cases, backpropagation is implemented in a framework, such as TensorFlow. However, it is not true that simply adding an arbitrary number of hidden layers will make backpropagation magically work on any dataset. If the weight initialization is sloppy, these nonlinearity functions can saturate and stop learning; the training loss will go flat and refuse to go down. This is known as the vanishing gradient problem.

Consider a fully connected layer with a sigmoid activation that computes z = 1/(1 + exp(-W·x)). If your weight matrix W is initialized with values that are too large, the output of the matrix multiplication will also have a very large range, which in turn makes all the values in the vector z almost binary: either 1 or 0. If this is the case, then z*(1-z), which is the local gradient of the sigmoid non-linearity, becomes zero (vanishes) in both cases, which makes the gradient for both x and W zero as well. The rest of the backward pass will come out all zero from this point onward, on account of the multiplication in the chain rule.
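The following minimal sketch illustrates this effect with random data; the layer sizes and the two initialization scales are arbitrary choices, and the code only inspects the sigmoid's local gradient z*(1-z):

import numpy as np

np.random.seed(0)
x = np.random.randn(100)

for scale in (0.01, 10.0):                       # reasonable versus overly large initialization
    W = np.random.randn(50, 100) * scale
    z = 1.0 / (1.0 + np.exp(-np.dot(W, x)))      # sigmoid outputs
    print(scale, (z * (1 - z)).mean())           # mean local gradient of the sigmoid
# With scale 10.0 the outputs are almost all 0 or 1, so z*(1-z) is nearly zero everywhere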

Another nonlinear activation function is ReLU, which thresholds neurons at zero, as shown below. At its core, the forward and backward pass for a fully connected layer that uses ReLU would include:

z = np.maximum(0, np.dot(W, x)) # forward pass
dW = np.outer(z > 0, x) # backward pass: local gradient for W

If you observe this for a while, you'll see that if a neuron gets clamped to zero in the forward pass (that is, z = 0, so it doesn't fire), then its weights will get a zero gradient. This can lead to what is called the dead ReLU problem: if a ReLU neuron is unfortunately initialized in such a way that it never fires, or if its weights get knocked into this regime by a large update during training, the neuron will remain permanently dead. It is similar to permanent, irrecoverable brain damage. Sometimes, you can even forward the entire training set through a trained network and realize that a large fraction (about 40%) of your neurons were zero the entire time.
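A simple way to check for this in practice is to forward a batch of data through a ReLU layer and count the neurons that never activate. This is a hedged sketch with made-up, non-negative inputs and a deliberately poor initialization, not the book's network:

import numpy as np

np.random.seed(0)
X = np.random.rand(1000, 100)                # non-negative inputs, e.g. pixel intensities
W = np.random.randn(50, 100) - 0.2           # shifted initialization: many rows are mostly negative

Z = np.maximum(0, X.dot(W.T))                # ReLU forward pass over the whole batch
dead = np.all(Z == 0, axis=0)                # neurons that output zero for every single sample
print(dead.mean())                           # fraction of permanently silent ("dead") neurons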

In calculus, the chain rule is used for computing the derivative of the composition of two or more functions. That is, if we have two functions f and g, then the chain rule expresses the derivative of their composition f ∘ g (the function that maps x to f(g(x))) in terms of the derivatives of f and g and the product of functions as follows: (f ∘ g)' = (f' ∘ g) · g'.

There is a more explicit way to represent this in terms of the variable. Let F = f ∘ g, or equivalently, F(x) = f(g(x)) for all x. Then one can also write:

F'(x)=f'(g(x))g'(x).
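As a quick numerical illustration of this rule, with an arbitrarily chosen pair of functions g(x) = sin(x) and f(u) = u²:

import numpy as np

f, df = lambda u: u ** 2, lambda u: 2 * u     # outer function and its derivative
g, dg = np.sin, np.cos                        # inner function and its derivative

x = 1.3                                       # arbitrary evaluation point
analytical = df(g(x)) * dg(x)                 # chain rule: F'(x) = f'(g(x)) * g'(x)

h = 1e-6                                      # finite-difference estimate of F'(x)
numerical = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(analytical, numerical)                  # the two values should agree closely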

The chain rule can also be written with the help of Leibniz's notation in the following way. If a variable z depends on a variable y, which in turn depends on a variable x (so that y and z are dependent variables), then z depends on x as well, via the intermediate variable y, and the chain rule states that dz/dx = dz/dy · dy/dx. As a concrete application of the chain rule, the forward and backward pass for a fully connected layer with a sigmoid activation can be written as:

z = 1/(1 + np.exp(-np.dot(W, x))) # forward pass
dx = np.dot(W.T, z*(1-z)) # backward pass: local gradient for x
dW = np.outer(z*(1-z), x) # backward pass: local gradient for W
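One way to gain confidence in hand-derived gradients like dW above is a numerical gradient check. The sketch below treats the sum of the layer's outputs as a stand-in scalar loss (an assumption made purely for the check) and compares one entry of dW against a finite-difference estimate:

import numpy as np

np.random.seed(0)
W = np.random.randn(4, 5) * 0.1
x = np.random.randn(5)

forward = lambda W: 1.0 / (1.0 + np.exp(-np.dot(W, x)))   # the sigmoid layer from above

z = forward(W)
dW = np.outer(z * (1 - z), x)                # analytical gradient of sum(z) with respect to W

h = 1e-5                                     # perturb a single entry, W[1, 2]
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += h
W_minus[1, 2] -= h
numerical = (forward(W_plus).sum() - forward(W_minus).sum()) / (2 * h)
print(dW[1, 2], numerical)                   # should agree to several decimal places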

The forward pass, on the left of the following figure, calculates z as a function f(x, y) of the input variables x and y. The right side of the figure represents the backward pass: receiving dL/dz, the gradient of the loss function with respect to z, the gradients of the loss with respect to x and y can be calculated by applying the chain rule, as shown in the figure:
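To make the idea in the figure concrete in code, here is a minimal sketch for the simplest such function, f(x, y) = x * y, with a made-up value of dz standing in for dL/dz:

x, y = 3.0, -4.0
z = x * y          # forward pass: z = f(x, y)

dz = 2.0           # hypothetical upstream gradient dL/dz
dx = dz * y        # chain rule: dL/dx = dL/dz * dz/dx, and dz/dx = y
dy = dz * x        # chain rule: dL/dy = dL/dz * dz/dy, and dz/dy = x
print(dx, dy)      # -8.0, 6.0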