Effective training techniques

In this section, we will explore several techniques that help us train a neural network quickly. We will look at preprocessing the data so that the features have a similar scale, randomly initializing the weights to avoid exploding or vanishing gradients, and activation functions that are more effective than the sigmoid function.

We begin with the normalization of the data and then we'll gain some intuition on how it works. Suppose we have two features, X1 and X2, taking a different range of values—X1 from 2 to 5, and X2 from 1 to 2—which is depicted in the following diagram:

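We will begin by calculating the mean of each feature, where m is the number of examples and $x_j^{(i)}$ is the value of feature $X_j$ in example i:

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$$

After that, we'll subtract the mean $\mu_j$ from every value of the corresponding feature:

$$x_j^{(i)} := x_j^{(i)} - \mu_j$$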

The output attained will be as follows:

Both features are now centered around 0: values that were close to the mean end up near 0, and values that differed from it end up far away from 0.

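One problem still persists: the variance. X1 still has a greater variance than X2. To solve this, we'll calculate the variance of each feature, where m is the number of examples and $x_j^{(i)}$ is the zero-mean value of feature $X_j$ in example i:

$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)}\right)^2$$

This is the average of the squares of the zero-mean feature values, that is, the values from which we subtracted the mean in the previous step. We'll then take the standard deviation $\sigma_j$, which is the square root of the variance, and divide each zero-mean value by it:

$$x_j^{(i)} := \frac{x_j^{(i)}}{\sigma_j}$$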

This is graphically represented as follows:

Notice how, in this graph, X1 now has approximately the same variance as X2.
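The four steps above (mean, subtraction, variance, division by the standard deviation) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name `FeatureScalingSketch`, the method `standardize`, and the sample values are made up for this example and do not come from the book's code.

```java
import java.util.Arrays;

final class FeatureScalingSketch {

    /** Returns a standardized copy of one feature column (zero mean, unit variance). */
    static double[] standardize(double[] feature) {
        int m = feature.length;

        // Step 1: mean of the feature.
        double mean = 0.0;
        for (double v : feature) {
            mean += v;
        }
        mean /= m;

        // Step 2: subtract the mean so the values are centered around 0.
        double[] centered = new double[m];
        for (int i = 0; i < m; i++) {
            centered[i] = feature[i] - mean;
        }

        // Step 3: variance = average of the squared zero-mean values.
        double variance = 0.0;
        for (double v : centered) {
            variance += v * v;
        }
        variance /= m;

        // Step 4: divide by the standard deviation (a constant feature maps to all zeros).
        double std = Math.sqrt(variance);
        double[] scaled = new double[m];
        for (int i = 0; i < m; i++) {
            scaled[i] = (std == 0.0) ? 0.0 : centered[i] / std;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] x1 = {2.0, 3.0, 4.0, 5.0}; // made-up values in the range 2 to 5
        double[] x2 = {1.0, 1.2, 1.7, 2.0}; // made-up values in the range 1 to 2
        System.out.println(Arrays.toString(standardize(x1)));
        System.out.println(Arrays.toString(standardize(x2)));
    }
}
```

After this transformation, both columns have a mean of 0 and a variance of 1, so neither feature dominates purely because of its original range.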

Normalizing the data helps the neural network to train faster. If we plot the weights and the cost function J for data that has not been normalized, we'll get a three-dimensional, irregular surface as follows:

If we plot the contours in a two-dimensional plane, they may look something like the following skewed plot:

Observe that, depending on where it starts, the model may take a different amount of time to reach the minimum, that is, the red point marked in the plot.

If we consider this example, we can see that the cost values oscillate over a wide range, so it takes a long time to reach the minimum.

To reduce the oscillation, we sometimes need to lower the learning rate alpha, which means that we take even smaller steps. The reason we lower the learning rate is to avoid divergence. Diverging means that the steps keep overshooting like this and never reach the minimum value, as shown in the following plot:

Plotting the same data with normalization will give you a graph as follows:

So we get a cost surface that is regular, or spherical, in shape, and if we plot its contours in a two-dimensional plane, we get a more rounded graph:

Here, regardless of where you initialize the weights, it will take about the same time to get to the minimum point. Look at the following diagram; you can see that the values are stable:

I think it is now safe to conclude that normalizing the data is very important and harmless. So, if you are not sure whether to do it or not, it's always a better idea to do it than avoid it.
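The link between the shape of the cost surface and the learning rate can be sketched with a toy example. The code below assumes a simple quadratic cost J(w1, w2) = 0.5 * (a * w1^2 + b * w2^2) as a stand-in for a real network's cost; a skewed surface corresponds to very different values of a and b, and a round one to equal values. The class name, the method `stepsToMinimum`, and all numbers are hypothetical and chosen only for illustration.

```java
final class LearningRateSketch {

    /**
     * Runs gradient descent on J(w1, w2) = 0.5 * (a * w1^2 + b * w2^2), starting at (1, 1).
     * Returns the number of steps until both weights are within tol of the minimum at (0, 0),
     * or -1 if the weights blow up or maxSteps is reached first.
     */
    static int stepsToMinimum(double a, double b, double alpha, double tol, int maxSteps) {
        double w1 = 1.0;
        double w2 = 1.0;
        for (int step = 1; step <= maxSteps; step++) {
            // The gradient of J is (a * w1, b * w2).
            w1 -= alpha * a * w1;
            w2 -= alpha * b * w2;
            if (Math.abs(w1) > 1e6 || Math.abs(w2) > 1e6) {
                return -1; // the steps keep overshooting: divergence
            }
            if (Math.abs(w1) < tol && Math.abs(w2) < tol) {
                return step;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Skewed surface (a = 1, b = 25): alpha = 0.1 overshoots along w2 and diverges.
        System.out.println(stepsToMinimum(1, 25, 0.1, 1e-6, 10_000));
        // Lowering alpha to 0.07 makes it converge, but only slowly.
        System.out.println(stepsToMinimum(1, 25, 0.07, 1e-6, 10_000));
        // Round surface (a = b = 1): a much larger alpha is safe and needs far fewer steps.
        System.out.println(stepsToMinimum(1, 1, 0.9, 1e-6, 10_000));
    }
}
```

In this toy setting, the skewed surface forces a small learning rate to stay stable, while the round surface tolerates a large one, which is the intuition behind normalizing the inputs.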

Initializing the weights

We are already aware that we have no weight values at the beginning. In order to solve that problem, we will initialize the weights with random non-zero values. This might work well, but here, we are going to look at how initializing weights greatly impacts the learning time of our neural network.

Suppose we have a deep neural network with many hidden layers, and each of these hidden layers has two neurons. For the sake of simplicity, we'll use not the sigmoid function but the identity activation function, which simply leaves the input untouched. The activation value is given by f(z) = z:

Assume that we have weights as depicted in the previous diagram. Because of the identity function, the value of z at each hidden neuron is the same as its activation value. The first neuron in the first hidden layer will be 1*0.5+1*0, which is 0.5. The same applies to the second neuron. When we move to the second hidden layer, the value of z is 0.5*0.5+0.5*0, which gives us 0.25 or 1/4; if we continue with the same logic, we'll have 1/8, 1/16, and so on, until we reach the general formula $0.5^n$ for the n-th hidden layer. What this tells us is that the deeper our neural network becomes, the smaller this activation value gets. This is known as the vanishing gradient problem. Strictly speaking, the term refers to gradients rather than activation values, but the same reasoning carries over to gradients. If we replace the 0.5 with 1.5, then we will have $1.5^n$ in the end, which tells us that the deeper our neural network gets, the greater the activation value becomes. This is known as the exploding gradient problem.
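The arithmetic above is easy to check with a short forward pass. The sketch below assumes two inputs equal to 1, two identity neurons per hidden layer, and that each neuron has one weight w and one weight 0, as in the worked numbers; the class and method names are hypothetical.

```java
final class DepthScalingSketch {

    /**
     * Forward pass through `layers` hidden layers of two identity neurons each.
     * Every neuron uses the weight w on one incoming value and 0 on the other.
     * Returns the activation of the first neuron in the last hidden layer.
     */
    static double activationAfter(int layers, double w) {
        double a1 = 1.0; // first input
        double a2 = 1.0; // second input
        for (int l = 0; l < layers; l++) {
            double z1 = a1 * w + a2 * 0.0;
            double z2 = a1 * 0.0 + a2 * w;
            // Identity activation: f(z) = z.
            a1 = z1;
            a2 = z2;
        }
        return a1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 4, 10, 50}) {
            System.out.printf("n=%d  w=0.5 -> %.3e   w=1.5 -> %.3e%n",
                    n, activationAfter(n, 0.5), activationAfter(n, 1.5));
        }
    }
}
```

With w = 0.5 the value shrinks as 0.5^n, and with w = 1.5 it grows as 1.5^n, so even a moderately deep network ends up with activations that are either negligible or very large.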

In order to avoid both situations, we may want to replace the zero value with 0.5. If we do that, the first neuron in the first hidden layer will have the value 1*0.5+1*0.5, which is equal to 1. This does not really help our cause, because our output is then simply equal to the input, so we can slightly modify the idea and use not exactly 0.5, but random values that are as close to 0.5 as possible.

In a way, we would like the weights to have a variance of 0.5. More formally, we want the variance of the weights to be 1 divided by the number of neurons in the previous layer, written here as $n_{\text{prev}}$, which is mathematically expressed as follows:

$$\mathrm{Var}(W) = \frac{1}{n_{\text{prev}}}$$

To obtain the actual values, we draw random values from a standard normal distribution and multiply them by the square root of this variance. This is known as the Xavier initialization:

$$W = \mathcal{N}(0, 1) \cdot \sqrt{\frac{1}{n_{\text{prev}}}}$$

If we replace the 1 with a 2 in this formula, the neural network often performs even better, particularly with ReLU activations, and tends to converge faster to the minimum.
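As a rough sketch of this recipe, the code below draws standard normal values with `java.util.Random.nextGaussian()` and scales them by the square root of the target variance. The class `WeightInitSketch`, the method names, the `numerator` parameter (1 for the Xavier form, 2 for the variant just described), and the layer sizes are assumptions made for this illustration, not the book's implementation.

```java
import java.util.Random;

final class WeightInitSketch {

    /**
     * Returns a weight matrix of shape [nCurrent][nPrev] whose entries are
     * standard normal values scaled by sqrt(numerator / nPrev).
     */
    static double[][] init(int nPrev, int nCurrent, double numerator, Random rng) {
        double scale = Math.sqrt(numerator / nPrev);
        double[][] w = new double[nCurrent][nPrev];
        for (int i = 0; i < nCurrent; i++) {
            for (int j = 0; j < nPrev; j++) {
                w[i][j] = rng.nextGaussian() * scale;
            }
        }
        return w;
    }

    /** Empirical variance of all entries, used here only to check the result. */
    static double variance(double[][] w) {
        double sum = 0.0;
        double sumSquares = 0.0;
        long count = 0;
        for (double[] row : w) {
            for (double v : row) {
                sum += v;
                sumSquares += v * v;
                count++;
            }
        }
        double mean = sum / count;
        return sumSquares / count - mean * mean;
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed so the run is repeatable
        double[][] xavier = init(200, 500, 1.0, rng);
        double[][] doubled = init(200, 500, 2.0, rng);
        System.out.println("target 1/200 = 0.005, measured " + variance(xavier));
        System.out.println("target 2/200 = 0.010, measured " + variance(doubled));
    }
}
```

The measured variances should land close to 1/n and 2/n for n neurons in the previous layer, which is the property the formulas are designed to guarantee.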

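We may also find different versions of the formula. One widely used variant, from Glorot and Bengio, is the following:

$$\mathrm{Var}(W) = \frac{2}{n_{\text{prev}} + n_{\text{current}}}$$

It modifies the term so that it depends on both the number of neurons in the current layer, $n_{\text{current}}$, and the number of neurons in the previous layer, $n_{\text{prev}}$; in this form, the two counts are added together.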

Activation functions

We've learned about the sigmoid function so far, but it is used comparatively less in the modern era of deep learning. The reason for this is that the tanh function usually works much better than the sigmoid function. The tanh function is graphically represented as follows:

If you look at the graph, you can see that this function looks similar to the sigmoid function, but is centered at zero. The reason it works better is that it's easier to center your data around 0 than around 0.5.

However, they both share a downside: when the weights, and with them the input z to the function, become very large or very small, the slope of the curve shrinks to almost zero, and that slows down the training of our neural network a lot. In order to overcome this, we have the ReLU function, which keeps a constant slope of 1 for every positive input.
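The shrinking slope is easy to see numerically. The sketch below uses the standard derivatives, s(z) * (1 - s(z)) for the sigmoid and 1 - tanh(z)^2 for tanh; the class and method names are hypothetical and chosen only for this illustration.

```java
final class SaturationSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Slope of the sigmoid at z: s(z) * (1 - s(z)). */
    static double sigmoidSlope(double z) {
        double s = sigmoid(z);
        return s * (1.0 - s);
    }

    /** Slope of tanh at z: 1 - tanh(z)^2. */
    static double tanhSlope(double z) {
        double t = Math.tanh(z);
        return 1.0 - t * t;
    }

    public static void main(String[] args) {
        for (double z : new double[] {0.0, 2.0, 5.0, 10.0}) {
            System.out.printf("z=%5.1f  sigmoid slope=%.3e  tanh slope=%.3e%n",
                    z, sigmoidSlope(z), tanhSlope(z));
        }
    }
}
```

Both slopes are largest at z = 0 (0.25 for the sigmoid and 1 for tanh) and fall towards zero as z moves away from it in either direction, which is why the weight updates become tiny once a neuron saturates.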

The ReLU function is one of the reasons we can afford to have deeper neural networks with high efficiency. It has also become the default choice for most neural networks. The ReLU function is graphically represented as follows:

A small modification to the ReLU function leads us to the leaky ReLU function, which is shown in the next graph; here, for negative inputs, instead of outputting zero, it outputs a small negative value:

Sometimes this works better than the ReLU function, but most of the time the plain ReLU function works just fine.
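Both functions fit in a few lines of Java. In this sketch, the class name, the method names, and the leak value of 0.01 are assumptions chosen for illustration; the leak is the small slope applied to negative inputs.

```java
final class ReluSketch {

    /** ReLU: passes positive inputs through and outputs 0 otherwise. */
    static double relu(double z) {
        return Math.max(0.0, z);
    }

    /** Slope of ReLU: 1 for positive inputs, 0 otherwise. */
    static double reluSlope(double z) {
        return z > 0 ? 1.0 : 0.0;
    }

    /** Leaky ReLU: like ReLU, but negative inputs are scaled by a small leak instead of zeroed. */
    static double leakyRelu(double z, double leak) {
        return z > 0 ? z : leak * z;
    }

    /** Slope of leaky ReLU: 1 for positive inputs, the leak otherwise. */
    static double leakyReluSlope(double z, double leak) {
        return z > 0 ? 1.0 : leak;
    }

    public static void main(String[] args) {
        double leak = 0.01; // hypothetical leak value
        for (double z : new double[] {-100.0, -1.0, 0.5, 100.0}) {
            System.out.printf("z=%7.1f  relu=%6.2f (slope %.2f)  leaky=%6.2f (slope %.2f)%n",
                    z, relu(z), reluSlope(z), leakyRelu(z, leak), leakyReluSlope(z, leak));
        }
    }
}
```

Unlike the sigmoid and tanh slopes, the ReLU slope stays at 1 no matter how large a positive input gets, and the leaky variant additionally keeps a small nonzero slope on the negative side.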

Hands-On Java Deep Learning for Computer Vision

By : Klevis Ramo
