Hands-On Java Deep Learning for Computer Vision

Through the course of this chapter, we have learned several optimization techniques. These, in combination with several other parameters, can help speed up the learning time of our neural network.

In this section of the chapter, we are going to look at a range of parameters, focus on the parameters that are most likely to produce good results when changed, and learn how to tune these parameters to obtain the best possible outcome.

Here are the parameters that we have looked at so far:

Data input normalization: Data input normalization is more of a preprocessing technique than a parameter. We mention it on this list because it is, in a manner of speaking, a mandatory step. The second reason it belongs in this list is that it is essential for batch normalization. Batch normalization normalizes not only the input, but also the hidden layer inputs and the Z-values, as we have observed in the previous sections. This lets the neural network learn how to normalize the hidden layer inputs according to the best fit. Fortunately, we do not need to worry about the gamma and beta parameters, as the network learns these values automatically.
Learning rate: The one parameter that always needs attention is the learning rate. As stated in the last section of this chapter, the learning rate defines how quickly our neural network will learn, and it usually takes values such as 0.1, 0.01, 0.001, 0.00001, and 0.000001. We also saw how a neural network organizes its data into matrices for greater performance, because matrix operations offer a high level of parallelism.
Mini-batch size and the number of epochs: The mini-batch size is the number of inputs fed to the neural network before the weights are updated, that is, before each step toward the minimum. The mini-batch size, therefore, directly affects the level of parallelism. The batch size depends on the hardware used and is defined in terms of the number of CPU cores or GPU units. The batch size for a CPU core could be 4, 8, 16, or maybe 32, depending on the hardware. For a GPU, this value is much greater, such as 256, 512, or even 1,024, depending on the model of the graphics card.
The number of neurons in the hidden layer: This value increases the number of weights and the weight combinations, enabling us to create and learn complex models, which in turn helps us solve complex problems. This parameter sits so far down the list because, most of the time, the number can be taken from the literature and well-known architectures, so we don't have to tune it ourselves. There may be a few rare cases where we would need to change this value based on our own needs. This number could vary from hundreds to thousands; some deep networks have 9,000 neurons in the hidden layer.
The number of hidden layers: Increasing the number of hidden layers would lead to a dramatic increase in the number of weights, since it would actually define how deep the neural network is. The number of hidden layers can vary from 2 to 22 to 152, where 2 would be the simplest network and 152 hidden layers would be a really deep neural network. During the course of this book, we will take a look at creating a deep neural network using transfer learning.
Learning rate decay: Learning rate decay is a technique to lower the learning rate as we train our neural network for longer periods of time. We want this because, when we use mini-batch gradient descent, we do not go straight to the minimum value. Each update considers only a subset of the examples rather than the whole training set, so the path toward the minimum oscillates. To lower the learning rate, we use a simple formula:

Observe how, as the epoch number increases, the decay factor becomes smaller than 1, so multiplying it by the initial learning rate gives a progressively smaller learning rate. The role of the decay rate in this formula is simply to accelerate the reduction of alpha as the epoch number increases.

Momentum parameter: The momentum parameter lies in the range of 0.8 to 0.95. This parameter rarely needs to be tuned.
ADAM beta1, beta2, and epsilon: ADAM almost never needs tuning.

This list is ordered by impact: the parameters nearer the top have a greater effect on the outcome.
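To make the learning rate decay item more concrete, here is a minimal sketch. It assumes the commonly used form alpha = alpha0 / (1 + decayRate * epochNumber), which matches the description in the list but is not confirmed by the text; the class name, method name, and the values of alpha0 and decayRate are illustrative only.

```java
// Sketch only: assumes the common decay form
//   alpha = alpha0 / (1 + decayRate * epochNumber)
// The names and the values of alpha0 and decayRate are illustrative assumptions.
class LearningRateDecay {

    /** Returns the decayed learning rate for the given epoch number. */
    static double decayedLearningRate(double alpha0, double decayRate, int epochNumber) {
        return alpha0 / (1.0 + decayRate * epochNumber);
    }

    public static void main(String[] args) {
        double alpha0 = 0.1;     // initial learning rate (assumed value)
        double decayRate = 1.0;  // larger values shrink alpha faster
        for (int epoch = 0; epoch <= 4; epoch++) {
            System.out.printf("epoch %d -> alpha = %.4f%n",
                    epoch, decayedLearningRate(alpha0, decayRate, epoch));
        }
    }
}
```

With these assumed values, the learning rate starts at alpha0 and shrinks with every epoch; raising decayRate makes it shrink faster.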

One important thing when choosing parameter values is to pick the scale carefully. Consider an example where we have to set the number of neurons in the hidden layers and, by intuition, this number lies between 100 and 200. The reasonable thing to do is to pick a number uniformly at random from this range of values. Unfortunately, this does not work for all the parameters.

To decide the learning rate, let us begin by assuming that the best value will likely be in the range of 0.0001 to 1. If we pick values uniformly on this scale, as in the image below, 90% of our resources go to choosing values between 0.1 and 1. This does not sound right, since only 10% go to finding values in the remaining three ranges: 0.0001 to 0.001, 0.001 to 0.01, and 0.01 to 0.1. Since we do not have any preference, the best value is equally likely to be found in any of these four ranges:

It would make more sense to divide the range into four equal segments on a logarithmic scale and look for our value uniformly and randomly across them. One way to do that efficiently is to look for random values in the range of -4 to 0, using the following code:

After this, we can return to the original scale by raising 10 to the power of whatever value this function produces. Calling the same line of code four times, once for each segment, will work just fine:
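The following is a minimal sketch of this idea, not the original listing. It draws a random exponent in the range of -4 to 0 and raises 10 to that power, then compares how many samples land between 0.1 and 1 under linear and logarithmic sampling. The class and method names, the seed, and the sample count are assumptions made for illustration.

```java
import java.util.Random;

// Sketch only: names, seed, and sample count are illustrative assumptions.
class LogScaleSampling {

    /** Samples uniformly on a linear scale between min and max. */
    static double sampleLinear(Random random, double min, double max) {
        return min + (max - min) * random.nextDouble();
    }

    /** Samples uniformly on a log10 scale between 10^minExp and 10^maxExp. */
    static double sampleLogUniform(Random random, double minExp, double maxExp) {
        double exponent = minExp + (maxExp - minExp) * random.nextDouble();
        return Math.pow(10.0, exponent); // back to the original scale
    }

    /** Fraction of n samples that fall at or above 0.1. */
    static double fractionAtLeastOneTenth(Random random, int n, boolean logScale) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            double value = logScale
                    ? sampleLogUniform(random, -4.0, 0.0)
                    : sampleLinear(random, 0.0001, 1.0);
            if (value >= 0.1) {
                count++;
            }
        }
        return (double) count / n;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        System.out.println("linear scale: " + fractionAtLeastOneTenth(random, 100_000, false));
        System.out.println("log scale:    " + fractionAtLeastOneTenth(random, 100_000, true));
    }
}
```

On the linear scale, roughly nine samples out of ten should fall between 0.1 and 1, while on the logarithmic scale each of the four segments should receive about a quarter of the samples.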

Let us begin exploring the process of selecting the parameters. We have several parameters to tune, so the process may look like this random grid here:

For one random value of alpha, we can try different beta values and vice versa. In practice, we have more than two parameters to tune. Look at the following block:

You can pick one random alpha, try several beta values, and then, for each of these beta values, you try varying the number of neurons in the hidden layers. This process can be adopted for an even greater number of parameters, such as four, five, and so on.
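The nested process just described can be sketched as follows. This is only an illustration: it generates the candidate combinations and leaves out training and evaluation entirely. The class names are hypothetical, and the ranges are assumptions borrowed from earlier in this section (alpha on a log scale from 0.0001 to 1, beta between 0.8 and 0.95, and 100 to 200 hidden neurons).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch only: class names and ranges are illustrative assumptions.
class RandomParameterSearch {

    /** One combination of parameters to train and evaluate. */
    static final class Candidate {
        final double alpha;
        final double beta;
        final int hiddenNeurons;

        Candidate(double alpha, double beta, int hiddenNeurons) {
            this.alpha = alpha;
            this.beta = beta;
            this.hiddenNeurons = hiddenNeurons;
        }
    }

    /** For each random alpha, tries several betas; for each beta, several layer sizes. */
    static List<Candidate> generate(Random random, int alphas, int betasPerAlpha, int sizesPerBeta) {
        List<Candidate> candidates = new ArrayList<>();
        for (int a = 0; a < alphas; a++) {
            // Log scale: random exponent between -4 and 0
            double alpha = Math.pow(10.0, -4.0 * random.nextDouble());
            for (int b = 0; b < betasPerAlpha; b++) {
                // Linear scale between 0.8 and 0.95
                double beta = 0.8 + 0.15 * random.nextDouble();
                for (int s = 0; s < sizesPerBeta; s++) {
                    // Uniform integer between 100 and 200, inclusive
                    int hiddenNeurons = 100 + random.nextInt(101);
                    candidates.add(new Candidate(alpha, beta, hiddenNeurons));
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        for (Candidate c : generate(new Random(7), 2, 2, 2)) {
            System.out.printf("alpha=%.5f beta=%.3f hiddenNeurons=%d%n",
                    c.alpha, c.beta, c.hiddenNeurons);
        }
    }
}
```

Each candidate would then be used to train the network, and the combination with the best validation result would be kept.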

The other thing that can help is a more varied version of the original process:

During the fine-tuning of the parameters, we may observe that the highlighted group of values produces better output than the rest. We then look at this region more closely:

We can continue to do this until we have the required results.
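This repeated narrowing can be sketched as a coarse-to-fine search over the learning rate exponent. The sketch below is only illustrative: toyScore is a hypothetical stand-in for training the network and measuring its result, with its best value placed arbitrarily at an exponent of -2.5, and the rule that halves the range after each round is an assumption rather than something stated in the text.

```java
import java.util.Random;

// Sketch only: toyScore and the halving rule are illustrative assumptions.
class CoarseToFineSearch {

    /** Hypothetical stand-in for "train the network and measure the result". */
    static double toyScore(double exponent) {
        double distance = exponent + 2.5; // best score at exponent -2.5 (arbitrary)
        return -distance * distance;
    }

    /** Samples the range, keeps the best exponent, then narrows the range around it. */
    static double search(Random random, double lo, double hi, int rounds, int samplesPerRound) {
        double bestExponent = lo;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int round = 0; round < rounds; round++) {
            for (int i = 0; i < samplesPerRound; i++) {
                double exponent = lo + (hi - lo) * random.nextDouble();
                double score = toyScore(exponent);
                if (score > bestScore) {
                    bestScore = score;
                    bestExponent = exponent;
                }
            }
            // Zoom in: halve the range width, centered on the best value so far
            double halfWidth = (hi - lo) / 4.0;
            lo = Math.max(lo, bestExponent - halfWidth);
            hi = Math.min(hi, bestExponent + halfWidth);
        }
        return bestExponent;
    }

    public static void main(String[] args) {
        double bestExponent = search(new Random(1), -4.0, 0.0, 3, 50);
        System.out.println("best learning rate found: " + Math.pow(10.0, bestExponent));
    }
}
```

In a real setting, toyScore would be replaced by an actual training and validation run, which is why the number of samples per round has to stay small.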

By : Klevis Ramo
