Gated Recurrent Units (GRUs)
GRUs help the network remember long-term dependencies in an explicit manner. This is achieved by introducing additional gating variables into the structure of a simple RNN.
So, what will help us get rid of the vanishing gradient problem? Intuitively speaking, if we allow the network to carry over most of the knowledge from the activations of the previous timesteps, then the error can be backpropagated more faithfully than in the simple RNN case. If you are familiar with residual networks for image classification, you will recognize this mechanism as being similar to a skip connection. Allowing the gradient to backpropagate without vanishing enables the network to learn more uniformly across timesteps and, hence, mitigates the issue of gradient instability:
Figure 6.6: The full GRU structure
The different signs in the preceding diagram are as follows:
Figure 6.7: The meanings of the different signs in the GRU diagram
Note
The Hadamard product refers to the element-wise multiplication of two vectors or matrices of the same dimensions.
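To make the gating mechanism concrete, the following is a minimal NumPy sketch of a single GRU step based on the standard GRU equations. The function name gru_step and the weight names (W_z, U_z, and so on) are illustrative assumptions, not code from this book, and note that some sources swap the roles of z_t and 1 - z_t in the final blend.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev,
             W_z, U_z, b_z,
             W_r, U_r, b_r,
             W_h, U_h, b_h):
    """Compute the GRU hidden state h_t for a single timestep."""
    # Update gate: how much of the new candidate to mix in.
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)
    # Reset gate: how much of the previous state to use in the candidate.
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)
    # Candidate state; r_t * h_prev is a Hadamard (element-wise) product.
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)
    # Blend old state and candidate. When z_t is close to 0, h_prev is
    # passed through almost unchanged, the skip-connection-like behaviour
    # described above that lets gradients flow across many timesteps.
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde
    return h_t

# Usage example with small random weights (hypothetical sizes).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
x_t = rng.standard_normal(n_in)
h_prev = np.zeros(n_hid)
W = lambda r, c: 0.1 * rng.standard_normal((r, c))
h_t = gru_step(x_t, h_prev,
               W(n_hid, n_in), W(n_hid, n_hid), np.zeros(n_hid),
               W(n_hid, n_in), W(n_hid, n_hid), np.zeros(n_hid),
               W(n_hid, n_in), W(n_hid, n_hid), np.zeros(n_hid))
print(h_t.shape)  # (3,)

In practice, you would use a framework's built-in GRU layer rather than hand-rolling the cell; the sketch is only meant to show how the update and reset gates control what is kept from the previous hidden state.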