Batch Normalization
It is common to normalize the input layer to speed up learning and to improve performance by rescaling all the features to the same scale. So, the question is: if the model benefits from normalizing the input layer, why not also normalize the outputs of the hidden layers to improve training speed even further?
Batch normalization, as its name suggests, normalizes the outputs of the hidden layers so that the distribution of activations each layer receives stays more stable as training progresses, reducing what is known as internal covariate shift. Reducing this shift is useful because it also helps the model work well on images that follow a somewhat different distribution than the images used to train it.
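To make the operation concrete, here is a minimal sketch of the batch normalization computation in NumPy. It assumes a mini-batch of hidden-layer activations; the learnable scale (gamma), shift (beta), and the small constant eps are standard parts of the technique, but the function name and the shapes used here are illustrative.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of activations, then scale and shift.

    x     : array of shape (batch_size, num_units), the hidden-layer outputs
    gamma : learnable scale, shape (num_units,)
    beta  : learnable shift, shape (num_units,)
    """
    mean = x.mean(axis=0)                     # per-unit mean over the batch
    var = x.var(axis=0)                       # per-unit variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # restore expressiveness via scale and shift

# Example: normalize the outputs of a hidden layer for a batch of 4 samples
activations = np.random.randn(4, 3) * 5 + 10  # activations with an arbitrary scale
gamma = np.ones(3)
beta = np.zeros(3)
normalized = batch_norm(activations, gamma, beta)
print(normalized.mean(axis=0))  # close to 0
print(normalized.std(axis=0))   # close to 1
```

The learnable gamma and beta parameters are part of the design for a reason: they let the network scale and shift the normalized activations, so the normalization does not limit what each layer can represent.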
Take, for instance, a network whose purpose is to detect whether an animal is a cat. If the network is trained only on images of black cats, batch normalization can help it also classify new images of cats of different colors, because normalizing the activations keeps their distribution consistent even when the distribution of the input images changes.