Another well-known optimization for CNNs is batch normalization. This technique normalizes the inputs to a layer over the current mini-batch before passing them on, so that the mean activation per batch stays close to zero and the standard deviation close to one, which reduces internal covariate shift. As a result, the network becomes less sensitive to shifts in the input distribution from batch to batch, and the model generalizes better and trains faster.
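The normalization step itself is simple to illustrate. Below is a minimal NumPy sketch of the per-batch forward computation; the learnable scale (`gamma`) and shift (`beta`) parameters are shown with default values for clarity, and the function name and epsilon value are illustrative assumptions rather than any particular framework's API:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch dimension (axis 0),
    # then apply the learnable scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A batch of 128 activation vectors with mean 5 and std 3
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 16))
y = batch_norm(x)

print(y.mean(axis=0).round(6))  # per-feature means, close to 0
print(y.std(axis=0).round(3))   # per-feature stds, close to 1
```

In a real network, `gamma` and `beta` are learned per feature during training, and running averages of the batch statistics are kept so the same transformation can be applied at inference time, when batch statistics are unavailable.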
In the following recipe, we'll show you how to apply batch normalization to a network trained on an image dataset with 10 classes (CIFAR-10). First, we train the network architecture without batch normalization to demonstrate the difference in performance.