The spatial transformer layer is an original module that localizes an area of the image, crops it, and resizes it, helping the classifier focus on the relevant part of the image and increasing its accuracy. The layer consists of a differentiable affine transformation, whose parameters are computed by a second model, the localization network, and can be learned via backpropagation as usual.
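The core of the mechanism can be sketched in plain NumPy: the affine parameters define a sampling grid over the input, and bilinear interpolation reads the input at those grid points, which is what makes the whole operation differentiable with respect to the parameters. This is an illustrative sketch, not the Lasagne implementation; the function names `affine_grid` and `bilinear_sample` are ours.

```python
import numpy as np

def affine_grid(theta, H, W):
    # For each output pixel, compute the (x, y) source coordinates given
    # by the 2x3 affine matrix theta, in normalized [-1, 1] coordinates.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    src = theta @ coords                                         # (2, H*W)
    return src[0].reshape(H, W), src[1].reshape(H, W)

def bilinear_sample(img, x_s, y_s):
    # Bilinear interpolation of img at the source coordinates, mapped
    # from [-1, 1] back to pixel indices; smooth, hence backprop-friendly.
    H, W = img.shape
    x = (x_s + 1) * (W - 1) / 2
    y = (y_s + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy)
            + img[y0, x1] * wx * (1 - wy)
            + img[y1, x0] * (1 - wx) * wy
            + img[y1, x1] * wx * wy)

# With the identity transform, sampling reproduces the input exactly.
img = np.arange(16, dtype=float).reshape(4, 4)
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
x_s, y_s = affine_grid(theta, 4, 4)
out = bilinear_sample(img, x_s, y_s)
```

In a spatial transformer, `theta` is not fixed as here: it is the output of the localization network, and a scaling-plus-translation `theta` implements the crop-and-resize described above.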
As an example application, multiple digits in an image can be read with the help of recurrent neural units. To simplify our work, we introduced the Lasagne library.
Spatial transformers are one solution among many for localization; region-based approaches, such as YOLO, SSD, or Faster R-CNN, provide state-of-the-art results for bounding box prediction.
In the next chapter, we'll continue with image recognition and discover how to classify full-size images that contain much more information than digits, such as natural images of indoor scenes and outdoor landscapes.