In semantic segmentation, the goal is to label each individual pixel of an image according to what object class that pixel belongs to. The final result is a bitmap where each pixel will belong to a certain class:
There are several popular CNN architectures that have been shown to do well at the segmentation task. Most of them are variants of a class of model called an autoencoder, which we will look at in detail in Chapter 6, Autoencoders, Variational Autoencoders, and Generative Models. For now, their basic idea is to first spatially reduce the input volume to some compressed representation and then recover the original spatial size:
In order to increase the spatial size, there are some common operations that are used, which include the following:
- Max Unpooling
- Deconvolution/Transposed Convolution
- Dilated/Atrous Convolution
There's also a new variant of softmax that is used in the semantic segmentation task that we will learn about, which is called spatial softmax.
In this...