Deep convolutional Q-learning
In the chapter on deep Q-learning (Chapter 9, Going Pro with Artificial Brains – Deep Q-Learning), our inputs were vectors of encoded values describing the states of the environment. When working with images or videos, such encoded vectors are poor inputs for describing a state (the input frame), because they don't preserve the spatial structure of the image. That spatial structure matters: it carries the information we need to predict the next state, and predicting the next state is essential for our AI to learn the correct next move.
Therefore, we need to preserve the spatial structure. To do that, our inputs must be 3D arrays (two dimensions for the grid of pixels, plus one additional dimension for the color channels, as illustrated at the beginning of this chapter). For example, if we train an AI to play a video game, the inputs are simply the images of the screen itself, exactly what a human sees when playing...
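To make this concrete, here is a minimal NumPy sketch of why a convolution preserves spatial structure where a flattened encoded vector does not. The frame size (64x64 RGB) and the number of filters (8) are illustrative assumptions, not values from the text; a real deep convolutional Q-network would use a deep learning library, but the idea is the same: the state is the raw screen image, shaped (height, width, channels), and the convolution's output is still a 2D map of features rather than an unordered vector.

```python
import numpy as np

# Illustrative assumption: a 64x64 RGB screen capture is the state.
# Shape (height, width, channels) -- the 3D input described in the text.
frame = np.random.rand(64, 64, 3)

def conv2d(image, kernels, stride=2):
    """Minimal valid convolution: slide each 3x3x3 kernel over the image.

    The output keeps a 2D spatial layout (one feature map per kernel),
    unlike an encoded vector, which throws that layout away.
    """
    kh, kw, _, n_filters = kernels.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw, :]
            for f in range(n_filters):
                out[i, j, f] = np.sum(patch * kernels[:, :, :, f])
    return out

# Illustrative assumption: eight 3x3 filters spanning the 3 color channels.
kernels = np.random.rand(3, 3, 3, 8)
features = conv2d(frame, kernels)
print(features.shape)  # (31, 31, 8): still a spatial grid, now of features
```

In a full deep convolutional Q-network, several such layers would be stacked, and only the final feature maps would be flattened into a vector and fed to dense layers that output one Q-value per possible move.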