Meta-reinforcement learning with recurrent policies
In this section, we cover one of the more intuitive approaches in meta-reinforcement learning, known as the RL2 algorithm, which uses a recurrent neural network to maintain a memory of the agent's past interactions. Let's start with an example to motivate this approach.
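To make the idea concrete, here is a minimal sketch, in NumPy, of a single step of an RL2-style recurrent policy. All names, dimensions, and the plain (Elman) RNN cell are illustrative assumptions; in practice a GRU or LSTM is typically used. The key point is that the network's input at each step concatenates the observation with the previous action and reward, and the hidden state is carried across episodes so it can serve as the agent's memory of the task:

```python
import numpy as np

# Hypothetical dimensions for illustration only.
OBS_DIM, N_ACTIONS, HIDDEN = 4, 4, 16
IN_DIM = OBS_DIM + N_ACTIONS + 1  # obs + one-hot prev action + prev reward

rng = np.random.default_rng(0)

# Randomly initialized weights for a plain RNN cell, standing in for the
# GRU/LSTM usually used in practice.
W_in = rng.normal(scale=0.1, size=(HIDDEN, IN_DIM))
W_h = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W_out = rng.normal(scale=0.1, size=(N_ACTIONS, HIDDEN))

def policy_step(obs, prev_action, prev_reward, h):
    """One recurrent policy step: returns action logits and new hidden state."""
    a_onehot = np.eye(N_ACTIONS)[prev_action]
    # The augmented input (obs, prev action, prev reward) is what lets the
    # hidden state accumulate information about the current task.
    x = np.concatenate([obs, a_onehot, [prev_reward]])
    h_new = np.tanh(W_in @ x + W_h @ h)
    logits = W_out @ h_new
    return logits, h_new

# The hidden state is initialized once per task, NOT reset between episodes.
h = np.zeros(HIDDEN)
obs = rng.normal(size=OBS_DIM)
logits, h = policy_step(obs, prev_action=0, prev_reward=0.0, h=h)
```

Because the hidden state persists across episodes of the same task, the rewards observed during early, exploratory episodes can shape the behavior in later episodes, which is precisely the adaptation we want.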
Grid world example
Consider a grid world where the agent's task is to reach a goal state, G, from a start state, S. These states are placed randomly for each task, so the agent has to learn to explore the world to discover where the goal state is, at which point it receives a large reward. When the same task is repeated, the agent is expected to reach the goal state quickly, that is, to adapt to the environment, since a penalty is incurred for each time step. This is illustrated in Figure 12.1.
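The task described above can be sketched as a toy environment. The grid size, the +1 goal reward, and the -0.1 per-step penalty are illustrative assumptions, not values from the text; what matters is that S and G are sampled per task and stay fixed when the task is repeated:

```python
import random

class GridWorldTask:
    """A toy grid world: random S and G per task, step penalty, goal reward.

    Reward values (+1.0 at goal, -0.1 per step) are hypothetical choices
    for illustration.
    """

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=5, seed=None):
        self.size = size
        rng = random.Random(seed)
        cells = [(r, c) for r in range(size) for c in range(size)]
        # S and G are sampled randomly once per task.
        self.start, self.goal = rng.sample(cells, 2)
        self.pos = self.start

    def reset(self):
        """Start a new episode of the SAME task: S and G stay fixed."""
        self.pos = self.start
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        # Clamp the move to the grid boundaries.
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        done = self.pos == self.goal
        reward = 1.0 if done else -0.1  # goal reward vs. per-step penalty
        return self.pos, reward, done

env = GridWorldTask(size=5, seed=42)
obs = env.reset()
obs, reward, done = env.step(3)  # try moving right
```

Running many episodes of one `GridWorldTask` instance corresponds to repeating the same task: an adaptive agent should spend fewer steps, and thus pay less penalty, in later episodes.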