Improving Policy Gradients
In this section, we will look at approaches that improve on the policy gradient method from the previous section. We will learn about techniques such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), and we will also cover the Advantage Actor-Critic (A2C) technique in brief. Let's start with the TRPO optimization technique in the next section.
Trust Region Policy Optimization
In most cases, RL training is very sensitive to hyperparameter choices such as the weight initialization and the learning rate. If the learning rate is too high, a single policy update may move the policy network into a region of parameter space where the next batch of data is collected under a very poor policy. This can cause the network to never recover. We will now look at newer methods that try to eliminate this problem, but before we do, let's have a quick recap of what we have already covered.
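The core idea behind TRPO's fix can be illustrated with a minimal sketch: instead of taking the raw gradient step, scale the step back until the KL divergence between the old and new policy is within a small trust region. This is a simplification for illustration only; real TRPO uses a natural-gradient step direction computed with the Fisher information matrix, and the function names here (`trust_region_step`, `max_kl`) are our own, not from any library.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over action logits
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL divergence KL(p || q) between two discrete distributions
    return float(np.sum(p * np.log(p / q)))

def trust_region_step(theta, grad, lr=1.0, max_kl=0.01):
    """Backtracking line search: shrink the proposed step until the
    KL between the old and new policy is at most max_kl.
    (A simplified stand-in for TRPO's constrained update.)"""
    old_pi = softmax(theta)
    step = lr * grad
    for _ in range(20):
        new_pi = softmax(theta + step)
        if kl(old_pi, new_pi) <= max_kl:
            return theta + step
        step *= 0.5  # halve the step and try again
    return theta  # no acceptable step found: keep the old policy

# Toy example: a 3-action softmax policy and an aggressive gradient,
# the kind of step a too-high learning rate might produce.
theta = np.zeros(3)
grad = np.array([10.0, -5.0, -5.0])

naive_theta = theta + 1.0 * grad            # unconstrained update
safe_theta = trust_region_step(theta, grad)  # trust-region update

print(kl(softmax(theta), softmax(naive_theta)))  # large: policy collapses
print(kl(softmax(theta), softmax(safe_theta)))   # bounded by max_kl
```

The unconstrained step concentrates almost all probability on one action in a single update, while the constrained step keeps the new policy close to the old one, which is exactly the behavior the trust region is meant to enforce.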
In the Policy Gradients section, we defined...