Summary
In this chapter, we learned about policy-based methods, starting with the drawbacks of value-based methods such as Q-learning that motivate the use of policy gradients. We discussed the purposes of policy-based RL methods, along with their trade-offs relative to other RL approaches.
You learned how policy gradients help a model learn in a real-time environment. Next, we learned how to implement DDPG using the actor-critic model, the ReplayBuffer class, and Ornstein–Uhlenbeck noise to handle continuous action spaces. We also learned how policy gradients can be improved using techniques such as TRPO and PPO. Finally, we briefly discussed the A2C method, an advantage-based extension of the actor-critic model.
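As a quick refresher on the exploration mechanism, here is a minimal sketch of the Ornstein–Uhlenbeck noise process used with DDPG; the class name and default parameters below are illustrative rather than the chapter's exact code:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise for
    exploring continuous action spaces, as used with DDPG."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta  # strength of the pull back toward the mean
        self.sigma = sigma  # scale of the random perturbation
        self.dt = dt
        self.reset()

    def reset(self):
        # Restart the process from the mean at the start of each episode
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
        self.state = self.state + dx
        return self.state
```

At each step, the sampled noise is added to the actor's deterministic action, giving smoother, temporally correlated exploration than independent Gaussian noise.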
In this chapter, we also experimented with the Lunar Lander environment in OpenAI Gym, in both its continuous and discrete action-space variants, and implemented the various policy-based RL approaches that we discussed.
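For reference, the following sketch loads both Lunar Lander variants, assuming the classic Gym API and the standard environment IDs LunarLander-v2 and LunarLanderContinuous-v2; the random policy is only a placeholder for the agents trained in the chapter:

```python
import gym

# The discrete and continuous variants of the Lunar Lander task
discrete_env = gym.make("LunarLander-v2")               # Discrete(4) actions
continuous_env = gym.make("LunarLanderContinuous-v2")   # Box(2,) actions

for env in (discrete_env, continuous_env):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()  # placeholder for a learned policy
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print(env.spec.id, "random-policy return:", total_reward)
```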
In the next chapter...