In Chapter 9, Policy Gradients – An Alternative, we started to investigate policy-based methods, an alternative to the familiar family of value-based methods. In particular, we focused on the method called REINFORCE and its modification that uses the discounted reward to obtain the gradient of the policy (which gives us the direction in which to improve the policy). Both methods worked well for the small CartPole problem, but for the more complicated Pong environment, convergence was painfully slow.
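As a brief reminder of what "discounted reward" means in this context, here is a minimal sketch of how the discounted return for every step of an episode could be computed. The function name `discounted_returns` and the default `gamma` value are illustrative assumptions, not code taken from Chapter 9.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return the discounted sum of future rewards for every step of one episode.

    Hypothetical helper for illustration: rewards[t] is the reward received
    at step t, and gamma is the discount factor.
    """
    returns: List[float] = []
    running = 0.0
    # Walk the episode backwards so each step reuses the return of the next one:
    # G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```

In a REINFORCE-style update, values like these scale the gradient of the log-probability of each action taken, so actions followed by larger returns are made more likely.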
In this chapter, we'll discuss one more extension to the vanilla Policy Gradient (PG) method, which significantly improves its stability and convergence speed. Although the modification is only minor, the resulting method has its own name, Actor-Critic, and it's one of the most powerful methods in deep Reinforcement Learning (RL).