Policy gradient methods on CartPole
Nowadays, almost nobody uses the vanilla policy gradient method in practice, since the much more stable actor-critic method exists. However, I still want to show the policy gradient implementation, as it establishes very important concepts and the metrics used to check the method's performance.
The complete code for the following example is available in
GAMMA = 0.99
LEARNING_RATE = 0.001
ENTROPY_BETA = 0.01
BATCH_SIZE = 8
REWARD_STEPS = 10
Besides the already familiar hyperparameters, we have two new ones: ENTROPY_BETA is the scale of the entropy bonus added to the loss, and REWARD_STEPS specifies how many steps ahead the Bellman equation is unrolled to estimate the discounted total reward of every transition.
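To make these two hyperparameters concrete, here is a minimal sketch (the function names `nstep_discounted_reward` and `entropy_bonus` are illustrative, not from the book's code): the first computes, for each transition, the reward sum unrolled REWARD_STEPS steps ahead with discount GAMMA; the second computes the policy's entropy, which is scaled by ENTROPY_BETA and subtracted from the loss to discourage premature convergence to a deterministic policy.

```python
import math
from typing import List

GAMMA = 0.99
ENTROPY_BETA = 0.01
REWARD_STEPS = 10


def nstep_discounted_reward(rewards: List[float], gamma: float = GAMMA,
                            steps: int = REWARD_STEPS) -> List[float]:
    # For every position t, sum the next `steps` rewards discounted by
    # gamma**k — the Bellman equation unrolled `steps` steps ahead.
    result = []
    for t in range(len(rewards)):
        total = 0.0
        for k, r in enumerate(rewards[t:t + steps]):
            total += (gamma ** k) * r
        result.append(total)
    return result


def entropy_bonus(probs: List[float]) -> float:
    # Entropy of the action distribution: H = -sum(p * log p).
    # In training, the loss gets an extra term -ENTROPY_BETA * H.
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

With gamma=0.5 and steps=2, a reward sequence [1, 1, 1] yields [1.5, 1.5, 1.0], showing how the estimate at each position looks only two transitions ahead.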