In the previous chapters, we discussed the DQN for playing Atari games and the use of the DPG and TRPO algorithms for continuous control tasks. Recall that DQN has the following architecture:
At each timestep t, the agent observes the frame image x_t and selects an action a_t based on the current learned policy. The emulator (the Minecraft environment) executes this action and returns the next frame image x_{t+1} and the corresponding reward r_t. The quadruplet (x_t, a_t, r_t, x_{t+1}) is then stored in the experience memory and serves as a sample for training the Q-network, which is trained by minimizing the empirical loss function via stochastic gradient descent.
Deep reinforcement learning algorithms based on experience replay have achieved unprecedented success in playing Atari games. However, experience replay has notable disadvantages:
- It uses more memory and computation per real interaction
- It requires off-policy learning algorithms that can update from data generated by an older policy
In order...