Asynchronous Advantage Actor-Critic
This chapter covers an extension of the advantage actor-critic (A2C) method that we discussed in detail in Chapter 12, The Actor-Critic Method. The extension adds truly asynchronous environment interaction; its full name is asynchronous advantage actor-critic, normally abbreviated to A3C. It is one of the methods most widely used by reinforcement learning (RL) practitioners.
We will look at two approaches to adding asynchronous behavior to the basic A2C method: data-level and gradient-level parallelism. They differ in their resource requirements and characteristics, which makes each of them suited to different situations.
In this chapter, we will:
- Discuss why it is important for policy gradient methods to gather training data from multiple environments
- Implement two different approaches to A3C
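To make the distinction between data-level and gradient-level parallelism concrete before the details, here is a minimal sketch on a toy regression problem rather than an RL agent. It is a synchronous, single-process illustration only, not an A3C implementation: the function names (`collect_samples`, `grad`, `data_parallel_grad`, `gradient_parallel_grad`) and the toy "environment" are hypothetical and chosen purely for this example. In the first variant, workers hand raw samples to one learner, which computes the gradient on the merged batch. In the second, each worker computes a gradient on its own samples and only the gradients are combined.

```python
import random


def collect_samples(seed, n=4):
    """Hypothetical worker: gather n (x, y) pairs from a toy 'environment'."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + rng.gauss(0.0, 0.1)  # noisy target with true slope 3
        samples.append((x, y))
    return samples


def grad(w, samples):
    """Gradient of the mean squared error of the model y = w * x w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_grad(w, worker_seeds):
    # Data-level: workers send raw samples; a single learner merges them
    # into one batch and computes the gradient once.
    batch = [s for seed in worker_seeds for s in collect_samples(seed)]
    return grad(w, batch)


def gradient_parallel_grad(w, worker_seeds):
    # Gradient-level: every worker computes a gradient on its own samples;
    # only the gradients are sent to the learner, which averages them.
    grads = [grad(w, collect_samples(seed)) for seed in worker_seeds]
    return sum(grads) / len(grads)


w = 0.0
seeds = [1, 2, 3]
g_data = data_parallel_grad(w, seeds)
g_grad = gradient_parallel_grad(w, seeds)
print(g_data, g_grad)
```

In this simplified setting, where every worker contributes the same number of samples and all gradients are taken at the same weights, the two variants produce the same update up to floating-point rounding. What differs is what travels between processes (samples versus gradients) and where the gradient computation happens, which is the source of the different resource requirements mentioned above.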