A3C with data parallelism
The first version of A3C parallelization that we will check (which was outlined in Figure 13.2) has one main process that carries out training and several child processes that communicate with environments and gather experience to train on.
Implementation
For simplicity and efficiency, NN weights broadcasting from the trainer process is not implemented. Instead of explicitly gathering and sending weights to the child processes, the network is shared between all processes using PyTorch's built-in capabilities, which allow us to use the same nn.Module instance, with all its weights, in different processes by calling the share_memory() method on NN creation.
Under the hood, this method has zero overhead for CUDA (as GPU memory is already shared among all the host's processes) and uses shared memory inter-process communication (IPC) in the case of CPU computation. In both cases, the method improves performance, but it limits our example to a single machine using one GPU card for both training and data gathering.
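To make the sharing mechanism concrete, here is a minimal sketch, not the full training loop of this example: the toy Net architecture, the data_func worker, and the queue size are illustrative assumptions, while share_memory() and the torch.multiprocessing calls are the actual PyTorch APIs.

import torch
import torch.nn as nn
import torch.multiprocessing as mp


class Net(nn.Module):
    # Toy network standing in for the real actor-critic model
    def __init__(self, obs_size=4, n_actions=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(obs_size, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, x):
        return self.fc(x)


def data_func(net, queue):
    # The child process sees the very same parameter tensors as the
    # trainer, so no explicit weight broadcasting is required
    obs = torch.randn(1, 4)       # stand-in for an environment observation
    queue.put(net(obs).detach())  # ship "experience" back to the trainer


if __name__ == "__main__":
    mp.set_start_method("spawn")
    net = Net()
    net.share_memory()            # place the weights in shared memory
    queue = mp.Queue(maxsize=8)
    procs = [mp.Process(target=data_func, args=(net, queue))
             for _ in range(2)]
    for p in procs:
        p.start()
    for _ in procs:
        print(queue.get().shape)  # torch.Size([1, 2])
    for p in procs:
        p.join()

The queue plays the role of the experience channel between the children and the trainer; in the real example, the children would push environment transitions rather than raw network outputs.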