Communication bottlenecks in data parallel training
As we mentioned in Chapter 2, Parameter Server and All-Reduce, and Chapter 3, Building a Data Parallel Training and Serving Pipeline, we need to conduct a communication-heavy step, namely model synchronization, after each training iteration.
In this section, we will theoretically analyze the total traffic that must be transferred over the network. Then, we will identify network inefficiencies in widely used communication libraries such as NCCL and Gloo.
Analyzing the communication workloads
Model synchronization mainly involves the following two steps:

- Aggregating the gradients generated by all the workers
- Updating the model weights on all the workers
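The two steps above can be sketched in plain Python. This is a minimal illustration, not an actual NCCL or Gloo call; the function name `synchronize` and the learning rate are hypothetical, and real systems perform the aggregation with collective operations such as all-reduce rather than in a single process:

```python
def synchronize(worker_weights, worker_grads, lr=0.1):
    """Aggregate local gradients g_i from all workers, then apply the
    averaged gradient to every worker's weight replica (hypothetical sketch)."""
    num_workers = len(worker_grads)
    num_params = len(worker_grads[0])
    # Step 1: aggregate (here, average) the local gradients from all workers.
    avg_grad = [
        sum(g[j] for g in worker_grads) / num_workers
        for j in range(num_params)
    ]
    # Step 2: update each worker's weights with the same averaged gradient,
    # so all model replicas remain identical after the iteration.
    return [
        [w - lr * g for w, g in zip(weights, avg_grad)]
        for weights in worker_weights
    ]

# Three workers, each holding an identical two-parameter model replica.
weights = [[1.0, 2.0]] * 3
grads = [[0.3, 0.6], [0.6, 0.9], [0.9, 1.2]]  # per-worker local gradients g_i
new_weights = synchronize(weights, grads)
```

Note that every worker must receive the aggregated gradient for every parameter, which is why this step dominates network traffic as model size grows.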
The notation used in this section is as follows:
g_i: The local gradients generated on a single worker...