Improving network throughput with EFA
When training large DL models, you need to break the training task into smaller tasks and distribute them across multiple compute devices. Distributed training includes the following key steps:
- Each device in the training cluster does the following:
    - Reads a unique minibatch from the global data batch
    - Runs the minibatch through the model and computes the loss
    - Computes the gradients needed to minimize the loss
- Each device communicates its gradients to its peers, and the average of all gradients is computed.
- Each device updates the model according to the averaged gradients (a minimal code sketch follows this list).
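To make these steps concrete, the following is a minimal sketch of a single data-parallel training step in PyTorch. It assumes that `torch.distributed` has already been initialized (for example, with `dist.init_process_group`); the `train_step` helper and its arguments are illustrative, not part of any specific library API.

```python
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    """One data-parallel step: forward, backward, gradient averaging, update."""
    world_size = dist.get_world_size()

    # Each device runs its own unique minibatch through the model and computes the loss.
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

    # Compute the local gradients.
    optimizer.zero_grad()
    loss.backward()

    # Communicate gradients to all peers and average them (allreduce).
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    # Update the model using the averaged gradients.
    optimizer.step()
    return loss.item()
```

In practice, frameworks such as PyTorch's DistributedDataParallel overlap this gradient communication with the backward pass, but the explicit loop makes the communication step (the part that network throughput affects) visible.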
To measure the efficiency of distributed training, we can use the scaling factor, which is defined as follows:

scaling factor = nT / (n × T)

Here, T is the throughput of a single device, n is the number of devices in the training cluster, and nT is the overall throughput achieved by your training cluster, so a factor of 1 corresponds to perfect linear scaling. For example, if a single GPU sustains 200 samples per second and an 8-GPU cluster achieves 1,440 samples per second, the scaling factor is 1,440 / (8 × 200) = 0.9, or 90 percent scaling efficiency. While ideal scaling is rarely achievable (meaning adding more resources proportionally...