Notes on Hyperparameters
While some of these hyperparameters already exist in the standard single-node training process, in data parallel training they may have new search dimensions and new correlations. Let's discuss them one by one.
Global batch size
The global batch size refers to how many training samples are loaded across all the GPUs for training simultaneously. Its counterpart in single-node training is the batch size, or mini-batch size.
Selecting the proper global batch size differs from selecting a single node's batch size. In single-node training, we always set the batch size to the maximum that fits in the accelerator's memory without causing out-of-memory (OOM) errors. In data parallel training with N GPUs, we may not simply set the global batch size to N * Max(single_node), where Max(single_node) refers to the maximum batch size that fits on a single GPU.
In data parallel training, this global batch size is the first hyperparameter we need to search or fine-tune. If the global batch size is too large, the model may not converge; if it is too small, we simply waste distributed computational resources.
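To make the relationship concrete, a minimal sketch of how a chosen global batch size maps onto per-GPU batch sizes. The specific numbers (256 samples, 8 GPUs) are illustrative assumptions, not recommendations:

```python
# Sketch: splitting a global batch evenly across N GPUs in data
# parallel training. Each GPU processes global_batch_size / N samples
# per training step.

def per_gpu_batch_size(global_batch_size: int, num_gpus: int) -> int:
    """Return the per-GPU batch size for an evenly split global batch."""
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the GPU count")
    return global_batch_size // num_gpus

# With 8 GPUs and a global batch size of 256, each GPU handles 32
# samples per step:
print(per_gpu_batch_size(256, 8))  # -> 32
```

Note that the converse direction, N * Max(single_node), is only an upper bound on the global batch size; as discussed above, training at that bound may hurt convergence.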
Learning rate adjustment
Rule of Thumb Regarding Learning Rate Adjustment
The rule-of-thumb policy for determining the learning rate in data parallel training is: if we use N GPUs to do the data parallel training together, multiply the single-node learning rate by N.
Recent research literature suggests that, for large-batch data parallel training, we should include a warmup stage at the very beginning of training. Under this warmup policy, we start data parallel training with a relatively small learning rate and gradually increase it over the first several epochs, stopping the increase once we reach a predefined peak learning rate.
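The two ideas above, the N-times scaling rule and linear warmup, can be sketched as a simple schedule. The base learning rate (0.1), GPU count (8), and warmup length (5 epochs) below are illustrative assumptions; the literature varies on all three:

```python
# Sketch: linear-scaling rule plus linear warmup for data parallel
# training. The single-node learning rate is scaled by the number of
# GPUs to obtain the peak, and warmup ramps up to that peak.

def lr_at_epoch(epoch: int, base_lr: float = 0.1,
                num_gpus: int = 8, warmup_epochs: int = 5) -> float:
    peak_lr = base_lr * num_gpus  # rule of thumb: scale by N GPUs
    if epoch < warmup_epochs:
        # ramp linearly from base_lr at epoch 0 toward peak_lr
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    return peak_lr  # hold at the peak after warmup

for epoch in range(7):
    print(epoch, round(lr_at_epoch(epoch), 3))
```

In practice the post-warmup phase usually decays the learning rate (step, cosine, etc.) rather than holding it flat; the sketch only covers the warmup portion discussed here.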
Model synchronization schemes
Now that we have chosen our global batch size and adjusted the learning rate accordingly, the next thing we need to do is select an appropriate model synchronization scheme. We need this because we must initialize a group of processes to run our data parallel training job in a distributed manner, where each process is responsible for handling model synchronization on one machine or one GPU.
Let's take PyTorch as an example. Here, you need to initialize your process group, as follows:
torch.distributed.init_process_group(backend='nccl', init_method='...', world_size=N, timeout=M)
Here, the first parameter (backend='nccl') is the model synchronization backend we need to choose. Right now, deep learning platforms such as PyTorch mainly support three different communication backends: NCCL, Gloo, and MPI.
Among these three, the following are some high-level suggestions on selecting communication schemes:
- For GPU clusters, use NCCL.
- For CPU clusters, use Gloo first. If that doesn't work, try MPI.
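The suggestions above can be folded into a small helper. This is a sketch, not PyTorch's own API: the `choose_backend` function and the `has_cuda` flag are hypothetical names, and the commented-out `init_process_group` call shows where the result would be used (with `N` and the `init_method` value supplied by your own launch setup):

```python
# Sketch: selecting a communication backend per the rules above --
# NCCL for GPU clusters, Gloo as the first choice on CPU clusters.

def choose_backend(has_cuda: bool) -> str:
    """Pick a torch.distributed backend name based on hardware."""
    # On CPU clusters, fall back to 'mpi' if Gloo does not work.
    return "nccl" if has_cuda else "gloo"

# Typical usage inside each worker process (illustrative only):
#
# import torch
# import torch.distributed as dist
# dist.init_process_group(
#     backend=choose_backend(torch.cuda.is_available()),
#     init_method='...',   # e.g. an env:// or tcp:// rendezvous address
#     world_size=N,        # total number of processes in the job
# )
```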
With that, we have discussed the three main communication schemes we can use in data parallel training jobs. Since the nodes we use for model training are GPUs, we usually set NCCL as our default communication backend.