Engineering model parallel training jobs
In model parallelism, a single copy of the model is distributed across two or more training devices to overcome the memory limitations of a single GPU.

A simple approach is to explicitly assign layers of the model to different devices. Forward-pass computations are performed on the GPU holding the first set of layers, the resulting activations are transferred to the GPU holding the next set of layers, and so on. During the backward pass, the handoff between devices happens in reverse order. This scheme is known as naïve model parallelism or vertical model parallelism, because the model is split vertically between devices.

However, naïve model parallelism is inefficient: at any moment only one device is computing, so each GPU sits idle while the others work through their portions of the forward and backward passes. A more efficient way to organize model parallelism is called pipeline parallelism...
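The naïve (vertical) scheme described above can be sketched in plain Python. This is a toy illustration, not a real framework: the `Stage` class and the `"gpu:0"`/`"gpu:1"` device names are assumptions for the example, and the "devices" are simulated so no actual GPUs are needed. The point it shows is the sequential handoff: forward runs stage by stage in order, and the backward handoff visits the stages in reverse.

```python
class Stage:
    """A contiguous group of layers assigned to one (simulated) device."""
    def __init__(self, device, layers):
        self.device = device  # hypothetical device name, e.g. "gpu:0"
        self.layers = layers  # list of callables standing in for layers

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def forward_pass(stages, x, log):
    # Activations are handed off device-to-device, first stage to last.
    for stage in stages:
        log.append(f"forward on {stage.device}")
        x = stage.forward(x)
    return x

def backward_pass(stages, log):
    # During the backward pass, the handoff happens in reverse order.
    for stage in reversed(stages):
        log.append(f"backward on {stage.device}")

stages = [
    Stage("gpu:0", [lambda x: x + 1, lambda x: x * 2]),  # first set of layers
    Stage("gpu:1", [lambda x: x - 3]),                   # second set of layers
]
log = []
out = forward_pass(stages, 3, log)   # (3 + 1) * 2 - 3 = 5
backward_pass(stages, log)
print(out)
print(log)
```

Note that the `log` makes the inefficiency visible: only one device appears at each step, so while `gpu:1` computes, `gpu:0` is idle, and vice versa.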