As documented in numerous academic papers and industry technical reports, vanilla model parallelism is highly inefficient in terms of both GPU computation and memory utilization. To illustrate why, let's look at the simple NLP model shown in Figure 6.1:
Figure 6.1 – A simple NLP model with three layers
As shown in Figure 6.1, we pass the training input into our three-layer NLP model, with the layers denoted as Layer 1, Layer 2, and Layer 3. After forward propagation, the model generates some output.
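To make this concrete, here is a minimal PyTorch sketch of such a three-layer model; the class name, layer types, and dimensions are illustrative placeholders, not taken from the book:

import torch
import torch.nn as nn

# A minimal three-layer model mirroring Figure 6.1.
# The linear layers and hidden size are illustrative assumptions.
class ThreeLayerModel(nn.Module):
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.layer1 = nn.Linear(hidden_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        self.layer3 = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        # Forward propagation: the input flows through
        # Layer 1 -> Layer 2 -> Layer 3 to produce the output.
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        return self.layer3(x)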
Now let's assume we use three GPUs, each holding just one layer of the original model, as shown in Figure 6.2:
Figure 6.2 – Model partition on three GPUs
In Figure 6.2, GPU1 holds Layer 1 of the model, GPU2 holds Layer 2, and GPU3 holds Layer 3.
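A sketch of this partitioning in PyTorch is shown below. It assumes three visible GPUs (cuda:0 through cuda:2); the class name and layer sizes are again illustrative placeholders:

import torch
import torch.nn as nn

# Vanilla model parallelism: each layer lives on its own GPU,
# as in Figure 6.2. Assumes at least three CUDA devices.
class PartitionedModel(nn.Module):
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.layer1 = nn.Linear(hidden_dim, hidden_dim).to('cuda:0')
        self.layer2 = nn.Linear(hidden_dim, hidden_dim).to('cuda:1')
        self.layer3 = nn.Linear(hidden_dim, hidden_dim).to('cuda:2')

    def forward(self, x):
        # Activations are copied between GPUs at each layer boundary.
        # While GPU1 computes Layer 1, GPU2 and GPU3 sit idle; this
        # serialization is the core inefficiency described above.
        x = torch.relu(self.layer1(x.to('cuda:0')))
        x = torch.relu(self.layer2(x.to('cuda:1')))
        return self.layer3(x.to('cuda:2'))

Note that at any given moment only one of the three GPUs is doing useful work, while the other two hold their layer's parameters in memory but stay idle.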
Now, we...