
Data parallelism – the high-level bits

So far, we have discussed the benefits of using data parallelism in machine learning model training, which can tremendously reduce the overall model training time. Now, we need to dive into some fundamental theories about how data parallel training works, such as stochastic gradient descent (SGD) and model synchronization. But before that, let's take a look at the system architecture for data parallel training, and how it is different from single-node training.

The simplified workflow for data parallel training is depicted in the following diagram. We have omitted some technical details during the training phase as we are mainly concerned with the two bandwidths (that is, the data loading bandwidth and the model training bandwidth):

Figure 1.5 – Simplified workflow of data parallel training

As we can see, the main difference between single-node training and data parallel training is that we split the data loading bandwidth between multiple workers/GPUs (shown as blue arrows in the preceding diagram). Therefore, for each GPU involved in the data parallel training job, the difference between its local data loading bandwidth and model training bandwidth is much smaller compared to the single-node case.

At a high level, even though we cannot increase the model training bandwidth on each accelerator due to hardware limitations, we can split and balance the whole data loading bandwidth across multiple accelerators. This data loading bandwidth split is not only applicable to data parallel training; it can also be directly adopted in the data parallel model serving stage.

Note

By decreasing the per-GPU data loading bandwidth, data parallel training mitigates the gap between data loading bandwidth and model training bandwidth on each GPU.
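
To make this concrete, the following is a minimal sketch of how the data loading bandwidth can be split across workers, assuming a PyTorch-based input pipeline; the dataset, batch size, and worker count are illustrative placeholders rather than the book's actual pipeline. Each worker constructs a sampler that only yields its own shard of the data:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for the real (augmented) training data.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

world_size = 4  # number of GPUs/workers in the data parallel job
for rank in range(world_size):  # in practice, each process only knows its own rank
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    print(f"worker {rank} loads {len(sampler)} of {len(dataset)} samples")

Each worker only pulls roughly 1/world_size of the samples per epoch, which is exactly the data loading bandwidth split shown in Figure 1.5.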

At this point, we should understand how data parallel training increases end-to-end throughput by splitting the data loading bandwidth across multiple accelerators. After each GPU receives its local batch of augmented input data, it will conduct local model training and validation. Here, model validation in data parallel training is the same as in the single-node case (there are some small variations, which we will discuss later) and we mainly focus on the difference at the training stage (excluding validation).

As shown in the following diagram, in the case of a single node, we divide the model training stage into three steps: data loading, training, and model updating. As we mentioned in the Single-node training is too slow section, data loading is for loading new mini-batches of training data. Training is done to conduct forward and backward propagations through the model. Once we've generated gradients during backward propagation, we perform the third step; that is, updating the model parameters:

Figure 1.6 – The three steps in the model training stage
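
The following is a minimal sketch of these three steps in code, assuming a PyTorch setup; the model, loss function, and mini-batches here are placeholders used only to mark where each step happens:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # placeholder model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Placeholder loader yielding mini-batches of (inputs, labels).
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(5)]

for inputs, labels in loader:                  # Step 1: data loading
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)      # Step 2: forward propagation
    loss.backward()                            # Step 2: backward propagation (gradients)
    optimizer.step()                           # Step 3: model update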

Compared to the data parallel training stage, as shown in the following diagram, there are several major differences:

  • First, in data parallel training, different accelerators train on different partitions of the input data (for example, Partition 1 and Partition 2 in the following diagram). Consequently, no single GPU can see the full training data, so traditional gradient descent optimization cannot be applied here. Instead, we need a stochastic approximation of gradient descent, which can also be used in the single-node case. One popular stochastic approximation method is SGD. We will look at this in more detail in the next section.
  • Second, in data parallel training, besides the three steps included in single-node training, as shown in the preceding diagram, we have an additional step here called model synchronization, which is shown in the following diagram. Model synchronization is about collecting and aggregating local gradients that have been generated by different nodes. We will learn more about model synchronization later in this book:
Figure 1.7 – Data parallelism procedures within the model training stage

In the next two sections, we will discuss the theoretical details about SGD and model synchronization.

Stochastic gradient descent

In this section, we will discuss why SGD is a must-have for data parallel training and how it works.

In theory, we can use traditional gradient descent (GD) for single-node training. It works as follows:

g_all = 0
for i in dataset:
  g_all += g_i    # accumulate the gradient calculated on each training data point
w = w - a*g_all   # a single model update over the whole dataset

First, we calculate the gradient on each data point of our training dataset, where g_i is the gradient calculated on the i-th training data point. The formal definition of g_i is as follows:

g_i = ∇_w L(w; x_i, y_i)

Here, w denotes the model weights, L is the loss function, and (x_i, y_i) is the i-th training sample.

Then, we sum up the gradients calculated over all the training data points (g_all += g_i) and do a single-step model update with w = w - a*g_all, where a is the learning rate.
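
As a concrete illustration, here is a runnable toy version of traditional GD on a small linear regression problem with a squared loss; the data, loss, and learning rate are illustrative and not tied to any particular model in this book:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 training data points, 3 features
y = X @ np.array([2.0, -1.0, 0.5])          # targets generated from known weights
w = np.zeros(3)                             # model weights
a = 0.001                                   # learning rate

for step in range(100):
    g_all = np.zeros_like(w)
    for x_i, y_i in zip(X, y):              # gradient of the squared loss on the i-th point
        g_i = 2 * (x_i @ w - y_i) * x_i
        g_all += g_i
    w = w - a * g_all                       # one model update per full pass over the data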

However, in data parallel training, each GPU can only see part of (not the full) training dataset, which makes it impossible to use traditional GD optimization since we cannot calculate g_all in this case. Thus, SGD is a must-have. In addition, SGD is also applicable to single-node training. SGD works as follows:

for i in dataset:
  w = w - a*g_i   # update the model weights immediately using the gradient of the i-th sample

Basically, instead of updating the model weights (w) after generating the gradients from all the training data, SGD allows for model weights updates using a single or a few training samples (for example, a mini-batch). With this relaxation of model updating restrictions, the workers in data parallel training can update their model weights using their local (not global) training samples.
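
For contrast, here is the same toy linear regression problem trained with SGD, where the weights are updated immediately after each sample instead of once per full pass; again, the data and learning rate are only illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5])
w = np.zeros(3)
a = 0.01

for epoch in range(10):
    for x_i, y_i in zip(X, y):              # a single training sample
        g_i = 2 * (x_i @ w - y_i) * x_i
        w = w - a * g_i                     # model update per sample, not per full pass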

GD versus SGD

In GD, we need to compute the gradients over all the training data and update the model weights.

In SGD, we compute the gradients over a subset of all the training data and update the model weights.

However, since each worker updates its model weights based on its local training data, the model parameters of different workers can be different after each training iteration. Therefore, we need to conduct model synchronization periodically to guarantee that all the workers are on the same page, meaning that they maintain the same model parameters after each training iteration.

Model synchronization

As we saw previously, in data parallel training, different workers train their local models using disjoint subsets of the total training data, so the trained model weights may be different. To force all the workers to have the same view of the model parameters, we need to conduct model synchronization.

Let's study this in a simple four-GPU setting, as shown in the following diagram:

Figure 1.8 – Model synchronization in a four-GPU setting

As we can see, we have four GPUs in a data parallel training job. Here, each GPU maintains a copy of the full ML model locally inside its on-device memory.

Let's assume that all the GPUs are initialized with the same model parameters, which is standard practice and is achieved by fixing the seed of the random initialization function.
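
A minimal sketch of this initialization trick is shown below, assuming a PyTorch model; build_model is a hypothetical helper used here only for illustration:

import torch
import torch.nn as nn

def build_model(seed: int = 42) -> nn.Module:
    torch.manual_seed(seed)                 # same fixed seed on every GPU/worker
    return nn.Linear(10, 2)                 # placeholder model

# Simulate two workers: both replicas start with identical parameters.
m1, m2 = build_model(), build_model()
assert all(torch.equal(p1, p2) for p1, p2 in zip(m1.parameters(), m2.parameters()))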

After the first training iteration, each GPU will generate its local gradients, g_i, where i refers to the i-th GPU. Given that they are training on different local training inputs, all the gradients from different GPUs may be different. To guarantee that all four GPUs have the same model updates, we need to conduct model synchronization before the model parameter updates.

Model synchronization does two things:

  1. Collects and sums up all the gradients from all the GPUs in use, as shown here: g_all = g_1 + g_2 + g_3 + g_4
  2. Broadcasts the aggregated gradients, g_all, to all the GPUs.

Once the model synchronization steps have been completed, we can get the aggregated gradients, g_all, locally on each GPU. Then, we can use these aggregated gradients, g_all, for the model updates, which guarantees that the updated model parameters remain the same after this first data parallel training iteration.

Similarly, in the following training iterations, we conduct model synchronization after each GPU generates its local gradients. So, model synchronization guarantees that the model parameters remain the same after every training iteration in a particular data parallel training job.
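
The following single-process simulation sketches this behavior with NumPy; the gradients are random placeholders, and a real system would compute them locally and exchange them over the network, but the bookkeeping is the same: sum the local gradients, share the sum, and apply the same update on every replica:

import numpy as np

rng = np.random.default_rng(0)
num_gpus, a = 4, 0.01
replicas = [np.zeros(5) for _ in range(num_gpus)]       # identical initial model copies

for iteration in range(3):
    local_grads = [rng.normal(size=5) for _ in range(num_gpus)]  # g_i on each GPU
    g_all = np.sum(local_grads, axis=0)                  # collect and sum all local gradients
    for i in range(num_gpus):                            # broadcast g_all, then update locally
        replicas[i] = replicas[i] - a * g_all
    assert all(np.allclose(replicas[0], r) for r in replicas)    # replicas stay in sync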

In real system implementations, model synchronization mainly comes in two variations: the parameter server architecture and the All-Reduce architecture, which we will discuss in detail in the next chapter.

So far, we have come across some of the key concepts in data parallel training jobs, such as SGD and model synchronization. Next, we will discuss some important hyperparameters related to data parallel training.