Accelerate Deep Learning Workloads with Amazon SageMaker
In the previous chapter, we discussed how to select the optimal hardware for a Deep Learning (DL) training job and how to optimize your model for the target hardware platform. In this chapter, we will look in depth at how to design efficient distributed training on Amazon SageMaker, given your particular use case and model architecture.
Distributed training aims to address two specific problems. The first is how to reduce the training time of large models by distributing training tasks across multiple compute devices. The second arises when we need to train models so large that they cannot fit into the memory of a single GPU device. This problem is especially relevant for NLP tasks, where it has been shown that very large models have more expressive power and, hence, better performance across a wide range of NLP tasks. For instance, the latest open source SOTA language model, called BLOOM, was trained for ~3.5 months on a compute cluster...
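To make the distinction concrete, the following minimal sketch shows how a SageMaker training job requests multiple GPU instances and enables a distributed training strategy through the estimator's `distribution` argument. The entry-point script, role ARN, and S3 paths are placeholders, and the data-parallel configuration shown here is just one of the options the argument accepts; a model-parallel configuration is supplied through the same mechanism when the model does not fit into a single GPU's memory.

```python
from sagemaker.pytorch import PyTorch

# Assumed names: replace the role ARN, entry-point script, and S3 paths
# with values from your own environment.
estimator = PyTorch(
    entry_point="train.py",              # your training script
    role="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.p4d.24xlarge",     # 8 GPUs per instance
    instance_count=2,                    # scale out across 2 instances (16 GPUs)
    # Enable the SageMaker data-parallel library to reduce training time;
    # a model-parallel setup is configured via this same argument instead
    # when a single GPU cannot hold the model.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"training": "s3://<your-bucket>/training-data"})
```

Increasing `instance_count` is how the first problem (training time) is addressed, while the choice of distribution strategy determines whether the second problem (model size exceeding device memory) can be solved at all.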