In this chapter, we focused on how to engineer large-scale data parallel, model parallel, and hybrid distributed training jobs, and discussed which type of parallelism to choose based on your specific use case and model architecture. We then reviewed several popular approaches to organizing distributed training, such as the Parameter Server and Allreduce algorithms, along with performance considerations for tuning distributed training jobs. Finally, we walked through several examples of distributed training jobs in Amazon SageMaker using popular open source libraries as well as the proprietary SDDP and SMDP libraries. You should now be able to select the correct type of distributed training and technical stack, and to debug and tune training job performance.
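To make the launch mechanics concrete, the following is a minimal sketch of starting a data parallel training job with the SageMaker Python SDK, using the `distribution` parameter to enable the SageMaker data parallelism library. The entry point script, IAM role, S3 path, and instance settings are illustrative placeholders rather than values from this chapter:

```python
# A minimal sketch of launching a data parallel SageMaker training job.
# "train.py", the role ARN, and the S3 input path are illustrative placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",        # your distributed training script (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.p4d.24xlarge",  # the data parallel library supports specific GPU instance types
    instance_count=2,                 # data parallelism scales across multiple instances
    # Enabling this distribution option activates SageMaker's data parallelism library:
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"training": "s3://my-bucket/training-data"})  # placeholder S3 input
```

Switching to model parallelism follows the same pattern: the `distribution` dictionary is the single place where the parallelism strategy is configured, while the training script itself carries the framework-specific logic.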
Running large-scale training jobs requires not only initial engineering effort but also well-established operational management of your training jobs. In many cases, a training job can run for days or weeks...