In this chapter, we focused on the hardware aspects of engineering distributed DL training. We reviewed the available SageMaker compute instances, paying particular attention to instance families with GPU devices. After that, we discussed different DL use cases and how to select optimal compute instances for them. Then, we reviewed the network requirements for distributed training and learned how Amazon Elastic Fabric Adapter (EFA) can help you avoid network bottlenecks when running large-scale training jobs. Finally, we reviewed how models can be optimized to run on GPU devices using SageMaker Training Compiler and gained practical experience in using this feature.
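As a reminder of how the compiler is switched on in practice, the following is a minimal sketch using the SageMaker Python SDK's Hugging Face estimator. The IAM role, script name, S3 path, and hyperparameters are placeholders, and the framework versions are assumed to be one of the combinations supported by SageMaker Training Compiler; adjust them to match your environment.

# Minimal sketch: enabling SageMaker Training Compiler on a Hugging Face training job.
# Role ARN, entry point, S3 path, and hyperparameters below are placeholders.
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="train.py",                                # hypothetical training script
    role="arn:aws:iam::111111111111:role/SageMakerRole",   # placeholder IAM role
    instance_type="ml.p3.2xlarge",                         # GPU instance supported by the compiler
    instance_count=1,
    transformers_version="4.21",                           # assumed supported version combination
    pytorch_version="1.11",
    py_version="py38",
    hyperparameters={"epochs": 3, "train_batch_size": 24},
    compiler_config=TrainingCompilerConfig(),              # turns on SageMaker Training Compiler
)

estimator.fit({"train": "s3://my-bucket/train"})           # placeholder S3 input channel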
In the next chapter, Chapter 6, Engineering Distributed Training, we will continue this discussion of distributed training. We will focus on how to select the most appropriate type of distributed training for your use case, DL framework, and model architecture, and then gain practical experience with it.