Summary
This chapter concludes Part 2 of the book. Across this and the two previous chapters, we covered how to build and optimize large-scale training jobs. First, we reviewed the specialized hardware available for DL training and how to choose optimal instance types. Then, we explored how to engineer distributed training using open source and Amazon proprietary solutions. In this chapter, we focused on efficiently operationalizing model training: we reviewed the issues that can occur during training and how to detect and mitigate them, and we discussed how to manage and optimize hyperparameter tuning.
In Part 3, Serving Deep Learning Models, we will dive deep into DL inference on Amazon SageMaker. We will discuss which hardware options are available for inference and how to engineer your inference server, and then review the operational aspects of model serving. In the next chapter, Chapter 8, Considering Hardware for Inference, we will review the available hardware...