Operationalizing Deep Learning Training
In Chapter 1, Introducing Deep Learning with Amazon SageMaker, we discussed how SageMaker integrates with CloudWatch Logs and Metrics to provide visibility into your training process by collecting training logs and metrics. However, deep learning (DL) training jobs are prone to several kinds of issues specific to model architecture and training configuration. Specialized tools are required to monitor, detect, and react to these issues. Because many training jobs run for hours or days on large numbers of compute instances, the cost of errors is high.
When running DL training jobs, you need to be aware of two types of issues:
- Issues with model and training configuration, which prevent the model from learning efficiently during training. Examples include vanishing and exploding gradients, overfitting and underfitting, and a loss that fails to decrease. The process of finding such errors is known as debugging.
- Suboptimal...