Debugging training jobs
To effectively monitor and debug DL training jobs, we need to have access to the following information:
- Scalar values such as accuracy and loss, which we use to measure the quality of the training process
- Tensor values such as weights, biases, and gradients, which represent the internal state of the model and its optimizers
Both TensorBoard and SageMaker Debugger can collect tensors and scalars, so either can be used to debug the model and the training process. However, unlike TensorBoard, which is primarily a visualization tool, SageMaker Debugger can also react to changes in model state in near-real time. For example, it can stop a training job early if the training loss hasn't decreased for a certain number of steps.
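To make the "loss hasn't decreased" condition concrete, here is a plain-Python sketch of the patience-based logic that such a rule implements. This is an illustration, not the SageMaker Debugger API: the class name, the `patience` and `min_delta` parameters, and the toy loss sequence are all hypothetical.

```python
class LossNotDecreasingRule:
    """Fires when the loss has not improved for `patience` consecutive steps."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience          # how many stalled steps to tolerate
        self.min_delta = min_delta        # minimum change that counts as improvement
        self.best_loss = float("inf")
        self.steps_without_improvement = 0

    def should_stop(self, loss: float) -> bool:
        if loss < self.best_loss - self.min_delta:
            # Loss improved: record it and reset the stall counter.
            self.best_loss = loss
            self.steps_without_improvement = 0
        else:
            self.steps_without_improvement += 1
        return self.steps_without_improvement >= self.patience


# Hypothetical loss curve that plateaus after the third step.
rule = LossNotDecreasingRule(patience=3)
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63]
for step, loss in enumerate(losses):
    if rule.should_stop(loss):
        print(f"Stopping at step {step}: loss stalled for {rule.patience} steps")
        break
```

In SageMaker Debugger, the equivalent check runs as a rule evaluated against the tensors emitted by the training job, so the stop decision can be taken while the job is still running rather than after the fact.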
In this section, we will dive deep into how to use TensorBoard and SageMaker Debugger. We will review the features of both solutions in detail and then gain practical experience with them...