What this book covers
Chapter 1, Introducing Deep Learning with Amazon SageMaker, will introduce Amazon SageMaker: how it simplifies infrastructure and workload management, and what the key principles and main capabilities of this AWS service are. We will then focus on its managed training and hosting infrastructure and its integration with the rest of the AWS services. We will also provide practical guidance on setting up your first DL project on SageMaker and then building, training, and using a simple DL model, with a follow-along implementation so that readers can learn and experiment with the core SageMaker capabilities themselves.
Chapter 2, Deep Learning Frameworks and Containers on SageMaker, will review in detail how SageMaker extensively utilizes Docker containers. We will start by diving into the pre-built containers for popular DL frameworks (TensorFlow, PyTorch, and MXNet). Then, we will consider how to extend pre-built SageMaker containers and how to bring your own (BYO) containers. For the latter case, we will review the technical requirements for training and serving containers in SageMaker.
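As a brief taste of what extending a pre-built container looks like, the following is a minimal sketch of a Dockerfile that layers extra dependencies on top of a SageMaker PyTorch training image (the registry URI, tag, and installed package are illustrative; the actual values depend on your AWS Region and framework version):

```dockerfile
# Illustrative base image: an AWS Deep Learning Container for PyTorch training.
# The registry account, region, and tag must match your own setup.
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.12.0-gpu-py38

# Add any extra dependencies your training script needs
RUN pip install --no-cache-dir transformers

# Tell the SageMaker training toolkit where the entry point lives
COPY train.py /opt/ml/code/train.py
ENV SAGEMAKER_SUBMIT_DIRECTORY=/opt/ml/code
ENV SAGEMAKER_PROGRAM=train.py
```

Once built and pushed to Amazon ECR, such an image can be referenced via the `image_uri` parameter of a SageMaker estimator.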
Chapter 3, Managing SageMaker Development Environment, will discuss how to manage SageMaker resources programmatically using the CLI, SDKs, and CloudFormation. We will cover how to organize an efficient development process using SageMaker Studio and notebooks, as well as how to integrate SageMaker with your favorite IDE. We will also review how to troubleshoot your DL code using SageMaker Local Mode.
Chapter 4, Managing Deep Learning Datasets, will review the various SageMaker capabilities that allow us to organize and manage datasets, and will discuss the storage options available on AWS and their application use cases.
Chapter 5, Considering Hardware for Deep Learning Training, will consider the price-performance characteristics of the instance types most suitable for DL training and cover the scenarios in which to use one instance type or another for optimal performance.
Chapter 6, Engineering Distributed Training, will focus on understanding the common approaches to distributing your training processes and why you may need to do so for DL models. We will provide an overview of both open source distributed training frameworks and the innate SageMaker libraries for distributed training. Readers will then follow along with the code to build a state-of-the-art NLP model using the PyTorch and Hugging Face frameworks, prepare a training script for distributed training on Amazon SageMaker using the SageMaker Data Parallel library, and monitor and further optimize the training job.
Chapter 7, Operationalizing Deep Learning Training, will discuss how to monitor and debug your DL training job using SageMaker Debugger and its Profiler as well as how to optimize for cost using Managed Spot Training, early stopping, and other strategies.
Chapter 8, Considering Hardware for Inference, will review the hardware options available on AWS for DL inference, including CPU, GPU, and specialized accelerator instances, and will provide practical guidance on selecting the most suitable hardware for serving DL models.
Chapter 9, Implementing Model Servers, will focus on the software stack of DL inference servers, specifically on model servers. We will review the model servers provided by the popular TensorFlow and PyTorch solutions, as well as framework-agnostic model servers such as SageMaker Multi Model Server, and discuss when to choose one option over another.
Chapter 10, Operationalizing Inference Workloads, will start by reviewing the key components of SageMaker Managed Hosting, such as real-time endpoints, batch inference jobs, the model registry, and serving containers. Readers will learn how to configure endpoint deployments and batch inference jobs using the Python SDK.