Learn Amazon SageMaker - Second Edition

By: Julien Simon

Overview of this book

Amazon SageMaker enables you to quickly build, train, and deploy machine learning models at scale without managing any infrastructure. It helps you focus on the machine learning problem at hand and deploy high-quality models by eliminating the heavy lifting typically involved in each step of the ML process. This second edition will help data scientists and ML developers to explore new features such as SageMaker Data Wrangler, Pipelines, Clarify, Feature Store, and much more. You'll start by learning how to use various capabilities of SageMaker as a single toolset to solve ML challenges and progress to cover features such as AutoML, built-in algorithms and frameworks, and writing your own code and algorithms to build ML models. The book will then show you how to integrate Amazon SageMaker with popular deep learning libraries, such as TensorFlow and PyTorch, to extend the capabilities of existing models. You'll also see how automating your workflows can help you get to production faster with minimum effort and at a lower cost. Finally, you'll explore SageMaker Debugger and SageMaker Model Monitor to detect quality issues in training and production. By the end of this Amazon book, you'll be able to use Amazon SageMaker on the full spectrum of ML workflows, from experimentation, training, and monitoring to scaling, deployment, and automation.
Table of Contents (19 chapters)
Section 1: Introduction to Amazon SageMaker
Section 2: Building and Training Models
Section 3: Diving Deeper into Training
Section 4: Managing Models in Production

Exploring the capabilities of Amazon SageMaker

Amazon SageMaker was launched at AWS re:Invent 2017. Since then, a lot of new features have been added: you can see the full (and ever-growing) list at https://aws.amazon.com/about-aws/whats-new/machine-learning.

In this section, you'll learn about the main capabilities of Amazon SageMaker and its purpose. Don't worry, we'll dive deep into each of them in later chapters. We will also talk about the SageMaker Application Programming Interfaces (APIs), and the Software Development Kits (SDKs) that implement them.

The main capabilities of Amazon SageMaker

At the core of Amazon SageMaker is the ability to prepare, build, train, optimize, and deploy models on fully managed infrastructure at any scale. This lets you focus on studying and solving the machine learning problem at hand, instead of spending time and resources on building and managing infrastructure. Simply put, you can go from building to training to deploying more quickly. Let's zoom in on each step and highlight relevant SageMaker capabilities.

Preparing

Amazon SageMaker includes powerful tools to label and prepare datasets:

  • Amazon SageMaker Ground Truth: Annotate datasets at any scale. Workflows for popular use cases are built in (image detection, entity extraction, and more), and you can implement your own. Annotation jobs can be distributed to workers that belong to private, third-party, or public workforces.
  • Amazon SageMaker Processing: Run batch jobs for data processing (and other tasks such as model evaluation) using your own code written with scikit-learn or Spark.
  • Amazon SageMaker Data Wrangler: Using a graphical interface, apply hundreds of built-in transforms (or your own) to tabular datasets, and export them in one click to a Jupyter notebook.
  • Amazon SageMaker Feature Store: Store your engineered features offline in Amazon S3 to build datasets, or online to use them at prediction time.
  • Amazon SageMaker Clarify: Using a variety of statistical metrics, analyze potential bias present in your datasets and models, and explain how your models predict.
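
To make "statistical metrics" a little more concrete, here is a minimal sketch of two pre-training bias metrics that the Clarify documentation describes: class imbalance (CI) and difference in proportions of labels (DPL). It computes them by hand in plain Python. The function names and the toy dataset are hypothetical, chosen for illustration only, and this is not how Clarify itself is invoked.

```python
def class_imbalance(facets, advantaged):
    """CI = (n_a - n_d) / (n_a + n_d), ranging from -1 to 1.

    n_a counts samples in the advantaged facet, n_d all the others.
    """
    n_a = sum(1 for f in facets if f == advantaged)
    n_d = len(facets) - n_a
    if n_a + n_d == 0:
        raise ValueError("empty dataset")
    return (n_a - n_d) / (n_a + n_d)


def difference_in_label_proportions(facets, labels, advantaged, positive=1):
    """DPL = q_a - q_d, where q is the share of positive labels in a facet."""
    a_labels = [l for f, l in zip(facets, labels) if f == advantaged]
    d_labels = [l for f, l in zip(facets, labels) if f != advantaged]
    if not a_labels or not d_labels:
        raise ValueError("both facets need at least one sample")
    q_a = sum(1 for l in a_labels if l == positive) / len(a_labels)
    q_d = sum(1 for l in d_labels if l == positive) / len(d_labels)
    return q_a - q_d


# Toy dataset (hypothetical): 6 samples in facet "a", 4 in facet "d"
facets = ["a"] * 6 + ["d"] * 4
labels = [1, 1, 1, 0, 0, 0] + [1, 0, 0, 0]

ci = class_imbalance(facets, advantaged="a")
dpl = difference_in_label_proportions(facets, labels, advantaged="a")
```

Values close to zero suggest a balanced dataset for that facet. Larger absolute values are a hint to look more closely, not a verdict.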

Building

Amazon SageMaker provides you with two development environments:

  • Notebook instances: Fully managed Amazon EC2 instances that come preinstalled with the most popular tools and libraries: Jupyter, Anaconda, and so on.
  • Amazon SageMaker Studio: An end-to-end integrated development environment for machine learning projects, providing an intuitive graphical interface for many SageMaker capabilities. Studio is now the preferred way to run notebooks, and we recommend that you use it instead of notebook instances.

When it comes to experimenting with algorithms, you can choose from the following:

  • A collection of 17 built-in algorithms for machine learning and deep learning, already implemented and optimized to run efficiently on AWS. No ML code to write!
  • A collection of built-in, open source frameworks (TensorFlow, PyTorch, Apache MXNet, scikit-learn, and more), where you simply bring your own code.
  • Your own code running in your own container: custom Python, R, C++, Java, and so on.
  • Algorithms and pre-trained models from AWS Marketplace for machine learning (https://aws.amazon.com/marketplace/solutions/machine-learning).
  • Machine learning solutions and state-of-the-art models available in one click in Amazon SageMaker JumpStart.

In addition, Amazon SageMaker Autopilot uses AutoML to automatically build, train, and optimize models without the need to write a single line of ML code.

Training

As mentioned earlier, Amazon SageMaker takes care of provisioning and managing your training infrastructure. You'll never spend any time managing servers, and you'll be able to focus on machine learning instead. On top of this, SageMaker brings advanced capabilities such as the following:

  • Managed storage using either Amazon S3, Amazon EFS, or Amazon FSx for Lustre depending on your performance requirements.
  • Managed spot training, using Amazon EC2 Spot instances for training in order to reduce costs by up to 80%.
  • Distributed training automatically distributes large-scale training jobs on a cluster of managed instances, using advanced techniques such as data parallelism and model parallelism.
  • Pipe mode streams infinitely large datasets from Amazon S3 to the training instances, saving the need to copy data around.
  • Automatic model tuning runs hyperparameter optimization to deliver high-accuracy models more quickly.
  • Amazon SageMaker Experiments easily tracks, organizes, and compares all your SageMaker jobs.
  • Amazon SageMaker Debugger captures the internal model state during training, inspects it to observe how the model learns, detects unwanted conditions that hurt accuracy, and profiles the performance of your training job.
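
As a rough illustration of the managed spot training item above, the sketch below does two things. It builds the keyword arguments that the SageMaker SDK estimators accept for spot training (use_spot_instances, max_run, and max_wait), and it computes the savings percentage from the training time and billable time that SageMaker reports for a job. The helper names and the sample durations are hypothetical. No AWS call is made.

```python
def spot_training_kwargs(max_run_seconds, max_wait_seconds):
    """Build estimator keyword arguments for managed spot training.

    max_wait covers both training time and time spent waiting for spot
    capacity, so it must be at least as large as max_run.
    """
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_wait_seconds,
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Savings = (1 - billable time / training time) * 100."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Hypothetical job: trained for one hour, billed for 15 minutes
kwargs = spot_training_kwargs(max_run_seconds=3600, max_wait_seconds=7200)
savings = spot_savings_percent(training_seconds=3600, billable_seconds=900)
```

In a real notebook, a dictionary like kwargs would be unpacked into an estimator constructor alongside the usual parameters. Actual savings depend on spot pricing and availability at the time of the job.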

Deploying

Just as with training, Amazon SageMaker takes care of all your deployment infrastructure, and brings a slew of additional features:

  • Real-time endpoints create an HTTPS API that serves predictions from your model. As you would expect, autoscaling is available.
  • Batch transform uses a model to predict data in batch mode.
  • Amazon Elastic Inference adds fractional GPU acceleration to CPU-based endpoints to find the best cost/performance ratio for your prediction infrastructure.
  • Amazon SageMaker Model Monitor captures data sent to an endpoint and compares it with a baseline to identify and alert on data quality issues (missing features, data drift, and more).
  • Amazon SageMaker Neo compiles models for a specific hardware architecture, including embedded platforms, and deploys an optimized version using a lightweight runtime.
  • Amazon SageMaker Edge Manager helps you deploy and manage your models on edge devices.
  • Last but not least, Amazon SageMaker Pipelines lets you build end-to-end automated pipelines to run and manage your data preparation, training, and deployment workloads.
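
To show what calling a real-time endpoint can look like from client code, here is a minimal sketch built around the invoke_endpoint() API of the boto3 sagemaker-runtime client. The function takes the client as a parameter so that it can be exercised without AWS credentials. The FakeRuntimeClient class, the endpoint name, and the assumption that the model accepts and returns CSV are all hypothetical: the content types an endpoint supports depend on the container serving the model.

```python
import io


def predict_csv(runtime_client, endpoint_name, rows):
    """Send rows as CSV to an endpoint and return the raw text response."""
    body = "\n".join(",".join(str(v) for v in row) for row in rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="text/csv",
        Body=body.encode("utf-8"),
    )
    # The response body is a stream, so it has to be read explicitly
    return response["Body"].read().decode("utf-8")


class FakeRuntimeClient:
    """Hypothetical stand-in for boto3.client('sagemaker-runtime')."""

    def __init__(self):
        self.calls = []

    def invoke_endpoint(self, **kwargs):
        self.calls.append(kwargs)
        # Return one dummy score per input row
        n_rows = kwargs["Body"].decode("utf-8").count("\n") + 1
        scores = ",".join(["0.5"] * n_rows)
        return {"Body": io.BytesIO(scores.encode("utf-8"))}


fake = FakeRuntimeClient()
result = predict_csv(fake, "my-endpoint", [[1, 2], [3, 4]])
```

With a deployed endpoint, the fake client would be replaced by boto3.client('sagemaker-runtime'), and the rest of the function would stay the same.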

The Amazon SageMaker API

Just like all other AWS services, Amazon SageMaker is driven by APIs that are implemented in the language SDKs supported by AWS (https://aws.amazon.com/tools/). In addition, a dedicated Python SDK, aka the SageMaker SDK, is also available. Let's look at both, and discuss their respective benefits.

The AWS language SDKs

Language SDKs implement service-specific APIs for all AWS services: S3, EC2, and so on. Of course, they also include SageMaker APIs, which are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/api-and-sdk-reference.html.

When it comes to data science and machine learning, Python is the most popular language, so let's take a look at the SageMaker APIs available in boto3, the AWS SDK for the Python language (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html). These APIs are quite low-level and verbose: for example, create_training_job() takes a lot of JSON parameters that aren't very obvious. You can see some of them in the next screenshot. You may think that this doesn't look very appealing for everyday ML experimentation… and I would totally agree!

Figure 1.1 – A (partial) view of the create_training_job() API in boto3

Indeed, these service-level APIs are not meant to be used for experimentation in notebooks. Their purpose is automation, through either bespoke scripts or Infrastructure as Code tools such as AWS CloudFormation (https://aws.amazon.com/cloudformation) and Terraform (https://terraform.io). Your DevOps team will use them to manage production, where they do need full control over each possible parameter.
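
To give a feel for the verbosity of the service-level API, here is a minimal sketch of the kind of request that create_training_job() expects. It only builds the request dictionary and makes no AWS call. The helper name, the default values, and all the sample arguments (job name, image URI, role ARN, S3 paths) are hypothetical placeholders. The real API accepts many more optional parameters than the ones shown here.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.m5.xlarge",
                               instance_count=1,
                               hyperparameters=None):
    """Assemble the main parameters of a create_training_job() request."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": train_s3_uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # The API expects hyperparameter values to be strings
        "HyperParameters": {
            k: str(v) for k, v in (hyperparameters or {}).items()
        },
    }


request = build_training_job_request(
    job_name="my-training-job",
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-image:latest",
    role_arn="arn:aws:iam::123456789012:role/my-sagemaker-role",
    train_s3_uri="s3://my-bucket/my-training-data/",
    output_s3_uri="s3://my-bucket/output/",
    hyperparameters={"epochs": 10, "learning_rate": 0.01},
)
# With credentials configured, this would be submitted with:
# boto3.client("sagemaker").create_training_job(**request)
```

Spelling out every nested structure is exactly what makes this style a good fit for automation, and a poor fit for quick experiments.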

So, what should you use for experimentation? You should use the Amazon SageMaker SDK.

The Amazon SageMaker SDK

The Amazon SageMaker SDK (https://github.com/aws/sagemaker-python-sdk) is a Python SDK specific to Amazon SageMaker. You can find its documentation at https://sagemaker.readthedocs.io/en/stable/.

Note

Every effort has been made to check the code examples in this book with the latest SageMaker SDK (v2.58.0 at the time of writing).

Here, the abstraction level is much higher: the SDK contains objects for models, estimators, predictors, and so on. We're definitely back in ML territory.

For instance, this SDK makes it extremely easy and comfortable to fire up a training job (one line of code) and to deploy a model (one line of code). Infrastructure concerns are abstracted away, and we can focus on ML instead. Here's an example. Don't worry about the details for now:

from sagemaker.tensorflow import TensorFlow

# Configure the training job
# (my_sagemaker_role is the ARN of an IAM role that SageMaker can assume)
my_estimator = TensorFlow(
    entry_point='my_script.py',
    role=my_sagemaker_role,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    framework_version='2.1.0',
    py_version='py3')
# Train the model
my_estimator.fit('s3://my_bucket/my_training_data/')
# Deploy the model to an HTTPS endpoint
my_predictor = my_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.2xlarge')

Now that we know a little more about Amazon SageMaker, let's see how we can set it up.