AWS Certified Machine Learning Specialty: MLS-C01 Certification Guide

By : Somanath Nanda, Weslley Moura

Overview of this book

The AWS Certified Machine Learning Specialty exam tests your competency to perform machine learning (ML) on AWS infrastructure. This book covers the entire exam syllabus using practical examples to help you with your real-world machine learning projects on AWS. Starting with an introduction to machine learning on AWS, you'll learn the fundamentals of machine learning and explore important AWS services for artificial intelligence (AI). You'll then see how to prepare data for machine learning and discover a wide variety of techniques for data manipulation and transformation for different types of variables. The book also shows you how to handle missing data and outliers and takes you through various machine learning tasks such as classification, regression, clustering, forecasting, anomaly detection, text mining, and image processing, along with the specific ML algorithms you need to know to pass the exam. Finally, you'll explore model evaluation, optimization, and deployment and get to grips with deploying models in a production environment and monitoring them. By the end of this book, you'll have gained knowledge of the key challenges in machine learning and the solutions that AWS has released for each of them, along with the tools, methods, and techniques commonly used in each domain of AWS ML.

The CRISP-DM modeling life cycle

Modeling is a very common term used in ML when we want to specify the steps taken to solve a particular problem. For example, we could create a binary classification model to predict whether those transactions from Figure 1.2 are fraudulent or not.

A model, in this context, represents all the steps taken to create a solution as a whole, which includes (but is not limited to) the algorithm. The Cross-Industry Standard Process for Data Mining, more commonly referred to as CRISP-DM, is one of the methodologies that provides guidance on the common steps we should follow to create models. This methodology is widely used in industry and is covered in the AWS Machine Learning Specialty exam:

Figure 1.4 – CRISP-DM methodology

Everything starts with business understanding, which will produce the business objectives (including success criteria), situation assessment, data mining goals, and project plan (with an initial assessment of tools and techniques). During the situation assessment, we should also look into an inventory of resources, requirements, assumptions and constraints, risks, terminology, costs, and benefits. Every single assumption and success criterion matters when we are modeling.

Then we pass on to data understanding, where we will collect raw data, describe it, explore it, and check its quality. This is an initial assessment of the data that will be used to create the model. Again, data scientists must be skeptical. You must be sure you understand all the nuances of the data and its source.

The data preparation phase is usually the one that consumes the most time during modeling. In this phase, we need to select and filter the data, clean it according to the task that needs to be performed, come up with new attributes, integrate the data with other data sources, and format it as expected by the algorithm that will be applied. These tasks are often called feature engineering.
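As a minimal sketch of these tasks, the following Python code filters, cleans, enriches, and formats a handful of records. The records, the field names (amount, country, hour), and the prepare function are all hypothetical and were invented for illustration; they do not come from the book.

```python
# Hypothetical raw records; the field names are assumptions for illustration.
raw_transactions = [
    {"amount": 120.0, "country": "US", "hour": 14},
    {"amount": None, "country": "BR", "hour": 3},   # missing amount
    {"amount": 980.5, "country": "US", "hour": 2},
    {"amount": 45.0, "country": "BR", "hour": 11},
]


def prepare(records):
    """Select, clean, enrich, and format records into numeric feature rows."""
    # Select and filter: keep only records with a valid, positive amount.
    valid = [r for r in records if r["amount"] is not None and r["amount"] > 0]

    # Format: map each category to an integer index. This is only one of
    # several possible encodings for categorical data.
    countries = sorted({r["country"] for r in valid})
    country_index = {name: i for i, name in enumerate(countries)}

    rows = []
    for r in valid:
        # New attribute: flag transactions made at night (before 06:00).
        is_night = 1 if r["hour"] < 6 else 0
        rows.append([r["amount"], country_index[r["country"]], is_night])
    return rows


features = prepare(raw_transactions)
print(features)
```

Each output row is a numeric vector, which is the shape most algorithms expect as input.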

Once the data is prepared, we can finally start the modeling phase. This is where the algorithms come in. We should start by selecting the right technique. Remember: according to the presence or absence of a target variable (and its data type), we will have different algorithms to choose from. Each modeling technique might carry some implicit assumptions of which we have to be aware. For example, if you choose a multiple linear regression algorithm to predict house prices, you should be aware that this type of model expects a linear relationship between the input variables and the target variable.
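To illustrate why this assumption matters, here is a small sketch that fits a straight line by ordinary least squares to two invented datasets: one with a linear relationship and one with a quadratic relationship. For brevity, it uses simple (single-variable) linear regression rather than multiple linear regression, and the helper names fit_line and r_squared are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented data: the same inputs with a linear and a quadratic target.
xs = [-3, -2, -1, 0, 1, 2, 3]
linear_ys = [2 * x + 1 for x in xs]
quadratic_ys = [x ** 2 for x in xs]

r2_linear = r_squared(xs, linear_ys, *fit_line(xs, linear_ys))
r2_quadratic = r_squared(xs, quadratic_ys, *fit_line(xs, quadratic_ys))
print(f"R2 on linear data: {r2_linear:.2f}")
print(f"R2 on quadratic data: {r2_quadratic:.2f}")
```

The line explains the linear data completely but explains none of the variance in the quadratic data, even though the quadratic target is fully determined by the input.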

There are hundreds of algorithms out there and each of them might have its own assumptions. After choosing the ones that you want to test in your project, you should spend some time checking their specifics. In later chapters of this book, we will cover some of them.

Important note

Some algorithms incorporate in their logic what we call feature selection. This is a step where the most important features will be selected to build your best model. Decision trees are examples of algorithms that perform feature selection automatically. We will cover feature selection in more detail later on, since there are different ways to select the best variables for your model.

During the modeling phase, you should also design a testing approach for the model, defining which evaluation metrics will be used and how the data will be split. With that in place, you can finally build the model by setting the hyperparameters of the algorithm and feeding the model with data. This process of feeding the algorithm with data to find a good estimator is known as the training process. The data used to feed the model is known as training data. There are different ways to organize the training and testing data, which we will cover in this chapter.
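The sketch below puts these pieces together under simplifying assumptions: a holdout split as the testing approach, accuracy as the evaluation metric, and a deliberately trivial one-threshold classifier standing in for a real algorithm. The data and all function names are hypothetical.

```python
import random


def split_holdout(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(threshold, rows):
    """Evaluation metric: share of rows where the prediction matches the label."""
    hits = sum(1 for x, label in rows if int(x >= threshold) == label)
    return hits / len(rows)


def train(rows):
    """Training: pick the threshold that maximizes accuracy on the training data."""
    candidates = sorted({x for x, _ in rows})
    return max(candidates, key=lambda t: accuracy(t, rows))


# Hypothetical (amount, is_fraud) pairs, invented for illustration.
data = [(20, 0), (35, 0), (50, 0), (80, 0), (120, 0),
        (400, 1), (650, 1), (700, 1), (900, 1), (1200, 1)]

train_rows, test_rows = split_holdout(data)
threshold = train(train_rows)
test_accuracy = accuracy(threshold, test_rows)
print(f"threshold={threshold}, test accuracy={test_accuracy:.2f}")
```

The model only ever sees train_rows during training; test_rows are held back so that the reported metric reflects data the model has not seen.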

Important note

ML models are defined by parameters and hyperparameters. Parameters are learned from the data. For example, a decision-tree-based algorithm might learn from the training data that a particular feature should be placed at the root of the tree, based on information gain. Hyperparameters, on the other hand, are set before training and control the learning process. Taking the same decision tree example, we could limit the maximum depth of the tree through a hyperparameter that is defined in advance, regardless of the underlying training data. Hyperparameter tuning is a very important topic in the exam and will be covered in detail later on.
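The distinction can be sketched with a tiny decision tree builder. This is a simplified, ID3-style illustration for categorical features, not the implementation used by any particular library; the data and the function names are hypothetical. The split feature is learned from the data (a parameter), while max_depth is fixed by the caller before training (a hyperparameter).

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on one feature."""
    gain = entropy(labels)
    for value in {row[feature] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain


def build_tree(rows, labels, features, max_depth):
    """max_depth is a hyperparameter: fixed before training, never learned."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1 or not features:
        return majority  # leaf node
    # The split feature is a parameter: it is learned from the training data.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    children = {}
    for value in {row[best] for row in rows}:
        pairs = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [lab for _, lab in pairs]
        children[value] = build_tree(sub_rows, sub_labels, remaining, max_depth - 1)
    return {"feature": best, "children": children}


# Hypothetical training data: two binary features and a fraud label.
rows = [
    {"foreign": 1, "night": 1},
    {"foreign": 1, "night": 0},
    {"foreign": 0, "night": 1},
    {"foreign": 0, "night": 0},
    {"foreign": 0, "night": 1},
]
labels = [1, 1, 0, 0, 0]

tree = build_tree(rows, labels, ["foreign", "night"], max_depth=2)
stump = build_tree(rows, labels, ["foreign", "night"], max_depth=0)
print(tree)
print(stump)
```

With max_depth=2, the algorithm learns that the foreign feature belongs at the root because it has the highest information gain. With max_depth=0, the same data produces only a majority-class leaf, which shows how a hyperparameter constrains what can be learned.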

Once the model is trained, we can evaluate and review the results in order to propose the next steps. If the results are not acceptable (based on our business success criteria), we should go back to earlier steps to check what else can be done to improve them. The fix might be a small adjustment to the hyperparameters of the algorithm, a new data preparation step, or even a redefinition of business drivers. On the other hand, if the model quality is acceptable, we can move to the deployment phase.

In this last phase of the CRISP-DM methodology, we have to think about the deployment plan, monitoring, and maintenance of the model. We usually look at this step from two perspectives: training and inference. The training pipeline consists of the steps needed to train the model, which include data preparation, hyperparameter definition, data splitting, and model training itself. The model artifacts produced by training must be stored, since they will be used by the next pipeline that needs to be developed: the inference pipeline.

The inference pipeline just uses model artifacts to execute the model against brand-new observations (data that has never been seen by the model during the training phase). For example, if the model was trained to identify fraudulent transactions, this is the time where new transactions will pass through the model to be classified.
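The separation between the two pipelines can be sketched as follows. This is a minimal illustration under stated assumptions: the "model" is a single learned threshold, the artifact is a JSON file in a temporary local directory (a stand-in for whatever durable storage a real deployment would use), and the function names and data are hypothetical.

```python
import json
import os
import tempfile


def training_pipeline(rows, artifact_path):
    """Train a deliberately trivial model and persist its artifacts."""
    fraud_amounts = [amount for amount, is_fraud in rows if is_fraud]
    legit_amounts = [amount for amount, is_fraud in rows if not is_fraud]
    # Learned parameter: midpoint between the two class means.
    fraud_mean = sum(fraud_amounts) / len(fraud_amounts)
    legit_mean = sum(legit_amounts) / len(legit_amounts)
    threshold = (fraud_mean + legit_mean) / 2
    with open(artifact_path, "w") as f:
        json.dump({"threshold": threshold}, f)


def inference_pipeline(new_amounts, artifact_path):
    """Load the stored artifacts and classify observations never seen in training."""
    with open(artifact_path) as f:
        artifacts = json.load(f)
    return [int(amount >= artifacts["threshold"]) for amount in new_amounts]


# Hypothetical (amount, is_fraud) history, invented for illustration.
history = [(20, 0), (40, 0), (60, 0), (800, 1), (1000, 1)]
artifact_path = os.path.join(tempfile.mkdtemp(), "model.json")

training_pipeline(history, artifact_path)                        # runs once
predictions = inference_pipeline([30, 950, 500], artifact_path)  # runs many times
print(predictions)
```

The inference pipeline never touches the training data; the stored artifact is the only thing the two pipelines share.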

In general, models are trained once (through the training pipeline) and executed many times (through the inference pipeline). However, after some time, it is expected that there will be some model degradation, also known as model drift. This phenomenon happens because the model is usually trained on a static training set that aims to represent the business scenario at a given point in time; however, businesses evolve, and it might be necessary to retrain the model on more recent data to capture new business aspects. That's why it is important to keep tracking model performance even after model deployment.
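One simple way to track performance after deployment is to compare the accuracy measured in each monitoring window against the accuracy observed at deployment time. The sketch below is one possible rule, not a standard one: the function name, the tolerance and patience values, and the weekly accuracy figures are all assumptions made for illustration.

```python
def needs_retraining(window_accuracies, baseline, tolerance=0.05, patience=2):
    """Flag retraining when accuracy stays below (baseline - tolerance)
    for `patience` consecutive monitoring windows."""
    consecutive = 0
    for acc in window_accuracies:
        if acc < baseline - tolerance:
            consecutive += 1
        else:
            consecutive = 0  # a healthy window resets the counter
        if consecutive >= patience:
            return True
    return False


# Hypothetical weekly accuracy measured on newly labeled production data.
baseline_accuracy = 0.92
degrading = [0.93, 0.91, 0.88, 0.90, 0.86, 0.85]
noisy_but_stable = [0.93, 0.91, 0.86, 0.90]

print(needs_retraining(degrading, baseline_accuracy))
print(needs_retraining(noisy_but_stable, baseline_accuracy))
```

Requiring several consecutive bad windows avoids triggering a retrain on a single noisy measurement, at the cost of reacting more slowly to genuine drift.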

The CRISP-DM methodology is so important to the context of the AWS Machine Learning Specialty exam that, if you look at the four domains covered by AWS, you will realize that they were generalized from the CRISP-DM stages: data engineering, exploratory data analysis, modeling, and ML implementation and operations.

We now understand all the key stages of a modeling pipeline and we know that the algorithm itself is just part of a broad process! Next, let's see how we can split our data to create and validate ML models.
