Preparing the dataset

In this chapter, we will build multiple ML models that predict whether a hotel booking will be cancelled based on the information available. Hotel cancellations cause a lot of issues for hotel owners and managers, so trying to predict which reservations will be cancelled is a good use of our ML skills.

Before we start with our ML experiments, we will need a dataset that can be used when training our ML models. We will generate a realistic synthetic dataset similar to the Hotel booking demands dataset from Nuno Antonio, Ana de Almeida, and Luis Nunes.

The synthetic dataset will have a total of 21 columns. Here are some of the columns:

  • is_cancelled: Indicates whether the hotel booking was cancelled or not
  • lead_time: [arrival date] – [booking date]
  • adr: Average daily rate
  • adults: Number of adults
  • days_in_waiting_list: Number of days a booking stayed on the waiting list before getting confirmed
  • assigned_room_type: The type of room that was assigned
  • total_of_special_requests: The total number of special requests made by the customer

We will not discuss each of the fields in detail, but this should help us understand what data is available for us to use. For more information, you can find the original version of this dataset at https://www.kaggle.com/jessemostipak/hotel-booking-demand and https://www.sciencedirect.com/science/article/pii/S2352340918315191.

Generating a synthetic dataset using a deep learning model

One of the cool applications of ML is having a deep learning model “absorb” the properties of an existing dataset and generate a new dataset with a similar set of fields and properties. We will use a pre-trained Generative Adversarial Network (GAN) model to generate the synthetic dataset.

Important note

Generative modeling involves learning patterns from the values of an input dataset, which are then used to generate a new dataset with a similar set of values. GANs are popular when it comes to generative modeling. For example, research papers have focused on how GANs can be used to generate “deepfakes,” where realistic images of humans are generated from a source dataset.
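To give a rough idea of how such a model is produced, here is a minimal sketch using the ctgan library. This is not the code used to build the pre-trained model in this chapter: the column names, sample sizes, and epoch count are illustrative only, and the exact constructor arguments and save()/load() helpers may differ slightly across ctgan versions:
    import numpy as np
    import pandas as pd
    from ctgan import CTGANSynthesizer

    # Illustrative input data only; the real dataset has 21 columns
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'adults': rng.integers(1, 4, size=1000),
        'adr': rng.uniform(50, 200, size=1000).round(2),
        'is_cancelled': rng.integers(0, 2, size=1000),
    })

    # Learn the patterns in the input data, then sample new synthetic rows
    # (in some ctgan versions, epochs is passed to fit() instead)
    gan = CTGANSynthesizer(epochs=10)
    gan.fit(df, discrete_columns=['is_cancelled'])
    synthetic_df = gan.sample(100)

    # Persist the trained synthesizer so it can be loaded again later
    gan.save('example.gan.pkl')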

Generating and using a synthetic dataset has a lot of benefits, including the following:

  • The ability to generate a much larger dataset than the original dataset that was used to train the model
  • The ability to anonymize any sensitive information in the original dataset
  • The ability to produce a cleaner version of the dataset after data generation

That said, let’s start generating the synthetic dataset by running the following commands in the terminal of our Cloud9 environment (right after the $ sign at the bottom of the screen):

  1. Continuing from where we left off in the Installing the Python prerequisites section, run the following command to create an empty directory named tmp in the current working directory:
    mkdir -p tmp

Note that this is different from the /tmp directory.

  2. Next, let’s download the utils.py file using the wget command:
    wget -O utils.py https://bit.ly/3CN4owx

The utils.py file contains the block() function, which will help us read and troubleshoot the logs generated by our scripts.

  3. Run the following command to download the pre-built GAN model into the Cloud9 environment:
    wget -O hotel_bookings.gan.pkl https://bit.ly/3CHNQFT

Here, we have a serialized pickle file that contains the properties of the deep learning model.

Important note

There are a variety of ways to save and load ML models. One of the options would be to use the Pickle module to serialize a Python object and store it in a file. This file can later be loaded and deserialized back to a Python object with a similar set of properties.
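For illustration only (this is not the exact process used to create hotel_bookings.gan.pkl), serializing an object with the pickle module and loading it back generally looks like the following sketch; many libraries wrap this kind of serialization behind their own save() and load() helpers:
    import pickle

    # Any Python object can stand in for a trained model here
    model = {'weights': [0.1, 0.2, 0.3], 'bias': 0.05}

    # Serialize the object and store it in a file
    with open('model.pkl', 'wb') as f:
        pickle.dump(model, f)

    # Later, load the file and deserialize it back into an equivalent object
    with open('model.pkl', 'rb') as f:
        restored = pickle.load(f)

    print(restored)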

  4. Create an empty data_generator.py script file using the touch command:
    touch data_generator.py

Important note

Before proceeding, make sure that the data_generator.py, hotel_bookings.gan.pkl, and utils.py files are in the same directory so that the synthetic data generator script works.

  5. Double-click the data_generator.py file in the file tree (located on the left-hand side of the Cloud9 environment) to open the empty Python script file in the editor pane.
  6. Add the following lines of code to import the prerequisites needed to run the script:
    from ctgan import CTGANSynthesizer
    from pandas_profiling import ProfileReport
    from utils import block, debug
  7. Next, let’s add the following lines of code to load the pre-trained GAN model:
    with block('LOAD CTGAN'):
        pkl = './hotel_bookings.gan.pkl'
        gan = CTGANSynthesizer.load(pkl)
        print(gan.__dict__)
  8. Run the following command in the terminal (right after the $ sign at the bottom of the screen) to test if our initial blocks of code in the script are working as intended:
    python3 data_generator.py

This should give us a set of logs similar to what is shown in the following screenshot:

Figure 1.7 – GAN model successfully loaded by the script

Here, we can see that the pre-trained GAN model was loaded successfully using the CTGANSynthesizer.load() method. We can also see what block (from the utils.py file we downloaded earlier) does to improve the readability of our logs: it simply marks the start and end of the execution of a block of code so that we can easily debug our scripts.
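While we will not inspect the contents of utils.py in this chapter, a context manager with this behavior could be written along the following lines; treat this as a sketch rather than the actual implementation:
    from contextlib import contextmanager

    @contextmanager
    def block(label):
        # Print markers before and after the wrapped code so that related
        # log lines are easy to group when scanning the terminal output
        print(f"########## START: {label} ##########")
        try:
            yield
        finally:
            print(f"########## END: {label} ##########")

    with block('EXAMPLE'):
        print('doing some work here')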

  9. Let’s go back to the editor pane (where we are editing data_generator.py) and add the following lines of code:
    with block('GENERATE SYNTHETIC DATA'):
        synthetic_data = gan.sample(10000)
        print(synthetic_data)

When we run the script later, these lines of code will generate 10000 records and store them inside the synthetic_data variable.

  10. Next, let’s add the following block of code, which will save the generated data to a CSV file inside the tmp directory:
    with block('SAVE TO CSV'):
        target_location = "tmp/bookings.all.csv"
        print(target_location)
        synthetic_data.to_csv(
            target_location, 
            index=False
        )
  11. Finally, let’s add the following lines of code to complete the script:
    with block('GENERATE PROFILE REPORT'):
        profile = ProfileReport(synthetic_data)
        target_location = "tmp/profile-report.html"
        profile.to_file(target_location)

This block of code will analyze the synthetic dataset and generate a profile report to help us analyze the properties of our dataset.

Important note

You can find a copy of the data_generator.py file here: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS/blob/main/chapter01/data_generator.py.

  12. With everything ready, let’s run the following command in the terminal (right after the $ sign at the bottom of the screen):
    python3 data_generator.py

It should take about a minute or so for the script to finish. Running the script should give us a set of logs similar to what is shown in the following screenshot:

Figure 1.8 – Logs generated by data_generator.py

As we can see, running the data_generator.py script generates multiple blocks of logs, which should make it easy for us to read and debug what’s happening while the script is running. In addition to loading the CTGAN model, the script will generate the synthetic dataset using the deep learning model (A), save the generated data in a CSV file inside the tmp directory (tmp/bookings.all.csv) (B), and generate a profile report using pandas_profiling (C).

Wasn’t that easy? Before proceeding to the next section, feel free to use the file tree (located on the left-hand side of the Cloud9 environment) to check the generated files stored in the tmp directory.

Exploratory data analysis

At this point, we should have a synthetic dataset with 10,000 rows. You might be wondering what our data looks like. Does our dataset contain invalid values? Do we have to worry about missing records? We must have a good understanding of our dataset since we may need to clean and process the data first before we do any model training work. EDA is a key step when analyzing datasets before they can be used to train ML models. There are different ways to analyze datasets and generate reports; using pandas_profiling is one of the faster ways to perform EDA.
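If you want to answer these questions directly with pandas (in addition to reading the generated report), a few quick calls are usually enough; the file path and column name below match the synthetic dataset we generated earlier:
    import pandas as pd

    df = pd.read_csv('tmp/bookings.all.csv')

    # Count missing values per column and review basic statistics
    print(df.isnull().sum())
    print(df.describe())

    # Inspect the distribution of the target column
    print(df['is_cancelled'].value_counts())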

That said, let’s check the report that was generated by the pandas_profiling Python library. Right-click on tmp/profile-report.html in the file tree (located on the left-hand side of the Cloud9 environment) and then select Preview from the list of options. We should find a report similar to the following:

Figure 1.9 – Generated report

The report has multiple sections: Overview, Variables, Interactions, Correlations, Missing values, and Sample. In the Overview section, we can find a quick summary of the dataset statistics and the variable types. This includes the number of variables, number of records (observations), number of missing cells, number of duplicate rows, and other relevant statistics. In the Variables section, we can find the statistics and the distribution of values for each variable (column) in the dataset. In the Interactions and Correlations sections, we can see different patterns and observations regarding the potential relationships of the variables in the dataset. In the Missing values section, we can see if there are columns with missing values that we need to take care of. Finally, in the Sample section, we can see the first 10 and last 10 rows of the dataset.

Feel free to read through the report before proceeding to the next section.

Train-test split

Now that we have finished performing EDA, what do we do next? Assuming that our data is clean and ready for model training, do we just use all of the 10,000 records that were generated to train and build our ML model? Before we train our binary classifier model, we must split our dataset into training and test sets:

Figure 1.10 – Train-test split

As we can see, the training set is used to build the model and update its parameters during the training phase. The test set is then used to evaluate the final version of the model on data it has not seen before. What’s not shown here is the validation set, which is used to evaluate a model to fine-tune the hyperparameters during the model training phase. In practice, the ratio when dividing the dataset into training, validation, and test sets is generally around 60:20:20, where the training set gets the majority of the records. In this chapter, we will no longer need to divide the training set further into smaller training and validation sets since the AutoML tools and services will automatically do this for us.
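For reference, a 60:20:20 split could be produced with two consecutive calls to scikit-learn’s train_test_split(); this is only a sketch and is not part of the chapter’s scripts:
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('tmp/bookings.all.csv')

    # Hold out 40% of the rows, then split that portion equally into
    # validation and test sets (0.4 * 0.5 = 0.2 of the original each)
    train_df, holdout_df = train_test_split(df, test_size=0.4, random_state=0)
    val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=0)

    print(len(train_df), len(val_df), len(test_df))  # roughly 6000, 2000, 2000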

Important note

Before proceeding with the hands-on solutions in this section, we must have an idea of what hyperparameters and parameters are. Parameters are the numerical values that a model uses when performing predictions. We can think of model predictions as functions such as y = m * x, where m is a parameter, x is a single predictor variable, and y is the target variable. For example, if we are testing the relationship between cancellations (y) and income (x), then m is the parameter that defines this relationship. If m is positive, cancellations go up as income goes up. If it is negative, cancellations lessen as income increases. On the other hand, hyperparameters are configurable values that are tweaked before the model is trained. These variables affect how our chosen ML models “model” the relationship. Each ML model has its own set of hyperparameters, depending on the algorithm used. These concepts will make more sense once we have looked at a few more examples in Chapter 2, Deep Learning AMIs, and Chapter 3, Deep Learning Containers.
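To make the distinction concrete, here is a small example using scikit-learn (again, not part of the chapter’s scripts): the regularization strength alpha is a hyperparameter we choose before training, while the learned coefficient and intercept are parameters.
    import numpy as np
    from sklearn.linear_model import Ridge

    # Toy data: a single predictor x and a target y
    x = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.1, 3.9, 6.2, 7.8])

    # alpha is a hyperparameter: we configure it before training
    model = Ridge(alpha=0.5)
    model.fit(x, y)

    # coef_ and intercept_ are parameters: the model learns them from the data
    print(model.coef_, model.intercept_)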

Now, let’s create a script that will help us perform the train-test split:

  1. In the terminal of our Cloud9 environment (right after the $ sign at the bottom of the screen), run the following command to create an empty file called train_test_split.py:
    touch train_test_split.py
  2. Using the file tree (located on the left-hand side of the Cloud9 environment), double-click the train_test_split.py file to open the file in the editor pane.
  3. In the editor pane, add the following lines of code to import the prerequisites to run the script:
    import pandas as pd
    from utils import block, debug
    from sklearn.model_selection import train_test_split
  4. Add the following block of code, which will read the contents of a CSV file and store it inside a pandas DataFrame:
    with block('LOAD CSV'):
        generated_df = pd.read_csv('tmp/bookings.all.csv')
  5. Next, let’s use the train_test_split() function from scikit-learn to divide the dataset we have generated into a training set and a test set:
    with block('TRAIN-TEST SPLIT'):
        train_df, test_df = train_test_split(
            generated_df, 
            test_size=0.3, 
            random_state=0
        )
        print(train_df)
        print(test_df)
  6. Lastly, add the following lines of code to save the training and test sets into their respective CSV files inside the tmp directory:
    with block('SAVE TO CSVs'):
        train_df.to_csv('tmp/bookings.train.csv', 
                        index=False)
        test_df.to_csv('tmp/bookings.test.csv', 
                       index=False)

Important note

You can find a copy of the train_test_split.py file here: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS/blob/main/chapter01/train_test_split.py.

  7. Now that we have completed our script file, let’s run the following command in the terminal (right after the $ sign at the bottom of the screen):
    python3 train_test_split.py

This should generate a set of logs similar to what is shown in the following screenshot:

Figure 1.11 – Train-test split logs

Here, we can see that our training dataset contains 7,000 records, while the test set contains 3,000 records, since test_size=0.3 reserves 30% of the 10,000 generated records for the test set.

With this, we can upload our dataset to Amazon S3.

Uploading the dataset to Amazon S3

Amazon S3 is the object storage service of AWS, where we can store different types of files, such as dataset CSV files and output artifacts. When working with the different services of AWS, it is important to note that these services sometimes require the input data and files to be stored in an S3 bucket first, or in a resource created using another service.

Uploading the dataset to S3 should be easy. Continuing where we left off in the Train-test split section, we will run the following commands in the terminal:

  1. Run the following commands in the terminal. Here, we are going to create a new S3 bucket that will contain the data we will be using in this chapter. Make sure that you replace the value of <INSERT BUCKET NAME HERE> with a bucket name that is globally unique across all AWS users:
    BUCKET_NAME="<INSERT BUCKET NAME HERE>"
    aws s3 mb s3://$BUCKET_NAME

For more information on S3 bucket naming rules, feel free to check out https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html.

  2. Now that the S3 bucket has been created, let’s upload the training and test datasets using the AWS CLI:
    S3=s3://$BUCKET_NAME/datasets/bookings
    TRAIN=bookings.train.csv
    TEST=bookings.test.csv
    aws s3 cp tmp/bookings.train.csv $S3/$TRAIN
    aws s3 cp tmp/bookings.test.csv $S3/$TEST
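If you prefer to perform the upload from Python instead of the AWS CLI, boto3 can achieve the same result; this sketch assumes that boto3 is installed and configured with the same credentials, and the bucket name placeholder must be replaced just like before:
    import boto3

    bucket_name = '<INSERT BUCKET NAME HERE>'  # same bucket created earlier
    s3 = boto3.client('s3')

    # Upload the training and test CSV files to the same S3 prefix
    s3.upload_file('tmp/bookings.train.csv', bucket_name,
                   'datasets/bookings/bookings.train.csv')
    s3.upload_file('tmp/bookings.test.csv', bucket_name,
                   'datasets/bookings/bookings.test.csv')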

Now that everything is ready, we can proceed with the exciting part! It’s about time we perform multiple AutoML experiments using a variety of solutions and services.