Machine Learning Solutions

Overview of this book

Machine learning (ML) helps you find hidden insights in your data without the need for explicit programming. This book is your key to solving any kind of ML problem you might come across in your job. You'll encounter a set of simple to complex problems while building ML models, and you'll not only resolve these problems, but also learn how to build projects based on each problem, with a practical approach and easy-to-follow examples. The book includes a wide range of applications, from analytics and NLP to computer vision. Some of the applications you will be working on include stock price prediction, a recommendation engine, a chatbot, a facial expression recognition system, and many more. The problem examples we cover include identifying the right algorithm for your dataset and use cases, creating and labeling datasets, getting enough clean data to carry out processing, identifying outliers, overfitting, hyperparameter tuning, and more. Here, you'll also learn to make more timely and accurate predictions. In addition, you'll deal with more advanced use cases, such as building a gaming bot and building an extractive summarization tool for medical documents, and you'll tackle the problems faced while building an ML model. By the end of this book, you'll be able to fine-tune your models as per your needs to deliver maximum productivity.

Optimizing the existing approach

In this section, we will cover the basics of cross-validation and hyperparameter tuning. Once we understand these basics, it will be quite easy for us to implement them.

Understanding key concepts to optimize the approach

In this revised iteration, we need to improve the accuracy of the classifier. We will cover the basic concepts first and then move on to the implementation. The two concepts we need are:

Cross-validation
Hyperparameter tuning

Cross-validation

Cross-validation (CV) is also referred to as rotation estimation. It is mainly used to detect a problem called overfitting. Let me start with the overfitting problem first, because the main purpose of using cross-validation is to detect and guard against it.

When you train a model using the training dataset and check its accuracy, you may find that the training accuracy is quite good. But when you apply this trained model to an as-yet-unseen dataset, you realize that it does not perform well: it has essentially memorized the training dataset and its target labels. So, we can say that our trained model is not able to generalize properly. This problem is called overfitting, and cross-validation helps us detect it.

In our baseline approach, we didn't use cross-validation techniques extensively. The good part is that we did hold out 25% of the training dataset as a validation set and measured the classifier accuracy on it. This is a basic technique used to get an idea of whether the classifier suffers from overfitting or not.
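As a minimal sketch of the hold-out check described above, the following example splits off 25% of the data as a validation set and compares training accuracy with validation accuracy. The synthetic data from `make_classification` and the choice of an unconstrained decision tree are assumptions made for illustration; they are not the dataset or the classifier used in this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 1,000 records with some label noise.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)

# Hold out 25% of the data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# An unconstrained decision tree is prone to memorizing its training data.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
val_acc = clf.score(X_val, y_val)
print(f"training accuracy:   {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
print(f"gap:                 {train_acc - val_acc:.3f}")
```

A large gap between the two scores is the usual symptom of overfitting; a small gap suggests that the model generalizes reasonably well.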

There are many other cross-validation techniques that will help us with two things:

Tracking the overfitting situation using CV: This will give us a clear idea of whether the model overfits. We will use K-fold CV.
Model selection using CV: Cross-validation will help us select among classification models. This will also use K-fold CV.

Now let's look at the single approach that will be used for both of these tasks. You will find the implementation easy to understand.

The approach of using CV

The scikit-learn library provides a great implementation of cross-validation. If we want to implement cross-validation, we just need to import the cross-validation utilities (in current scikit-learn releases, these live in the sklearn.model_selection module). In order to improve accuracy, we will use K-fold cross-validation. What K-fold cross-validation does is explained here.

When we use the train-test split, we train the model by using 75% of the data and validate the model by using 25% of the data. The main problem with this approach is that we are not using the whole training dataset for training. So, our model may not come across all of the situations that are present in the training dataset. This problem is solved by K-fold CV.

In K-fold CV, we need to provide a positive integer for K. The training dataset is then divided into K subsets. Let me give you an example. If you have 125 data records in your training dataset and you set K = 5, then each subset of the data gets 25 data records. So now, we have five subsets of the training dataset with 25 records each.

Let's understand how these five subsets of the dataset will be used. The value of K decides how many times we iterate over the subsets: K-fold CV runs exactly K iterations, and in each iteration a different subset is held out for testing while the remaining K-1 subsets are used for training. Here we have taken K = 5, so we iterate five times. Now let's see what happens in each of the iterations:

First iteration: We take the first subset for testing and the remaining four subsets for training.
Second iteration: We take the second subset for testing and the remaining four subsets for training.
Third iteration: We take the third subset for testing and the remaining four subsets for training.
Fourth iteration: We take the fourth subset for testing and the remaining four subsets for training.
Fifth iteration: We take the fifth subset for testing and the remaining four subsets for training. After this iteration, every subset has been used for testing exactly once, so we stop.
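To make the fold mechanics concrete, here is a small sketch using scikit-learn's `KFold` on the 125-record example with K = 5. In `KFold`, each of the K subsets serves as the test fold exactly once, and the other K-1 subsets form the training data for that iteration. The record indices are placeholders standing in for real data rows.

```python
import numpy as np
from sklearn.model_selection import KFold

# 125 placeholder record indices, matching the example in the text.
records = np.arange(125)

# K = 5 folds; shuffling is optional and used here only for illustration.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
tested = []
for i, (train_idx, test_idx) in enumerate(kfold.split(records), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    tested.extend(test_idx.tolist())
    print(f"iteration {i}: train on {len(train_idx)} records, "
          f"test on {len(test_idx)} records")

# Every record lands in a test fold exactly once across the five iterations.
print("records tested exactly once:", sorted(tested) == list(range(125)))
```

Each of the five iterations trains on 100 records and tests on 25, so the model is evaluated on every record without ever being tested on data it was trained on in that iteration.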

This approach has the following advantages:

K-fold CV uses all the data points: every record is used for training in K-1 iterations and for testing in exactly one iteration, so our model takes advantage of getting trained on all of the data points.
After every iteration, we get an accuracy score. This will help us decide how models perform.
We generally consider the mean and the standard deviation of the cross-validation scores after all the iterations have been completed. For each iteration, we track the accuracy score, and once all iterations have been done, we take the mean of the accuracy scores and derive the standard deviation (std) from them. A mean CV accuracy well below the training accuracy, or a large standard deviation across folds, suggests that the model suffers from overfitting.
If you perform this process for multiple algorithms, then based on the mean score and the standard deviation, you can also decide which algorithm works best for the given dataset.
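The points above can be sketched with scikit-learn's `cross_val_score`, which runs K-fold CV and returns one accuracy score per fold. The three candidate classifiers and the synthetic dataset are assumptions chosen for illustration, not the models or data evaluated in this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

# Hypothetical candidate models to compare.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in candidates.items():
    # cv=5 runs 5-fold CV and returns an array of five accuracy scores.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:20s} mean={scores.mean():.3f} std={scores.std():.3f}")

# Pick the candidate with the highest mean CV accuracy.
best_name = max(results, key=lambda name: results[name][0])
print("best by mean CV accuracy:", best_name)
```

Selecting on the mean alone is a simplification; when two means are close, the candidate with the smaller standard deviation is often the safer choice.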

The disadvantage of this approach is as follows:

K-fold CV is a time-consuming and computationally expensive method, because the model has to be trained K times.

By using this approach, we can ascertain whether our model suffers from overfitting or not. This technique will also help us select the ML algorithm. We will check out the implementation of this in the Implementing the revised approach section.

Now let's check out the next optimization technique, which is hyperparameter tuning.

Hyperparameter tuning

In this section, we will look at how we can use a hyperparameter-tuning technique to optimize the accuracy of our model. There are some parameters whose values cannot be learned during the training process. These parameters express higher-level properties of the ML model and are called hyperparameters. They are the tuning knobs of the ML model. You can read more about the difference between parameters and hyperparameters at this link: https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/. If we come up with the optimal values of the hyperparameters, then we will be able to achieve the best accuracy for our model, but the challenging part is that we don't know these values off the top of our heads. We could search for them by trial and error, but it is better to apply a systematic technique that gives us the best possible values for our hyperparameters, which we can then use when we perform training.

In scikit-learn, there are two search techniques that we can use in order to find these hyperparameter values, which are as follows:
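The distinction between hyperparameters and learned parameters can be illustrated with a short sketch. Here `LogisticRegression` and its regularization strength `C` are assumptions chosen for illustration: `C` is set by us before training, while `coef_` and `intercept_` are learned from the data during `fit()`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 200 records, 4 features.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# C is a hyperparameter: we choose it before training starts.
model = LogisticRegression(C=0.5, max_iter=1000)
hyperparameters = model.get_params()
print("hyperparameter C:", hyperparameters["C"])

# coef_ and intercept_ are parameters: they exist only after fit() has
# learned them from the data.
model.fit(X, y)
print("learned coefficients:", model.coef_)
print("learned intercept:   ", model.intercept_)
```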

Grid search parameter tuning
Random search parameter tuning

Grid search parameter tuning

In this section, we will look at how grid search parameter tuning works. We specify the candidate parameter values in a list called a grid. Every value specified in the grid is taken into consideration during parameter tuning: a model is built and evaluated for each combination of grid values. This technique exhaustively considers all parameter combinations and returns the combination that performs best.
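A minimal sketch of grid search with scikit-learn's `GridSearchCV` follows. The random forest estimator, the two tuned parameters, their candidate values, and the synthetic data are all assumptions chosen to keep the example small; they are not the grid used later in this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

# Hypothetical grid: 2 values x 3 values = 6 combinations.
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, 5, None],
}

# Every combination in the grid is evaluated with 3-fold CV.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

n_combinations = len(search.cv_results_["params"])
print("combinations evaluated:", n_combinations)
print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

Because each combination is also cross-validated, the number of model fits is the number of combinations multiplied by the number of folds, which is why the grid should be kept small.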

Suppose we have five parameters that we want to optimize. If we want to try 10 different values for each of the parameters, then grid search needs 10^5 (100,000) evaluations. Assume that, on average, 10 minutes are required for training each parameter combination; then the 10^5 evaluations take 10^6 minutes, which is almost two years. Sounds crazy, right? This is the main disadvantage of this technique: it is very time-consuming. So, a better solution is random search.
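The cost estimate for this example can be checked with a few lines of arithmetic. Ten values for each of five parameters gives 10^5 = 100,000 combinations, and scikit-learn's `ParameterGrid` can count them for us. The parameter names below are hypothetical placeholders, and the 10 minutes per evaluation is the figure assumed in the text.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: five parameters with ten candidate values each.
grid = {f"param_{i}": list(range(10)) for i in range(5)}
n_evaluations = len(ParameterGrid(grid))  # 10 ** 5 combinations

minutes_per_evaluation = 10  # assumed average training time from the text
total_minutes = n_evaluations * minutes_per_evaluation
total_days = total_minutes / (60 * 24)
total_years = total_days / 365

print(f"evaluations: {n_evaluations:,}")
print(f"total time:  {total_days:,.0f} days (about {total_years:.1f} years)")
```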

Random search parameter tuning

The intuition is the same as for grid search, but the main difference is that instead of trying out all possible combinations, we randomly pick a subset of the parameter combinations from the grid. To continue the previous example, in random search we take a random sample from the 10^5 combinations. Suppose that we evaluate only 1,000 of the 10^5 combinations and choose the best-performing hyperparameters among them. At 10 minutes per evaluation, this takes about a week instead of almost two years. This way, we save time.
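Random search can be sketched with scikit-learn's `RandomizedSearchCV`, where `n_iter` sets how many parameter combinations are sampled. The estimator, the parameter ranges, the value of `n_iter`, and the synthetic data are assumptions made for illustration only.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

# Hypothetical search space: distributions are sampled, lists are drawn from.
param_distributions = {
    "n_estimators": randint(10, 100),    # integers from 10 to 99
    "max_depth": [3, 5, 8, None],
    "min_samples_split": randint(2, 11),  # integers from 2 to 10
}

# Only n_iter combinations are sampled and evaluated, each with 3-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=8,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

n_sampled = len(search.cv_results_["params"])
print("combinations sampled:", n_sampled)
print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

The search space here contains 90 x 4 x 9 = 3,240 combinations, yet only 8 of them are evaluated. The trade-off is that the best sampled combination is not guaranteed to be the best one in the whole space.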

In the revised approach, we will use this particular technique to optimize the hyperparameters.

In the next section, we will see the actual implementation of K-fold cross-validation and hyperparameter tuning. So let's start implementing our approach.
