Applied Supervised Learning with Python

By: Benjamin Johnston, Ishita Mathur

Overview of this book

Machine learning—the ability of a machine to give right answers based on input data—has revolutionized the way we do business. Applied Supervised Learning with Python provides a rich understanding of how you can apply machine learning techniques in your data science projects using Python. You'll explore Jupyter Notebooks, the technology used commonly in academic and commercial circles with in-line code running support. With the help of fun examples, you'll gain experience working on the Python machine learning toolkit—from performing basic data cleaning and processing to working with a range of regression and classification algorithms. Once you’ve grasped the basics, you'll learn how to build and train your own models using advanced techniques such as decision trees, ensemble modeling, validation, and error metrics. You'll also learn data visualization techniques using powerful Python libraries such as Matplotlib and Seaborn. This book also covers ensemble modeling and random forest classifiers along with other methods for combining results from multiple models, and concludes by delving into cross-validation to test your algorithm and check how well the model works on unseen data. By the end of this book, you'll be equipped to not only work with machine learning algorithms, but also be able to create some of your own!

Chapter 6: Model Evaluation

Activity 15: Final Test Project

Solution

Import the relevant libraries:

```
import pandas as pd
import numpy as np
import json

# Jupyter magic: render plots inline in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    confusion_matrix,
    precision_recall_curve,
)
```

Read the attrition_train.csv dataset into a DataFrame and print its .info() summary:
```
data = pd.read_csv('attrition_train.csv')
data.info()
```
The output will be as follows:
Figure 6.33: Output of info()
Read the JSON file with the details of the categorical variables. The JSON file contains a dictionary, where the keys are the column names of the categorical features and the corresponding values are the list of categories in the feature. This file will help us one-hot encode the categorical features into numerical features. Use the json library to load the file object into a dictionary, and print the dictionary:
```
with open('categorical_variable_values.json', 'r') as f:
    cat_values_dict = json.load(f)
cat_values_dict
```
The output will be as follows:
Figure 6.34: The JSON file
Process the dataset to convert all features to numerical values. First, find the number of columns that will stay in their original form (that is, the numerical features) and the number of columns that one-hot encoding will produce from the categorical features. data.shape[1] gives us the number of columns in data, and we subtract len(cat_values_dict) from it to get the number of numerical columns (this count still includes the Attrition target). To find the number of one-hot encoded columns, we count the total number of categories across all categorical variables in the cat_values_dict dictionary, because each category becomes its own column:
```
num_orig_cols = data.shape[1] - len(cat_values_dict)
num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
print(num_orig_cols, num_enc_cols)
```
The output will be:
```
26 24
```
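To make the counting concrete, here is a minimal sketch on a toy DataFrame. The column names, category lists, and values are hypothetical and chosen only for illustration; they are not taken from the attrition dataset.

```python
import pandas as pd

# Hypothetical toy data: two numerical features, two categorical features,
# and the Attrition target (all names and values are illustrative only).
toy = pd.DataFrame({
    'Age': [34, 45, 29],
    'MonthlyIncome': [5000, 7200, 3100],
    'Department': ['Sales', 'HR', 'Sales'],
    'Gender': ['Female', 'Male', 'Male'],
    'Attrition': [0, 1, 0],
})
toy_cat_values = {
    'Department': ['HR', 'R&D', 'Sales'],
    'Gender': ['Female', 'Male'],
}

# Columns kept as-is: everything that is not categorical (target included).
toy_num_orig_cols = toy.shape[1] - len(toy_cat_values)
# Columns produced by one-hot encoding: one per category value.
toy_num_enc_cols = sum(len(cats) for cats in toy_cat_values.values())
# The final feature matrix excludes the target column.
toy_num_features = toy_num_orig_cols + toy_num_enc_cols - 1

print(toy_num_orig_cols, toy_num_enc_cols, toy_num_features)  # 3 5 7
```

The same arithmetic on the real data gives 26 + 24 - 1 = 49 feature columns.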
Create a NumPy array of zeros as a placeholder, with a shape equal to the total number of columns, as determined previously, minus one (because the Attrition target variable is also included here). For the numerical columns, we then create a mask that selects the numerical columns from the DataFrame and assigns them to the first num_orig_cols-1 columns in the array, X:
```
X = np.zeros(shape=(data.shape[0], num_orig_cols+num_enc_cols-1))

mask = [(each not in cat_values_dict and each != 'Attrition') for each in data.columns]
X[:, :num_orig_cols-1] = data.loc[:, data.columns[mask]]
```
Next, we initialize the OneHotEncoder class from scikit-learn with a list containing the list of values in each categorical column. Then, we transform the categorical columns to one-hot encoded columns and assign them to the remaining columns in X, and save the values of the target variable in the y variable:
```
cat_cols = list(cat_values_dict.keys())
cat_values = [cat_values_dict[col] for col in data[cat_cols].columns]

# scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
ohe = OneHotEncoder(categories=cat_values, sparse_output=False)

X[:, num_orig_cols-1:] = ohe.fit_transform(X=data[cat_cols])
y = data.Attrition.values

print(X.shape)
print(y.shape)
```
The output will be:
```
(1176, 49)
(1176,)
```
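The effect of passing an explicit category list can be seen on a single toy column. This is a sketch under assumed values: the Department column and its categories are hypothetical. It calls .toarray() on the default sparse output instead of passing a dense-output flag, so it does not depend on how the installed scikit-learn version names that flag.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; 'R&D' is a known category that does
# not occur in this small sample.
toy_cats = pd.DataFrame({'Department': ['Sales', 'HR', 'Sales']})

# Fixing the category list up front keeps the number and order of output
# columns stable, regardless of which values appear in the data.
toy_ohe = OneHotEncoder(categories=[['HR', 'R&D', 'Sales']])

# The default output is a SciPy sparse matrix; .toarray() makes it dense.
toy_encoded = toy_ohe.fit_transform(toy_cats).toarray()
print(toy_encoded)
```

Each row has a single 1 in the column of its category, and the R&D column is all zeros. Because the column layout depends only on the category list, the training and test sets end up with identical encodings.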
Choose a base model and define the range of hyperparameter values corresponding to the model to be searched over for hyperparameter tuning. Let's use a gradient boosted classifier as our model. We then define ranges of values for all hyperparameters we want to tune in the form of a dictionary:
```
meta_gbc = GradientBoostingClassifier()

param_dist = {
    'n_estimators': list(range(10, 210, 10)),
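    # 'mae' and 'mse' were removed in recent scikit-learn releases;
    # current versions accept 'friedman_mse' and 'squared_error'
    'criterion': ['friedman_mse', 'squared_error'],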
    'max_features': ['sqrt', 'log2', 0.25, 0.3, 0.5, 0.8, None],
    'max_depth': list(range(1, 10)),
    'min_samples_leaf': list(range(1, 10))
}
```
Define the parameters with which to initialize the RandomizedSearchCV object and use K-fold cross-validation to find the best model hyperparameters. Define the parameters required for random search, including cv as 5, indicating that the hyperparameters should be chosen by evaluating the performance using 5-fold cross-validation. Then, initialize the RandomizedSearchCV object and use the .fit() method to begin the optimization:
```
rand_search_params = {
    'param_distributions': param_dist,
    'scoring': 'accuracy',
    'n_iter': 100,
    'cv': 5,
    'return_train_score': True,
    'n_jobs': -1,
    'random_state': 11
}
random_search = RandomizedSearchCV(meta_gbc, **rand_search_params)
random_search.fit(X, y)
```
The output will be as follows:
Figure 6.35: Output of the optimization process
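To get a feel for how much of the search space 100 iterations cover, the sketch below counts the possible combinations and draws a few samples with ParameterSampler, the scikit-learn utility for sampling parameter settings from a dictionary like this one. The criterion names used here are the ones recent scikit-learn releases accept, which is an assumption about the installed version; only the sizes of the lists matter for the count.

```python
from math import prod

from sklearn.model_selection import ParameterSampler

# Same shape as the search space used for tuning (criterion names assumed).
demo_space = {
    'n_estimators': list(range(10, 210, 10)),
    'criterion': ['friedman_mse', 'squared_error'],
    'max_features': ['sqrt', 'log2', 0.25, 0.3, 0.5, 0.8, None],
    'max_depth': list(range(1, 10)),
    'min_samples_leaf': list(range(1, 10)),
}

# Total number of distinct combinations: 20 * 2 * 7 * 9 * 9.
n_combinations = prod(len(values) for values in demo_space.values())
print(n_combinations)  # 22680

# Draw three random settings, one dictionary per candidate model.
demo_samples = list(ParameterSampler(demo_space, n_iter=3, random_state=11))
for params in demo_samples:
    print(params)
```

With 22,680 possible combinations, 100 iterations evaluate well under 1% of the grid, which is why random search is so much cheaper than an exhaustive grid search here.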
Once the tuning is complete, find the position (iteration number) at which the highest mean test score was obtained. Find the corresponding hyperparameters and save them to a dictionary:
```
idx = np.argmax(random_search.cv_results_['mean_test_score'])
final_params = random_search.cv_results_['params'][idx]
final_params
```
The output will be:
Figure 6.36: The hyperparameters dictionary
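A fitted RandomizedSearchCV object also exposes best_params_ and best_score_ attributes, which should agree with the manual argmax lookup. The sketch below checks this on synthetic data from make_classification with a deliberately small search space, so it is a stand-in for illustration and not the attrition data or the search space used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; not the attrition dataset.
demo_X, demo_y = make_classification(n_samples=200, n_features=8, random_state=11)

demo_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=11),
    param_distributions={'n_estimators': [5, 10, 20], 'max_depth': [1, 2, 3]},
    n_iter=4,
    cv=3,
    scoring='accuracy',
    random_state=11,
)
demo_search.fit(demo_X, demo_y)

# Manual lookup, as in the step above.
demo_idx = np.argmax(demo_search.cv_results_['mean_test_score'])
manual_params = demo_search.cv_results_['params'][demo_idx]

print(manual_params)
print(demo_search.best_params_, demo_search.best_score_)
```

The manual lookup is still useful when you want to inspect other columns of cv_results_ for the same candidate, such as mean_train_score.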
Split the dataset into training and validation sets and train a new model using the final hyperparameters on the training dataset. Use scikit-learn's train_test_split() method to split X and y into training and validation components, with the validation set comprising 15% of the dataset:
```
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.15, random_state=11)
print(train_X.shape, train_y.shape, val_X.shape, val_y.shape)
```
The output will be:
```
(999, 49) (999,) (177, 49) (177,)
```
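The split sizes follow from simple arithmetic: 15% of 1,176 rows is 176.4, and scikit-learn rounds the held-out size up. A quick check on placeholder arrays of the same shape (zeros, not the real features) is sketched below.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples, test_size = 1176, 0.15

# The held-out size is rounded up; the remainder goes to training.
n_val = math.ceil(n_samples * test_size)
n_train = n_samples - n_val
print(n_train, n_val)  # 999 177

# Confirm against train_test_split using placeholder arrays.
dummy_X = np.zeros((n_samples, 49))
dummy_y = np.zeros(n_samples)
tr_X, va_X, tr_y, va_y = train_test_split(
    dummy_X, dummy_y, test_size=test_size, random_state=11
)
print(tr_X.shape, va_X.shape)
```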
Train the gradient boosted classification model using the final hyperparameters and make predictions on the training and validation sets. Also calculate the predicted probability of the positive class for each row of the validation set:
```
gbc = GradientBoostingClassifier(**final_params)
gbc.fit(train_X, train_y)

preds_train = gbc.predict(train_X)
preds_val = gbc.predict(val_X)
# Column 1 of predict_proba holds the probability of the positive class
pred_probs_val = gbc.predict_proba(val_X)[:, 1]
```

Calculate the accuracy, precision, and recall for predictions on the validation set, and print the confusion matrix:

```
print('train accuracy_score = {}'.format(accuracy_score(y_true=train_y, y_pred=preds_train)))
print('validation accuracy_score = {}'.format(accuracy_score(y_true=val_y, y_pred=preds_val)))

print('confusion_matrix: \n{}'.format(confusion_matrix(y_true=val_y, y_pred=preds_val)))
print('precision_score = {}'.format(precision_score(y_true=val_y, y_pred=preds_val)))
print('recall_score = {}'.format(recall_score(y_true=val_y, y_pred=preds_val)))
```

The output will be as follows:

Figure 6.37: Accuracy, precision, recall, and the confusion matrix
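Precision and recall can be read directly off the confusion matrix, which helps when interpreting the numbers above. The following sketch uses made-up labels and predictions (not model output) to show the relationship.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions for illustration only.
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
toy_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

# For binary labels the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
print(tn, fp, fn, tp)  # 5 1 2 2

# Precision: of the rows predicted positive, how many really are positive.
manual_precision = tp / (tp + fp)
# Recall: of the rows that really are positive, how many were found.
manual_recall = tp / (tp + fn)

print(manual_precision, precision_score(toy_true, toy_pred))
print(manual_recall, recall_score(toy_true, toy_pred))
```

In an attrition setting, a false negative is an employee who leaves without being flagged, which is why the next steps focus on recall.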

Experiment with varying thresholds to find the optimal point with high recall.

Plot the precision-recall curve:

```
plt.figure(figsize=(10,7))

precision, recall, thresholds = precision_recall_curve(val_y, pred_probs_val)
plt.plot(recall, precision)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
```

The output will be as follows:

Figure 6.38: The precision-recall curve

Plot the variation in precision and recall with increasing threshold values:

```
# precision and recall have one more entry than thresholds, so pad the index with 1
PR_variation_df = pd.DataFrame({'precision': precision, 'recall': recall}, index=list(thresholds)+[1])

PR_variation_df.plot(figsize=(10,7))
plt.xlabel('Threshold')
plt.ylabel('P/R values')
plt.show()
```

The output will be as follows:

Figure 6.39: Variation in precision and recall with increasing threshold values
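The extra 1 appended to the thresholds index is needed because precision_recall_curve returns one more precision and recall value than it returns thresholds: the final pair is fixed at precision 1 and recall 0 and has no threshold of its own. A small sketch with made-up scores shows the length relationship.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and scores for illustration only.
toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

p, r, t = precision_recall_curve(toy_true, toy_scores)

# precision and recall are one element longer than thresholds.
print(len(p), len(r), len(t))
# The last point is the fixed end of the curve.
print(p[-1], r[-1])
```

The exact number of thresholds returned can differ between scikit-learn versions, but the one-element difference in length holds.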

Finalize a threshold that will be used for predictions on the test dataset. Let's finalize a value, say, 0.3. This value is entirely dependent on what you feel would be optimal based on your exploration in the previous step:
```
final_threshold = 0.3
```
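Lowering the threshold below 0.5 trades precision for recall. The sketch below uses made-up probabilities (not output from the trained model) to compare a cutoff of 0.5 with the chosen value of 0.3.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities for illustration only.
toy_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
toy_probs = np.array([0.05, 0.20, 0.35, 0.40, 0.45, 0.60, 0.10, 0.90])

threshold_results = {}
for threshold in (0.5, 0.3):
    # A row is predicted positive when its probability exceeds the cutoff.
    toy_preds = (toy_probs > threshold).astype(int)
    threshold_results[threshold] = (
        precision_score(toy_true, toy_preds),
        recall_score(toy_true, toy_preds),
    )
    print(threshold, threshold_results[threshold])
```

On this toy data, dropping the cutoff from 0.5 to 0.3 raises recall from 0.5 to 1.0 while precision falls from 1.0 to 0.8: every true positive is now caught, at the cost of one false alarm.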

Read and process the test dataset to convert all features to numerical values. This is done in the same way as for the training dataset earlier, with the only difference being that we don't need to account for the target variable column, as the test dataset does not contain it:

```
test = pd.read_csv('attrition_test.csv')
test.info()

# No target column here, so nothing is subtracted for Attrition
num_orig_cols = test.shape[1] - len(cat_values_dict)
num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
print(num_orig_cols, num_enc_cols)

test_X = np.zeros(shape=(test.shape[0], num_orig_cols+num_enc_cols))

mask = [(each not in cat_values_dict) for each in test.columns]
test_X[:, :num_orig_cols] = test.loc[:, test.columns[mask]]

cat_cols = list(cat_values_dict.keys())
cat_values = [cat_values_dict[col] for col in test[cat_cols].columns]

# scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
ohe = OneHotEncoder(categories=cat_values, sparse_output=False)

test_X[:, num_orig_cols:] = ohe.fit_transform(X=test[cat_cols])
print(test_X.shape)
```

Predict the final values on the test dataset and save them to a file. Use the final threshold value chosen earlier to convert each predicted probability on the test set into a class label. Then, write the final predictions to the final_predictions.csv file:
```
pred_probs_test = np.array([each[1] for each in gbc.predict_proba(test_X)])
preds_test = (pred_probs_test > final_threshold).astype(int)

with open('final_predictions.csv', 'w') as f:
    f.writelines([str(val)+'\n' for val in preds_test])
```
The output will be a CSV file, as follows:
Figure 6.40: The CSV file
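As an alternative to writing the lines by hand, the same one-value-per-line file can be produced with pandas. The sketch below uses placeholder predictions and writes to a temporary directory so that it does not overwrite the real output file; both are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Placeholder predictions standing in for preds_test.
demo_preds = np.array([0, 1, 0, 0, 1])

# Write to a temporary directory to avoid clobbering the real file.
out_path = os.path.join(tempfile.mkdtemp(), 'final_predictions.csv')

# index=False and header=False give one bare value per line.
pd.Series(demo_preds).to_csv(out_path, index=False, header=False)

with open(out_path) as f:
    written_lines = f.read().splitlines()
print(written_lines)
```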
