Book Image

Applied Supervised Learning with Python

By : Benjamin Johnston, Ishita Mathur
Book Image

Applied Supervised Learning with Python

By: Benjamin Johnston, Ishita Mathur

Overview of this book

Machine learning—the ability of a machine to give right answers based on input data—has revolutionized the way we do business. Applied Supervised Learning with Python provides a rich understanding of how you can apply machine learning techniques in your data science projects using Python. You'll explore Jupyter Notebooks, the technology used commonly in academic and commercial circles with in-line code running support. With the help of fun examples, you'll gain experience working on the Python machine learning toolkit—from performing basic data cleaning and processing to working with a range of regression and classification algorithms. Once you’ve grasped the basics, you'll learn how to build and train your own models using advanced techniques such as decision trees, ensemble modeling, validation, and error metrics. You'll also learn data visualization techniques using powerful Python libraries such as Matplotlib and Seaborn. This book also covers ensemble modeling and random forest classifiers along with other methods for combining results from multiple models, and concludes by delving into cross-validation to test your algorithm and check how well the model works on unseen data. By the end of this book, you'll be equipped to not only work with machine learning algorithms, but also be able to create some of your own!
Table of Contents (9 chapters)

Chapter 6: Model Evaluation

Activity 15: Final Test Project


  1. Import the relevant libraries:

    import pandas as pd
    import numpy as np
    import json
    %matplotlib inline
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.model_selection import RandomizedSearchCV, train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import (accuracy_score, precision_score, recall_score, confusion_matrix, precision_recall_curve)
  2. Read the attrition_train.csv dataset. Read the CSV file into a DataFrame and print the .info() of the DataFrame:

    data = pd.read_csv('attrition_train.csv')

    The output will be as follows:

    Figure 6.33: Output of info()

  3. Read the JSON file with the details of the categorical variables. The JSON file contains a dictionary, where the keys are the column names of the categorical features and the corresponding values are the list of categories in the feature. This file will help us one-hot encode the categorical features into numerical features. Use the json library to load the file object into a dictionary, and print the dictionary:

    with open('categorical_variable_values.json', 'r') as f:
        cat_values_dict = json.load(f)

    The output will be as follows:

    Figure 6.34: The JSON file

  4. Process the dataset to convert all features to numerical values. First, find the number of columns that will stay in their original form (that is, numerical features) and that need to be one-hot encoded (that is, the categorical features). data.shape[1] gives us the number of columns in data, and we subtract len(cat_values_dict) from it to get the number of numerical columns. To find the number of categorical columns, we simply count the total number of categories across all categorical variables from the cat_values_dict dictionary:

    num_orig_cols = data.shape[1] - len(cat_values_dict)
    num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
    print(num_orig_cols, num_enc_cols)

    The output will be:

    26 24

    Create a NumPy array of zeros as a placeholder, with a shape equal to the total number of columns, as determined previously, minus one (because the Attrition target variable is also included here). For the numerical columns, we then create a mask that selects the numerical columns from the DataFrame and assigns them to the first num_orig_cols-1 columns in the array, X:

    X = np.zeros(shape=(data.shape[0], num_orig_cols+num_enc_cols-1))
    mask = [(each not in cat_values_dict and each != 'Attrition') for each in data.columns]
    X[:, :num_orig_cols-1] = data.loc[:, data.columns[mask]]

    Next, we initialize the OneHotEncoder class from scikit-learn with a list containing the list of values in each categorical column. Then, we transform the categorical columns to one-hot encoded columns and assign them to the remaining columns in X, and save the values of the target variable in the y variable:

    cat_cols = list(cat_values_dict.keys())
    cat_values = [cat_values_dict[col] for col in data[cat_cols].columns]
    ohe = OneHotEncoder(categories=cat_values, sparse=False, )
    X[:, num_orig_cols-1:] = ohe.fit_transform(X=data[cat_cols])
    y = data.Attrition.values

    The output will be:

    (1176, 49)
  5. Choose a base model and define the range of hyperparameter values corresponding to the model to be searched over for hyperparameter tuning. Let's use a gradient boosted classifier as our model. We then define ranges of values for all hyperparameters we want to tune in the form of a dictionary:

    meta_gbc = GradientBoostingClassifier()
    param_dist = {
        'n_estimators': list(range(10, 210, 10)),
        'criterion': ['mae', 'mse'],
        'max_features': ['sqrt', 'log2', 0.25, 0.3, 0.5, 0.8, None],
        'max_depth': list(range(1, 10)),
        'min_samples_leaf': list(range(1, 10))
  6. Define the parameters with which to initialize the RandomizedSearchCV object and use K-fold cross-validation to find the best model hyperparameters. Define the parameters required for random search, including cv as 5, indicating that the hyperparameters should be chosen by evaluating the performance using 5-fold cross-validation. Then, initialize the RandomizedSearchCV object and use the .fit() method to begin the optimization:

    rand_search_params = {
        'param_distributions': param_dist,
        'scoring': 'accuracy',
        'n_iter': 100,
        'cv': 5,
        'return_train_score': True,
        'n_jobs': -1,
        'random_state': 11
    random_search = RandomizedSearchCV(meta_gbc, **rand_search_params), y)

    The output will be as follows:

    Figure 6.35: Output of the optimization process

    Once the tuning is complete, find the position (iteration number) at which the highest mean test score was obtained. Find the corresponding hyperparameters and save them to a dictionary:

    idx = np.argmax(random_search.cv_results_['mean_test_score'])
    final_params = random_search.cv_results_['params'][idx]

    The output will be:

    Figure 6.36: The hyperparameters dictionary

  7. Split the dataset into training and validation sets and train a new model using the final hyperparameters on the training dataset. Use scikit-learn's train_test_split() method to split X and y into train and test components, with test comprising 15% of the dataset:

    train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.15, random_state=11)
    print(train_X.shape, train_y.shape, val_X.shape, val_y.shape)

    The output will be:

    ((999, 49), (999,), (177, 49), (177,))

    Train the gradient boosted classification model using the final hyperparameters and make predictions on the training and validation sets. Also calculate the probability on the validation set:

    gbc = GradientBoostingClassifier(**final_params), train_y)
    preds_train = gbc.predict(train_X)
    preds_val = gbc.predict(val_X)
    pred_probs_val = np.array([each[1] for each in gbc.predict_proba(val_X)])
  8. Calculate the accuracy, precision, and recall for predictions on the validation set, and print the confusion matrix:

    print('train accuracy_score = {}'.format(accuracy_score(y_true=train_y, y_pred=preds_train)))
    print('validation accuracy_score = {}'.format(accuracy_score(y_true=val_y, y_pred=preds_val)))
    print('confusion_matrix: \n{}'.format(confusion_matrix(y_true=val_y, y_pred=preds_val)))
    print('precision_score = {}'.format(precision_score(y_true=val_y, y_pred=preds_val)))
    print('recall_score = {}'.format(recall_score(y_true=val_y, y_pred=preds_val)))

    The output will be as follows:

    Figure 6.37: Accuracy, precision, recall, and the confusion matrix

  9. Experiment with varying thresholds to find the optimal point with high recall.

    Plot the precision-recall curve:

    precision, recall, thresholds = precision_recall_curve(val_y, pred_probs_val)
    plt.plot(recall, precision)

    The output will be as follows:

    Figure 6.38: The precision-recall curve

    Plot the variation in precision and recall with increasing threshold values:

    PR_variation_df = pd.DataFrame({'precision': precision, 'recall': recall}, index=list(thresholds)+[1])
    plt.ylabel('P/R values')

    The output will be as follows:

    Figure 6.39: Variation in precision and recall with increasing threshold values

  10. Finalize a threshold that will be used for predictions on the test dataset. Let's finalize a value, say, 0.3. This value is entirely dependent on what you feel would be optimal based on your exploration in the previous step:

    final_threshold = 0.3
  11. Read and process the test dataset to convert all features to numerical values. This will be done in a manner similar to that in step 4, with the only difference that we don't need to account for the target variable column, as the dataset does not contain it:

    test = pd.read_csv('attrition_test.csv')
    num_orig_cols = test.shape[1] - len(cat_values_dict)
    num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
    print(num_orig_cols, num_enc_cols)
    test_X = np.zeros(shape=(test.shape[0], num_orig_cols+num_enc_cols))
    mask = [(each not in cat_values_dict) for each in test.columns]
    test_X[:, :num_orig_cols] = test.loc[:, test.columns[mask]]
    cat_cols = list(cat_values_dict.keys())
    cat_values = [cat_values_dict[col] for col in test[cat_cols].columns]
    ohe = OneHotEncoder(categories=cat_values, sparse=False, )
    test_X[:, num_orig_cols:] = ohe.fit_transform(X=test[cat_cols])
  12. Predict the final values on the test dataset and save them to a file. Use the final threshold value determined in step 10 to find the classes for each value in the training set. Then, write the final predictions to the final_predictions.csv file:

    pred_probs_test = np.array([each[1] for each in gbc.predict_proba(test_X)])
    preds_test = (pred_probs_test > final_threshold).astype(int)
    with open('final_predictions.csv', 'w') as f:
        f.writelines([str(val)+'\n' for val in preds_test])

    The output will be a CSV file, as follows:

    Figure 6.40: The CSV file