Data Science Projects with Python

By: Stephen Klosterman

Overview of this book

Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools, by applying them to realistic data problems. You will learn how to use pandas and Matplotlib to critically examine datasets with summary statistics and graphs, and extract the insights you seek. You will build your knowledge as you prepare data using the scikit-learn package and feed it to machine learning algorithms such as regularized logistic regression and random forest. You'll discover how to tune algorithms to provide the most accurate predictions on new and unseen data. As you progress, you'll gain insights into the workings and output of these algorithms, building your understanding of both the predictive capabilities of the models and why they make the predictions they do. By the end of this book, you will have the necessary skills to confidently use machine learning algorithms to perform detailed data analysis and extract meaningful insights from unstructured data.

Chapter 2: Introduction to Scikit-Learn and Model Evaluation


Activity 2: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve

  1. Use scikit-learn's train_test_split to make a new set of training and testing data. This time, instead of EDUCATION, use LIMIT_BAL: the account's credit limit.

    Execute the following code to do this:

    X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
        df['LIMIT_BAL'].values.reshape(-1,1),
        df['default payment next month'].values,
        test_size=0.2, random_state=24)

    Notice here we create new training and testing splits, with new variable names.
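
    For reference, the snippets in this activity reuse imports and objects created in the earlier exercises of this chapter. Here is a minimal sketch of the setup they assume; the file name cleaned_data.csv and the default model settings are illustrative assumptions, not the book's exact code:

    # Minimal setup sketch; the file name below is a hypothetical placeholder
    # for the cleaned case study data used in the earlier exercises.
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    df = pd.read_csv('cleaned_data.csv')

    # The model object reused throughout this activity; the book configures
    # this object in an earlier exercise, so the defaults here are placeholders.
    example_lr = LogisticRegression()

    Also note the reshape(-1,1) in the code above: scikit-learn expects a two-dimensional array of features, even when there is only a single feature column.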

  2. Train a logistic regression model using the training data from your split.

    The following code does this:

    example_lr.fit(X_train_2, y_train_2)

    We reuse the same model object, example_lr. We can re-train this object to learn the relationship between this new feature and the response. We could even try a different train/test split, if we wanted to, without creating a new model object. The existing model object has been updated in-place.
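
    If you want to confirm that refitting has overwritten the previous fit, a small optional check (not part of the book's solution) is to inspect the learned parameters after calling fit:

    # Optional check: the fitted parameters now reflect the LIMIT_BAL feature.
    print(example_lr.coef_)       # coefficient array of shape (1, 1): one feature
    print(example_lr.intercept_)  # intercept array of shape (1,)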

  3. Create the array of predicted probabilities for the testing data:

    Here is the code for this step:

    y_test_2_pred_proba = example_lr.predict_proba(X_test_2)

  4. Calculate the ROC AUC using the predicted probabilities and the true labels of the testing data. Compare this to the ROC AUC from using the EDUCATION feature:

    Run this code for this step:

    metrics.roc_auc_score(y_test_2, y_test_2_pred_proba[:,1])

    The output is as follows:

    Figure 6.47: Calculating the ROC AUC

    Notice we index the predicted probabilities array in order to get the predicted probability of the positive class from the second column. How does this compare to the ROC AUC from the EDUCATION logistic regression? The AUC is higher. This may be because now we are using a feature that has something to do with an account's financial status (credit limit), to predict something else related to the account's financial status (whether or not it will default), instead of using something less directly related to finances.
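
    If you are ever unsure which column holds the positive class, an optional check is to compare the shape of the probability array with the classes_ attribute of the fitted model; the columns of predict_proba are ordered to match classes_:

    # Optional check: predict_proba columns line up with example_lr.classes_
    print(y_test_2_pred_proba.shape)  # (number of test samples, 2)
    print(example_lr.classes_)        # [0 1]; column index 1 is the positive class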

  5. Plot the ROC curve.

    Here is the code to do this; it's similar to the code we used in the previous exercise:

    fpr_2, tpr_2, thresholds_2 = metrics.roc_curve(y_test_2, y_test_2_pred_proba[:,1])
    plt.plot(fpr_2, tpr_2, '*-')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.legend(['Logistic regression', 'Random chance'])
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.title('ROC curve for logistic regression with LIMIT_BAL feature')

    The plot should appear as follows:

    Figure 6.48: ROC curve for the LIMIT_BAL logistic regression

    This looks a little closer to the kind of ROC curve we'd like to see: it's a bit further from the random chance line than the curve for the model using only EDUCATION. Also notice that the variation in pairs of true and false positive rates is a little smoother over the range of thresholds, reflecting the larger number of distinct values of the LIMIT_BAL feature.
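
    You can verify the point about distinct values directly with a quick, optional check (assuming df is the same case study DataFrame used above):

    # Optional check: LIMIT_BAL takes many more distinct values than EDUCATION,
    # so the ROC curve is traced out over many more threshold points.
    print(df['EDUCATION'].nunique())
    print(df['LIMIT_BAL'].nunique())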

  6. Calculate the data for the precision-recall curve on the testing data using scikit-learn functionality.

    Precision is often considered in tandem with recall. You are already familiar with recall. This is just another word for the true positive rate. We can use precision_recall_curve in sklearn.metrics to automatically vary the threshold and calculate pairs of precision and recall values at each one. Here is the code to retrieve these values, which is similar to roc_curve:

    precision, recall, thresh_3 = \
    metrics.precision_recall_curve(y_test_2, y_test_2_pred_proba[:,1])
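
    To make these definitions concrete, here is an optional sketch that computes precision and recall by hand at a single threshold drawn from the output above; precision_recall_curve repeats this kind of calculation across many thresholds:

    # Optional sketch: precision and recall at one threshold, computed by hand
    # from a confusion matrix. The threshold is taken from the middle of the
    # thresholds returned above, so at least one positive prediction is made.
    t = thresh_3[len(thresh_3)//2]
    y_pred_t = (y_test_2_pred_proba[:,1] >= t).astype(int)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test_2, y_pred_t).ravel()
    print(tp / (tp + fp))  # precision: fraction of predicted positives that are correct
    print(tp / (tp + fn))  # recall: fraction of actual positives that are identified
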
  7. Plot the precision-recall curve using Matplotlib. We can do this with the following code.

    Note that we put recall on the x-axis, precision on the y-axis, and set the axes limits to the range [0, 1]:

    plt.plot(recall, precision, '-x')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Another logistic regression with just one feature: LIMIT_BAL')
    plt.xlim([0, 1])
    plt.ylim([0, 1])

    The plot should appear as follows:

    Figure 6.49: Plot of the precision-recall curve

  8. Use scikit-learn to calculate the area under the precision-recall curve.

    Here is the code for this:

    metrics.auc(recall, precision)

    You will obtain the following output:

    Figure 6.50: Area under the precision-recall curve

    The precision-recall curve shows that precision is generally fairly low for this model: for nearly the entire range of thresholds, the precision, or the proportion of positive classifications that are correct, is less than half. We can calculate the area under the precision-recall curve as a way to compare this classifier with other models or feature sets we may consider.

    Scikit-learn offers functionality for calculating an area under the curve for any set of x-y data, using the trapezoid rule, which you may recall from calculus: metrics.auc. We used this functionality to get the area under the precision-recall curve.
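
    As an optional sanity check of what metrics.auc does, you can compare it with NumPy's trapezoidal integration on a simple shape whose area is easy to work out by hand:

    # Optional check: metrics.auc implements the trapezoid rule.
    # A triangle with base 1 and height 1 has area 0.5.
    import numpy as np
    x = [0, 0.5, 1]
    y = [0, 1, 0]
    print(metrics.auc(x, y))  # 0.5
    print(np.trapz(y, x))     # 0.5, the same trapezoidal estimate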

  9. Now recalculate the ROC AUC, except this time do it for the training data. How is this different, conceptually and quantitatively, from your earlier calculation?

    First, we need to calculate predicted probabilities using the training data, as opposed to the testing data. Then we can calculate the ROC AUC using the training data labels. Here is the code:

    y_train_2_pred_proba = example_lr.predict_proba(X_train_2)
    metrics.roc_auc_score(y_train_2, y_train_2_pred_proba[:,1])

    You should obtain the following output:

    Figure 6.51: Training data ROC AUC

Quantitatively, we can see that this AUC is not all that different from the testing data ROC AUC we calculated earlier. Both are about 0.62. Conceptually, what is the difference? When we calculate this metric on the training data, we are measuring the model's skill in predicting the same data that "taught" the model how to make predictions. We are seeing how well the model fits the data. When we compare this to a testing data metric, we are comparing training and testing scores. If there were much of a difference between these scores, usually in the form of a higher training score than testing score, it would indicate that although the model fits the training data well, it does not generalize well to new, unseen data.

In this case, the training and testing scores are similar, meaning the model does about as well on out-of-sample data as it does on the same data used in model training. We will learn more about the insights we can gain by comparing training and testing scores in Chapter 4, The Bias-Variance Trade-off.
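
A compact way to make this comparison explicit is to compute both scores side by side and look at the gap; the following optional sketch simply summarizes the calculations already performed above:

    # Optional summary: compare training and testing ROC AUC for the LIMIT_BAL model.
    train_auc = metrics.roc_auc_score(y_train_2, example_lr.predict_proba(X_train_2)[:,1])
    test_auc = metrics.roc_auc_score(y_test_2, example_lr.predict_proba(X_test_2)[:,1])
    print('Training ROC AUC: {:.3f}'.format(train_auc))
    print('Testing ROC AUC:  {:.3f}'.format(test_auc))
    print('Difference (train - test): {:.3f}'.format(train_auc - test_auc))
    # A large positive difference would suggest overfitting; here the two are similar.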