Data Science Projects with Python

By: Stephen Klosterman

Overview of this book

Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools, by applying them to realistic data problems. You will learn how to use pandas and Matplotlib to critically examine datasets with summary statistics and graphs, and extract the insights you seek to derive. You will build your knowledge as you prepare data using the scikit-learn package and feed it to machine learning algorithms such as regularized logistic regression and random forest. You'll discover how to tune algorithms to provide the most accurate predictions on new and unseen data. As you progress, you'll gain insights into the workings and output of these algorithms, building your understanding of both the predictive capabilities of the models and why they make these predictions. By the end of this book, you will have the necessary skills to confidently use machine learning algorithms to perform detailed data analysis and extract meaningful insights from unstructured data.

Chapter 3: Details of Logistic Regression and Feature Exploration


Activity 3: Fitting a Logistic Regression Model and Directly Using the Coefficients

The first few steps are similar to things we've done in previous activities:

  1. Create a train/test split (80/20) with PAY_1 and LIMIT_BAL as features:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        df[['PAY_1', 'LIMIT_BAL']].values,
        df['default payment next month'].values,
        test_size=0.2, random_state=24)
  2. Import LogisticRegression and instantiate the model with the default options, but set the solver to 'liblinear':

    from sklearn.linear_model import LogisticRegression
    lr_model = LogisticRegression(solver='liblinear')
  3. Train on the training data and obtain predicted classes, as well as class probabilities, using the testing data:

    lr_model.fit(X_train, y_train)
    y_pred = lr_model.predict(X_test)
    y_pred_proba = lr_model.predict_proba(X_test)
  4. Pull out the coefficients and intercept from the trained model and manually calculate predicted probabilities. You'll need to add a column of 1s to your features, to multiply by the intercept.

    First, let's create the array of features, with a column of 1s added, using horizontal stacking:

    ones_and_features = np.hstack([np.ones((X_test.shape[0],1)), X_test])

    Now we need the intercept and coefficients, which we reshape and concatenate from scikit-learn output:

    intercept_and_coefs = np.concatenate([lr_model.intercept_.reshape(1,1), lr_model.coef_], axis=1)

    To multiply the intercept and coefficients by all the rows of ones_and_features, and take the sum of each row (that is, to find the linear combination), you could write this all out using multiplication and addition. However, it's much faster to use the dot product:

    X_lin_comb = np.dot(intercept_and_coefs, np.transpose(ones_and_features))

    Now X_lin_comb has the argument we need to pass to the sigmoid function we defined earlier in the chapter (a minimal sketch is included below for reference), in order to calculate predicted probabilities:

    y_pred_proba_manual = sigmoid(X_lin_comb)
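
    For reference, here is a minimal sketch of the sigmoid function referred to here, assuming NumPy has been imported as np; the exact definition used earlier in the chapter may differ slightly:

    def sigmoid(X):
        # Logistic sigmoid, applied element-wise to a NumPy array
        return 1 / (1 + np.exp(-X))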
  5. Using a threshold of 0.5, manually calculate predicted classes. Compare this to the class predictions output by scikit-learn.

    The manually predicted probabilities, y_pred_proba_manual, should be the same as y_pred_proba; we'll check that momentarily. First, manually predict the classes with the threshold:

    y_pred_manual = y_pred_proba_manual >= 0.5

    This array will have a different shape than y_pred, but it should contain the same values. We can check whether all the elements of two arrays are equal like this:

    Figure 6.52: Equality of NumPy arrays
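
    Along the lines of what the figure shows, one way to write this check uses np.array_equal; the reshape accounts for the differing shapes noted above:

    # Should evaluate to True if the manual and scikit-learn predictions agree
    np.array_equal(y_pred.reshape(1,-1), y_pred_manual)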

  6. Calculate ROC AUC using both scikit-learn's predicted probabilities, and your manually predicted probabilities, and compare.

    First, import the following:

    from sklearn.metrics import roc_auc_score

    Then, calculate this metric on both versions, taking care to access the correct column, or reshape as necessary:

    Figure 6.53: Calculating the ROC AUCs from predicted probabilities
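
    As a sketch of the calculation the figure refers to (the column indexing and reshape follow from the shapes produced above):

    # The probability of the positive class is in the second column of scikit-learn's output
    roc_auc_score(y_test, y_pred_proba[:,1])

    # The manual probabilities have shape (1, n), so flatten them first
    roc_auc_score(y_test, y_pred_proba_manual.reshape(-1))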

The AUCs are, in fact, the same. What have we done here? We've confirmed that all we really need from this fitted scikit-learn model are three numbers: the intercept and the two coefficients. Once we have these, we can create model predictions with a few lines of code, using mathematical functions that are equivalent to the predictions made directly by scikit-learn.

This is good to confirm your understanding, but otherwise, why would you ever want to do this? We'll talk about model deployment in the final chapter. However, depending on your circumstances, you may be in a situation where you don't have access to Python in the environment where new features will need to be input to the model for prediction. For example, you may need to make predictions entirely in SQL. While this is a limitation in general, with logistic regression you can use mathematical functions that are available in SQL to re-create the logistic regression prediction, only needing to copy and paste the intercept and coefficients somewhere in your SQL code. The dot product may not be available, but you can use multiplication and addition to accomplish the same purpose.
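
As an illustration, here is a minimal Python sketch of that approach, using only multiplication and addition on the copied intercept and coefficients; the new feature values below are hypothetical:

    # Copy the intercept and coefficients out of the fitted model
    intercept = lr_model.intercept_[0]
    coef_pay_1, coef_limit_bal = lr_model.coef_[0]

    # Hypothetical feature values for a single new account
    new_pay_1, new_limit_bal = 1, 50000

    # The linear combination uses only multiplication and addition,
    # just as it could be written in SQL
    lin_comb = intercept + coef_pay_1 * new_pay_1 + coef_limit_bal * new_limit_bal

    # Predicted probability of default via the sigmoid
    prob_default = 1 / (1 + np.exp(-lin_comb))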

Now, what about the results themselves? What we've seen here is that we can slightly boost model performance above our previous efforts: using just LIMIT_BAL as a feature in the previous chapter's activity, the ROC AUC was a bit lower, at 0.62, compared to 0.63 here. In the next chapter, we'll learn advanced techniques with logistic regression that we can use to boost performance further.