Book Image

Data Science Projects with Python

By : Stephen Klosterman
Book Image

Data Science Projects with Python

By: Stephen Klosterman

Overview of this book

Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools, by applying them to realistic data problems. You will learn how to use pandas and Matplotlib to critically examine datasets with summary statistics and graphs, and extract the insights you seek to derive. You will build your knowledge as you prepare data using the scikit-learn package and feed it to machine learning algorithms such as regularized logistic regression and random forest. You’ll discover how to tune algorithms to provide the most accurate predictions on new and unseen data. As you progress, you’ll gain insights into the working and output of these algorithms, building your understanding of both the predictive capabilities of the models and why they make these predictions. By then end of this book, you will have the necessary skills to confidently use machine learning algorithms to perform detailed data analysis and extract meaningful insights from unstructured data.
Table of Contents (9 chapters)
Data Science Projects with Python
Preface

Chapter 4: The Bias-Variance Trade-off


Activity 4: Cross-Validation and Feature Engineering with the Case Study Data

  1. Select out the features from the DataFrame of the case study data.

    You can use the list of feature names that we've already created in this chapter. But be sure not to include the response variable, which would be a very good (but entirely inappropriate) feature:

    features = features_response[:-1]
    X = df[features].values
  2. Make a train/test split using a random seed of 24:

    X_train, X_test, y_train, y_test = train_test_split(X, df['default payment next month'].values,
    test_size=0.2, random_state=24)

    We'll use this going forward and reserve this testing data as the unseen test set. This way, we can easily create separate notebooks with other modeling approaches, using the same training data.

  3. Instantiate the MinMaxScaler to scale the data, as shown in the following code:

    from sklearn.preprocessing import MinMaxScaler
    min_max_sc = MinMaxScaler()
  4. Instantiate a logistic regression model with the saga solver, L1 penalty, and set max_iter to 1,000 as we'd like to allow the solver enough iterations to find a good solution:

    lr = LogisticRegression(solver='saga', penalty='l1', max_iter=1000)
  5. Import the Pipeline class and create a Pipeline with the scaler and the logistic regression model, using the names 'scaler' and 'model' for the steps, respectively:

    from sklearn.pipeline import Pipeline
    scale_lr_pipeline = Pipeline(steps=[('scaler', min_max_sc), ('model', lr)])
  6. Use the get_params and set_params methods to see how to view the parameters from each stage of the pipeline and change them:

    scale_lr_pipeline.get_params()
    scale_lr_pipeline.get_params()['model__C']
    scale_lr_pipeline.set_params(model__C = 2)
  7. Create a smaller range of C values to test with cross-validation, as these models will take longer to train and test with more data than our previous exercises; we recommend C = [102, 10, 1, 10-1, 10-2, 10-3]:

    C_val_exponents = np.linspace(2,-3,6)
    C_vals = np.float(10)**C_val_exponents
  8. Make a new version of the cross_val_C_search function, called cross_val_C_search_pipe. Instead of the model argument, this function will take a pipeline argument. The changes inside the function will be to set the C value using set_params(model__C = <value you want to test>) on the pipeline, replacing model with the pipeline for the fit and predict_proba methods, and accessing the C value using pipeline.get_params()['model__C'] for the printed status update.

    The changes are as follows:

    def cross_val_C_search_pipe(k_folds, C_vals, pipeline, X, Y):
    ##[…]
    pipeline.set_params(model__C = C_vals[c_val_counter])
    ##[…]
    pipeline.fit(X_cv_train, y_cv_train)
    ##[…]
    y_cv_train_predict_proba = pipeline.predict_proba(X_cv_train)
    ##[…]
    y_cv_test_predict_proba = pipeline.predict_proba(X_cv_test)
    ##[…]
    print('Done with C = {}'.format(pipeline.get_params()['model__C']))

    Note

    For the complete code, refer to http://bit.ly/2ZAy2Pr.

  9. Run this function as in the previous exercise, but using the new range of C values, the pipeline you created, and the features and response variable from the training split of the case study data. You may see warnings here, or in later steps, about the non-convergence of the solver; you could experiment with the tol or max_iter options to try and achieve convergence, although the results you obtain with max_iter = 1000 are likely to be sufficient. Here is the code to do this:

    cv_train_roc_auc, cv_test_roc_auc, cv_test_roc = \
    cross_val_C_search_pipe(k_folds, C_vals, scale_lr_pipeline, X_train, y_train)

    You will obtain the following output:

    Done with C = 100.0
    Done with C = 10.0
    Done with C = 1.0
    Done with C = 0.1
    Done with C = 0.01
    Done with C = 0.001
  10. Plot the average training and testing ROC AUC across folds, for each C value, using the following code:

    plt.plot(C_val_exponents, np.mean(cv_train_roc_auc, axis=0), '-o',
            label='Average training score')
    plt.plot(C_val_exponents, np.mean(cv_test_roc_auc, axis=0), '-x',
            label='Average testing score')
    plt.ylabel('ROC AUC')
    plt.xlabel('log$_{10}$(C)')
    plt.legend()
    plt.title('Cross validation on Case Study problem')
    np.mean(cv_test_roc_auc, axis=0)

    You will obtain the following output:

    Figure 6.54: Cross-validation testing performance

    You should notice that regularization does not impart much benefit here, as may be expected. While we are able to increase model performance over our previous efforts by using all the features available, it appears there is no overfitting going on. Instead, the training and testing scores are about the same. Instead of overfitting, it's possible that we may be underfitting. Let's try engineering some interaction features to see if they can improve performance.

  11. Create interaction features for the case study data and confirm that the number of new features makes sense using the following code:

    from sklearn.preprocessing import PolynomialFeatures
    make_interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    X_interact = make_interactions.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
    X_interact, df['default payment next month'].values,
    test_size=0.2, random_state=24)
    print(X_train.shape)
    print(X_test.shape)

    You will obtain the following output:

    (21331, 153)
    (5333, 153)

    From this you should see the new number of features is 153, which is 17 + "17 choose 2" = 17 + 136 = 153. The "17 choose 2" part comes from choosing all possible combinations of 2 features to interact from the possible 17.

  12. Repeat the cross-validation procedure and observe the model performance now; that is, repeat Steps 9 and 10. Note that this will take substantially more time, due to the larger number of features, but it will probably take only a few minutes.

    You will obtain the following output:

    Figure 6.55: Improved cross-validation testing performance from adding interaction features

So, does the average cross-validation testing performance improve with the interaction features? Is regularization useful?

Engineering the interaction features increases the best model testing score to about ROC AUC = 0.74 on average across the folds, from about 0.72 without including interactions. These scores happen at C = 100, that is, with negligible regularization. On the plot of training versus testing scores for the model with interactions, you can see that the training score is a bit higher than the testing score, so it could be said that some amount of overfitting is going on. However, we cannot increase the testing score through regularization here, so this may not be a problematic instance of overfitting. In most cases, whatever strategy yields the highest testing score is the best strategy.

We will reserve the step of fitting on all the training data for later, when we've tried other models in cross-validation to find the best model.