Data Science Projects with Python

By: Stephen Klosterman

Overview of this book

Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools by applying them to realistic data problems. You will learn how to use pandas and Matplotlib to critically examine datasets with summary statistics and graphs, and extract the insights you seek to derive. You will build your knowledge as you prepare data using the scikit-learn package and feed it to machine learning algorithms such as regularized logistic regression and random forest. You'll discover how to tune algorithms to provide the most accurate predictions on new and unseen data. As you progress, you'll gain insights into the workings and output of these algorithms, building your understanding of both the predictive capabilities of the models and why they make these predictions. By the end of this book, you will have the necessary skills to confidently use machine learning algorithms to perform detailed data analysis and extract meaningful insights from unstructured data.

Preface

Data Exploration and Cleaning

Introduction

Python and the Anaconda Package Management System

Different Types of Data Science Problems

Loading the Case Study Data with Jupyter and pandas

Data Quality Assurance and Exploration

Exploring the Financial History Features in the Dataset

Summary

Introduction to Scikit-Learn and Model Evaluation

Introduction

Exploring the Response Variable and Concluding the Initial Exploration

Introduction to Scikit-Learn

Model Performance Metrics for Binary Classification

Summary

Details of Logistic Regression and Feature Exploration

Introduction

Examining the Relationships between Features and the Response

Univariate Feature Selection: What It Does and Doesn't Do

Summary

The Bias-Variance Trade-off

Introduction

Estimating the Coefficients and Intercepts of Logistic Regression

Cross Validation: Choosing the Regularization Parameter and Other Hyperparameters

Summary

Decision Trees and Random Forests

Introduction

Decision Trees

Random Forests: Ensembles of Decision Trees

Summary

Imputation of Missing Data, Financial Analysis, and Delivery to Client

Introduction

Review of Modeling Results

Dealing with Missing Data: Imputation Strategies

Final Thoughts on Delivering the Predictive Model to the Client

Summary

Appendix

Chapter 1: Data Exploration and Cleaning

Chapter 2: Introduction to Scikit-Learn and Model Evaluation

Chapter 3: Details of Logistic Regression and Feature Exploration

Chapter 4: The Bias-Variance Trade-off

Chapter 5: Decision Trees and Random Forests

Chapter 6: Imputation of Missing Data, Financial Analysis, and Delivery to Client

Chapter 4: The Bias-Variance Trade-off

Activity 4: Cross-Validation and Feature Engineering with the Case Study Data

Select out the features from the DataFrame of the case study data.
You can use the list of feature names that we've already created in this chapter. But be sure not to include the response variable, which would be a very good (but entirely inappropriate) feature:
```
features = features_response[:-1]
X = df[features].values
```
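As a minimal sketch of why the last entry is dropped, the following uses a tiny made-up DataFrame in place of the case study data. The names `df_demo`, `features_response_demo`, and `features_demo` are hypothetical, the values are invented, and we assume (as in the step above) that the response variable is the last item in the list:
```python
import pandas as pd

# Hypothetical miniature stand-in for the case study DataFrame; the values are made up
df_demo = pd.DataFrame({
    'LIMIT_BAL': [20000, 120000, 90000],
    'AGE': [24, 26, 34],
    'default payment next month': [1, 1, 0],
})

# Assumption: the list holds the features first and the response variable last
features_response_demo = df_demo.columns.tolist()
features_demo = features_response_demo[:-1]  # drop the response variable

X_demo = df_demo[features_demo].values
print(features_demo)
print(X_demo.shape)
```
If the response were left in the list, the model would effectively be handed the answer as an input.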
Make a train/test split using a random seed of 24:
```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, df['default payment next month'].values,
    test_size=0.2, random_state=24)
```
We'll use this going forward and reserve this testing data as the unseen test set. This way, we can easily create separate notebooks with other modeling approaches, using the same training data.
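The reason a fixed seed lets separate notebooks share the same training data can be checked with a small sketch. The arrays here are synthetic stand-ins (an assumption, not the case study data), and only the `test_size` and `random_state` values are taken from the step above:
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and response (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Two independent calls with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=24)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=24)

# Compare X_train, X_test, y_train, y_test pairwise between the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(split_a[0].shape, split_a[1].shape)
```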
Instantiate the MinMaxScaler to scale the data, as shown in the following code:
```
from sklearn.preprocessing import MinMaxScaler
min_max_sc = MinMaxScaler()
```
Instantiate a logistic regression model with the saga solver and L1 penalty, and set max_iter to 1,000 to allow the solver enough iterations to find a good solution:
```
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='saga', penalty='l1', max_iter=1000)
```
Import the Pipeline class and create a Pipeline with the scaler and the logistic regression model, using the names 'scaler' and 'model' for the steps, respectively:
```
from sklearn.pipeline import Pipeline
scale_lr_pipeline = Pipeline(steps=[('scaler', min_max_sc), ('model', lr)])
```
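One useful property of a pipeline is that calling fit learns the scaling from the data passed to fit only, and later predictions reuse that scaling. The following is a rough sketch on synthetic data; `demo_pipeline`, `X_demo`, and `y_demo` are hypothetical names and the numbers are randomly generated:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, unscaled data standing in for the case study features (assumption)
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50, scale=10, size=(200, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=5, size=200) > 50).astype(int)

demo_pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression(solver='saga', penalty='l1', max_iter=1000)),
])

# Fit on the first 150 rows only; the scaler and the model are fit in sequence
demo_pipeline.fit(X_demo[:150], y_demo[:150])

# The fitted scaler holds the minimum of the 150 fitting rows, not of all 200
scaler = demo_pipeline.named_steps['scaler']
train_min = X_demo[:150].min(axis=0)

# Held-out rows are scaled with the stored min/max before prediction
proba = demo_pipeline.predict_proba(X_demo[150:])
print(proba.shape)
```
This matters in cross-validation, where each testing fold should be scaled using only what was learned from the corresponding training fold.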
Use the get_params and set_params methods to see how to view the parameters from each stage of the pipeline and change them:
```
scale_lr_pipeline.get_params()
scale_lr_pipeline.get_params()['model__C']
scale_lr_pipeline.set_params(model__C = 2)
```
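The naming convention behind 'model__C' is the step name, two underscores, then the parameter name. A small sketch, using a hypothetical `demo_pipeline` with a default LogisticRegression so it runs on its own:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

demo_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()),
                                ('model', LogisticRegression())])

# Parameters of each step are exposed as '<step name>__<parameter name>'
model_param_names = sorted(name for name in demo_pipeline.get_params()
                           if name.startswith('model__'))
print(model_param_names)

default_C = demo_pipeline.get_params()['model__C']
demo_pipeline.set_params(model__C=2)

# set_params changes the underlying estimator object in place
updated_C = demo_pipeline.named_steps['model'].C
print(default_C, updated_C)
```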
Create a smaller range of C values to test with cross-validation, as these models will take longer to train and test with more data than our previous exercises; we recommend C = [10^2, 10, 1, 10^-1, 10^-2, 10^-3]:
```
C_val_exponents = np.linspace(2,-3,6)
C_vals = 10.0 ** C_val_exponents
```
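As a quick check of the recommended values, the sketch below builds the same six values and compares them with `np.logspace`, which is one alternative way to produce powers of 10 (our suggestion, not something the activity requires):
```python
import numpy as np

# Exponents 2, 1, 0, -1, -2, -3
C_val_exponents = np.linspace(2, -3, 6)
C_vals = 10.0 ** C_val_exponents

# np.logspace raises 10 to evenly spaced exponents in a single call
C_vals_logspace = np.logspace(2, -3, num=6)

print(C_vals)
print(np.allclose(C_vals, C_vals_logspace))
```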
Make a new version of the cross_val_C_search function, called cross_val_C_search_pipe. Instead of the model argument, this function will take a pipeline argument. The changes inside the function will be to set the C value using set_params(model__C = <value you want to test>) on the pipeline, replacing model with the pipeline for the fit and predict_proba methods, and accessing the C value using pipeline.get_params()['model__C'] for the printed status update.
The changes are as follows:
```
def cross_val_C_search_pipe(k_folds, C_vals, pipeline, X, Y):
##[…]
pipeline.set_params(model__C = C_vals[c_val_counter])
##[…]
pipeline.fit(X_cv_train, y_cv_train)
##[…]
y_cv_train_predict_proba = pipeline.predict_proba(X_cv_train)
##[…]
y_cv_test_predict_proba = pipeline.predict_proba(X_cv_test)
##[…]
print('Done with C = {}'.format(pipeline.get_params()['model__C']))
```
Note
For the complete code, refer to http://bit.ly/2ZAy2Pr.
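To make the excerpt easier to follow, here is a simplified, self-contained sketch of the same idea. It is not the complete function from the link: `cv_c_search_sketch` is a hypothetical name, it returns only the training and testing ROC AUC arrays, and we assume that `k_folds` is a scikit-learn splitter such as StratifiedKFold. The data is synthetic:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def cv_c_search_sketch(k_folds, C_vals, pipeline, X, Y):
    """Hypothetical, simplified stand-in for cross_val_C_search_pipe."""
    n_folds = k_folds.get_n_splits()
    # One row per fold, one column per C value
    train_auc = np.empty((n_folds, len(C_vals)))
    test_auc = np.empty((n_folds, len(C_vals)))
    for c_idx, C in enumerate(C_vals):
        # Set C on the 'model' step of the pipeline
        pipeline.set_params(model__C=C)
        for fold_idx, (train_idx, test_idx) in enumerate(k_folds.split(X, Y)):
            X_cv_train, X_cv_test = X[train_idx], X[test_idx]
            y_cv_train, y_cv_test = Y[train_idx], Y[test_idx]
            # Fits the scaler and the model on the training fold only
            pipeline.fit(X_cv_train, y_cv_train)
            train_auc[fold_idx, c_idx] = roc_auc_score(
                y_cv_train, pipeline.predict_proba(X_cv_train)[:, 1])
            test_auc[fold_idx, c_idx] = roc_auc_score(
                y_cv_test, pipeline.predict_proba(X_cv_test)[:, 1])
        print('Done with C = {}'.format(pipeline.get_params()['model__C']))
    return train_auc, test_auc


# Synthetic data standing in for the training split (assumption)
rng = np.random.default_rng(24)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=300) > 0).astype(int)

demo_pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression(solver='saga', penalty='l1', max_iter=1000)),
])
demo_folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
demo_C_vals = 10.0 ** np.linspace(2, -3, 6)

demo_train_auc, demo_test_auc = cv_c_search_sketch(
    demo_folds, demo_C_vals, demo_pipeline, X_demo, y_demo)
print(np.mean(demo_test_auc, axis=0))
```
The returned arrays have one row per fold and one column per C value, which is why averaging with axis=0 gives one score per C value.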
Run this function as in the previous exercise, but using the new range of C values, the pipeline you created, and the features and response variable from the training split of the case study data. You may see warnings here, or in later steps, about the non-convergence of the solver; you could experiment with the tol or max_iter options to try to achieve convergence, although the results you obtain with max_iter = 1000 are likely to be sufficient. Here is the code to do this:
```
cv_train_roc_auc, cv_test_roc_auc, cv_test_roc = \
    cross_val_C_search_pipe(k_folds, C_vals, scale_lr_pipeline, X_train, y_train)
```
You will obtain the following output:
```
Done with C = 100.0
Done with C = 10.0
Done with C = 1.0
Done with C = 0.1
Done with C = 0.01
Done with C = 0.001
```
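If you want to look into the convergence warnings mentioned in this step, one option is to compare the number of iterations the solver actually used, stored in the fitted model's `n_iter_` attribute, against max_iter. The sketch below does this on synthetic data for a deliberately small and a larger max_iter; the data and the variable names are assumptions, and whether a warning appears will depend on your data and scikit-learn version:
```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data already in the [0, 1] range, as if min-max scaled (assumption)
rng = np.random.default_rng(3)
X_demo = rng.uniform(size=(500, 10))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.3, size=500) > 0.5).astype(int)

n_iter_used = {}
warned = {}
for max_iter in (5, 1000):
    lr_demo = LogisticRegression(solver='saga', penalty='l1', max_iter=max_iter)
    # Record warnings instead of printing them, so we can inspect them
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        lr_demo.fit(X_demo, y_demo)
    n_iter_used[max_iter] = int(lr_demo.n_iter_[0])
    warned[max_iter] = any(issubclass(w.category, ConvergenceWarning)
                           for w in caught)

print(n_iter_used)
print(warned)
```
A run that stops at exactly max_iter iterations is a hint that the solver ran out of iterations rather than meeting the tol stopping criterion.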
Plot the average training and testing ROC AUC across folds, for each C value, using the following code:
```
plt.plot(C_val_exponents, np.mean(cv_train_roc_auc, axis=0), '-o',
        label='Average training score')
plt.plot(C_val_exponents, np.mean(cv_test_roc_auc, axis=0), '-x',
        label='Average testing score')
plt.ylabel('ROC AUC')
plt.xlabel('log$_{10}$(C)')
plt.legend()
plt.title('Cross validation on Case Study problem')
np.mean(cv_test_roc_auc, axis=0)
```
You will obtain the following output:
Figure 6.54: Cross-validation testing performance
You should notice that regularization does not impart much benefit here, as may be expected. While we are able to increase model performance over our previous efforts by using all the features available, it appears there is no overfitting going on: the training and testing scores are about the same. Rather than overfitting, it's possible that we are underfitting. Let's try engineering some interaction features to see if they can improve performance.

Create interaction features for the case study data and confirm that the number of new features makes sense using the following code:

```
from sklearn.preprocessing import PolynomialFeatures
make_interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = make_interactions.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_interact, df['default payment next month'].values,
    test_size=0.2, random_state=24)
print(X_train.shape)
print(X_test.shape)
```

You will obtain the following output:

```
(21331, 153)
(5333, 153)
```

From this you should see the new number of features is 153, which is 17 + "17 choose 2" = 17 + 136 = 153. The "17 choose 2" part comes from choosing all possible combinations of 2 features to interact from the possible 17.
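The arithmetic can be double-checked with a short sketch. It counts the pairs directly and then confirms the column count that PolynomialFeatures produces for a random 17-column array (synthetic data, used only for its shape):
```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 17
n_pairs = comb(n_features, 2)      # "17 choose 2"
n_total = n_features + n_pairs     # original features plus pairwise interactions

# Cross-check by listing every unordered pair of feature indices
pairs = list(combinations(range(n_features), 2))

# Synthetic array with 17 columns, used only to check the output shape
X_demo = np.random.default_rng(0).normal(size=(5, n_features))
X_demo_interact = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False).fit_transform(X_demo)

print(n_pairs, n_total, X_demo_interact.shape)
```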

Repeat the cross-validation procedure and observe the model performance now; that is, repeat Steps 9 and 10 (running cross_val_C_search_pipe on the training split and plotting the average training and testing ROC AUC). Note that this will take substantially more time because of the larger number of features, but it will probably take only a few minutes.
You will obtain the following output:
Figure 6.55: Improved cross-validation testing performance from adding interaction features

So, does the average cross-validation testing performance improve with the interaction features? Is regularization useful?

Engineering the interaction features increases the best model testing score to about ROC AUC = 0.74 on average across the folds, from about 0.72 without including interactions. These scores happen at C = 100, that is, with negligible regularization. On the plot of training versus testing scores for the model with interactions, you can see that the training score is a bit higher than the testing score, so it could be said that some amount of overfitting is going on. However, we cannot increase the testing score through regularization here, so this may not be a problematic instance of overfitting. In most cases, whatever strategy yields the highest testing score is the best strategy.
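One way to read the best C value off the cross-validation results is to average the testing scores across folds and take the maximum. The sketch below shows only the mechanics: the score array is filled with random placeholder numbers, not results from the case study, and we assume an array with one row per fold (four here) and one column per C value:
```python
import numpy as np

C_val_exponents = np.linspace(2, -3, 6)
C_vals = 10.0 ** C_val_exponents

# Placeholder scores with shape (n_folds, n_C_values);
# random numbers for illustration only, NOT case study results
rng = np.random.default_rng(0)
placeholder_test_roc_auc = rng.uniform(0.6, 0.8, size=(4, len(C_vals)))

# Average across folds, then pick the C value with the highest mean score
mean_test_roc_auc = np.mean(placeholder_test_roc_auc, axis=0)
best_index = int(np.argmax(mean_test_roc_auc))
best_C = C_vals[best_index]

print('Best C = {}, mean testing ROC AUC = {:.3f}'.format(
    best_C, mean_test_roc_auc[best_index]))
```
With your own results, you would pass cv_test_roc_auc in place of the placeholder array.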

We will reserve the step of fitting on all the training data for later, when we've tried other models in cross-validation to find the best model.
