Data Science Projects with Python

By: Stephen Klosterman

Overview of this book

Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools, by applying them to realistic data problems. You will learn how to use pandas and Matplotlib to critically examine datasets with summary statistics and graphs, and extract the insights you seek to derive. You will build your knowledge as you prepare data using the scikit-learn package and feed it to machine learning algorithms such as regularized logistic regression and random forest. You'll discover how to tune algorithms to provide the most accurate predictions on new and unseen data. As you progress, you'll gain insights into the workings and output of these algorithms, building your understanding of both the predictive capabilities of the models and why they make these predictions. By the end of this book, you will have the necessary skills to confidently use machine learning algorithms to perform detailed data analysis and extract meaningful insights from unstructured data.

Chapter 3: Details of Logistic Regression and Feature Exploration


Activity 3: Fitting a Logistic Regression Model and Directly Using the Coefficients

The first few steps are similar to things we've done in previous activities:

  1. Create a train/test split (80/20) with PAY_1 and LIMIT_BAL as features:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        df[['PAY_1', 'LIMIT_BAL']].values,
        df['default payment next month'].values,
        test_size=0.2, random_state=24)
  2. Import LogisticRegression and instantiate the model with the default options, but set the solver to 'liblinear':

    from sklearn.linear_model import LogisticRegression
    lr_model = LogisticRegression(solver='liblinear')
  3. Train on the training data and obtain predicted classes, as well as class probabilities, using the testing data:

    lr_model.fit(X_train, y_train)
    y_pred = lr_model.predict(X_test)
    y_pred_proba = lr_model.predict_proba(X_test)
  4. Pull out the coefficients and intercept from the trained model and manually calculate predicted probabilities. You'll need to add a column of 1s to your features, to multiply by the intercept.

    First, let's create the array of features, with a column of 1s added, using horizontal stacking:

    ones_and_features = np.hstack([np.ones((X_test.shape[0],1)), X_test])

    Now we need the intercept and coefficients, which we reshape and concatenate from scikit-learn output:

    intercept_and_coefs = np.concatenate([lr_model.intercept_.reshape(1,1), lr_model.coef_], axis=1)

    To multiply the intercept and coefficients by all the rows of ones_and_features, and take the sum of each row (that is, to find the linear combination), you could write this all out using multiplication and addition. However, it's much faster to use the dot product:

    X_lin_comb = np.dot(intercept_and_coefs, np.transpose(ones_and_features))

    Now X_lin_comb has the argument we need to pass to the sigmoid function we defined earlier in the chapter (a minimal sketch is included below for reference), in order to calculate predicted probabilities:

    y_pred_proba_manual = sigmoid(X_lin_comb)
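
    For reference, here is a minimal sketch of the sigmoid function referred to here, assuming NumPy has been imported as np; the exact definition used earlier in the chapter may differ slightly:

    def sigmoid(X):
        # Logistic sigmoid, applied element-wise to a NumPy array
        return 1 / (1 + np.exp(-X))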
  5. Using a threshold of 0.5, manually calculate predicted classes. Compare this to the class predictions output by scikit-learn.

    The manually predicted probabilities, y_pred_proba_manual, should be the same as y_pred_proba; we'll check that momentarily. First, manually predict the classes with the threshold:

    y_pred_manual = y_pred_proba_manual >= 0.5

    This array will have a different shape than y_pred, but it should contain the same values. We can check whether all the elements of two arrays are equal like this:

    Figure 6.52: Equality of NumPy arrays
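
    Along the lines of what the figure shows, one way to write this check uses np.array_equal; the reshape accounts for the differing shapes noted above:

    # Should evaluate to True if the manual and scikit-learn predictions agree
    np.array_equal(y_pred.reshape(1,-1), y_pred_manual)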

  6. Calculate ROC AUC using both scikit-learn's predicted probabilities, and your manually predicted probabilities, and compare.

    First, import the following:

    from sklearn.metrics import roc_auc_score

    Then, calculate this metric on both versions, taking care to access the correct column, or reshape as necessary:

    Figure 6.53: Calculating the ROC AUCs from predicted probabilities
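
    As a sketch of the calculation the figure refers to (the column indexing and reshape follow from the shapes produced above):

    # The probability of the positive class is in the second column of scikit-learn's output
    roc_auc_score(y_test, y_pred_proba[:,1])

    # The manual probabilities have shape (1, n), so flatten them first
    roc_auc_score(y_test, y_pred_proba_manual.reshape(-1))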

The AUCs are, in fact, the same. What have we done here? We've confirmed that all we really need from this fitted scikit-learn model are three numbers: the intercept and the two coefficients. Once we have these, we can create model predictions with a few lines of code, using mathematical functions that are equivalent to the predictions made directly by scikit-learn.

This is good to confirm your understanding, but otherwise, why would you ever want to do this? We'll talk about model deployment in the final chapter. However, depending on your circumstances, you may be in a situation where you don't have access to Python in the environment where new features will need to be input to the model for prediction. For example, you may need to make predictions entirely in SQL. While this is a limitation in general, with logistic regression you can use mathematical functions that are available in SQL to re-create the logistic regression prediction, only needing to copy and paste the intercept and coefficients somewhere in your SQL code. The dot product may not be available, but you can use multiplication and addition to accomplish the same purpose.
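
As an illustration, here is a minimal Python sketch of that approach, using only multiplication and addition on the copied intercept and coefficients; the new feature values below are hypothetical:

    # Copy the intercept and coefficients out of the fitted model
    intercept = lr_model.intercept_[0]
    coef_pay_1, coef_limit_bal = lr_model.coef_[0]

    # Hypothetical feature values for a single new account
    new_pay_1, new_limit_bal = 1, 50000

    # The linear combination uses only multiplication and addition,
    # just as it could be written in SQL
    lin_comb = intercept + coef_pay_1 * new_pay_1 + coef_limit_bal * new_limit_bal

    # Predicted probability of default via the sigmoid
    prob_default = 1 / (1 + np.exp(-lin_comb))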

Now, what about the results themselves? What we've seen here is that we can slightly boost model performance above our previous efforts: using just LIMIT_BAL as a feature in the previous chapter's activity, the ROC AUC was a bit lower, at 0.62, compared to 0.63 here. In the next chapter, we'll learn advanced techniques with logistic regression that we can use to boost performance further.