Data Science for Marketing Analytics

By Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Overview of this book

Data Science for Marketing Analytics covers every stage of data analytics, from working with a raw dataset to segmenting a population and modeling different parts of the population based on the segments. The book starts by teaching you how to use Python libraries, such as pandas and Matplotlib, to read data into Python, manipulate it, and create plots, using both categorical and continuous variables. Then, you'll learn how to segment a population into groups and use different clustering techniques to evaluate customer segmentation. As you make your way through the chapters, you'll explore ways to evaluate and select the best segmentation approach, and go on to create a linear regression model on customer value data to predict lifetime value. In the concluding chapters, you'll gain an understanding of regression techniques and tools for evaluating regression models, and explore ways to predict customer choice using classification algorithms. Finally, you'll apply these techniques to create a churn model for modeling customer product choices. By the end of this book, you will be able to build your own marketing reporting and interactive dashboard solutions.

Chapter 8: Fine-Tuning Classification Algorithms


Activity 15: Implementing Different Classification Algorithms

  1. Import the logistic regression classifier:

    from sklearn.linear_model import LogisticRegression
  2. Fit the model:

    clf_logistic = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train[top7_features], y_train)
    clf_logistic
  3. Score the model:

    clf_logistic.score(X_test[top7_features], y_test)
  4. Import the svm library:

    from sklearn import svm
  5. Fit the model:

    clf_svm=svm.SVC(kernel='linear', C=1)
    clf_svm.fit(X_train[top7_features],y_train)
  6. Score the model:

    clf_svm.score(X_test[top7_features], y_test)
  7. Import the decision tree library:

    from sklearn import tree
  8. Fit the model:

    clf_decision = tree.DecisionTreeClassifier()
    clf_decision.fit(X_train[top7_features],y_train)
  9. Score the model:

    clf_decision.score(X_test[top7_features], y_test)
  10. Import the random forest classifier:

    from sklearn.ensemble import RandomForestClassifier
  11. Fit the model:

    clf_random = RandomForestClassifier(n_estimators=20, max_depth=None,
                                        min_samples_split=7, random_state=0)
    clf_random.fit(X_train[top7_features], y_train)
  12. Score the model:

    clf_random.score(X_test[top7_features], y_test)

From the results, you can conclude that the random forest outperformed the other algorithms, while the decision tree had the lowest accuracy. In a later section, you will learn why accuracy alone is not the right way to measure a model's performance.
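
To get a first sense of why, it helps to check these accuracy scores against a majority-class baseline. The following is a minimal sketch, assuming the X_train, X_test, y_train, y_test, and top7_features variables from the steps above; DummyClassifier is scikit-learn's built-in baseline classifier:

    from sklearn.dummy import DummyClassifier

    # A baseline that always predicts the most frequent class in y_train
    clf_baseline = DummyClassifier(strategy='most_frequent')
    clf_baseline.fit(X_train[top7_features], y_train)

    # If churners are rare, even this "model" can post a high accuracy,
    # which is why accuracy alone can be misleading
    print('baseline accuracy:', clf_baseline.score(X_test[top7_features], y_test))

    # Compare each fitted classifier against the baseline
    for name, clf in [('logistic', clf_logistic), ('svm', clf_svm),
                      ('decision tree', clf_decision), ('random forest', clf_random)]:
        print(name, 'accuracy:', clf.score(X_test[top7_features], y_test))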

Activity 16: Tuning and Optimizing the Model

  1. Store five of the seven features, that is, Avg_Calls_Weekdays, Current_Bill_Amt, Avg_Calls, Account_Age, and Avg_Days_Delinquent, in a variable top5_features. Store the remaining two features, Percent_Increase_MOM and Complaint_Code, in a variable top2_features:

    from sklearn import preprocessing
    ## Features to transform
    top5_features=['Avg_Calls_Weekdays', 'Current_Bill_Amt', 'Avg_Calls', 'Account_Age','Avg_Days_Delinquent']
    ## Features Left
    top2_features=['Percent_Increase_MOM','Complaint_Code']
  2. Use StandardScaler to standardize the five features:

    scaler = preprocessing.StandardScaler().fit(X_train[top5_features])
    X_train_scalar=pd.DataFrame(scaler.transform(X_train[top5_features]),columns = X_train[top5_features].columns)
  3. Create a variable X_train_scalar_combined by combining the standardized five features with the two features (Percent_Increase_MOM and Complaint_Code) that were not standardized:

    X_train_scalar_combined=pd.concat([X_train_scalar, X_train[top2_features].reset_index(drop=True)], axis=1, sort=False)
  4. Apply the same scaler standardization to the test data and combine it into X_test_scalar_combined:

    X_test_scalar=pd.DataFrame(scaler.transform(X_test[top5_features]),columns = X_test[top5_features].columns)
    X_test_scalar_combined=pd.concat([X_test_scalar, X_test[top2_features].reset_index(drop=True)], axis=1, sort=False)
  5. Fit the random forest model.

    clf_random.fit(X_train_scalar_combined, y_train)
  6. Score the random forest model.

    clf_random.score(X_test_scalar_combined, y_test)
  7. Import the library for grid search and use the given parameters:

    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    parameters = [ {'min_samples_split': [4,5,7,9,10], 'n_estimators':[10,20,30,40,50,100,150,160,200,250,300],'max_depth': [2,5,7,10]}]
  8. Use GridSearchCV with stratified k-fold cross-validation to find the best parameters:

    clf_random_grid = GridSearchCV(RandomForestClassifier(), parameters, cv = StratifiedKFold(n_splits = 10))
    clf_random_grid.fit(X_train_scalar_combined, y_train)
  9. Print the best score and best parameters.

    print('best score train:', clf_random_grid.best_score_)
    print('best parameters train: ', clf_random_grid.best_params_)
  10. Score the model using the test data.

    clf_random_grid.score(X_test_scalar_combined, y_test)
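
GridSearchCV also records the cross-validated score of every parameter combination it tried, which is useful for seeing how sensitive the forest is to each parameter rather than only reading off the winner. A minimal sketch, assuming the fitted clf_random_grid from step 8 and pandas imported as pd:

    # Collect the cross-validation results into a DataFrame
    cv_results = pd.DataFrame(clf_random_grid.cv_results_)

    # Show the five best parameter combinations by mean validation score
    cols = ['param_n_estimators', 'param_max_depth', 'param_min_samples_split',
            'mean_test_score', 'std_test_score']
    print(cv_results.sort_values('mean_test_score', ascending=False)[cols].head())

    # best_estimator_ is the refitted model that clf_random_grid.score() uses
    print(clf_random_grid.best_estimator_)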

Activity 17: Comparison of the Models

  1. Import the required libraries.

    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    from sklearn import metrics
  2. Fit the random forest classifier with the parameters obtained from grid search.

    clf_random_grid = RandomForestClassifier(n_estimators=100, max_depth=7,
                                             min_samples_split=10, random_state=0)
    clf_random_grid.fit(X_train_scalar_combined, y_train)
  3. Predict on the standardized test data X_test_scalar_combined:

    y_pred=clf_random_grid.predict(X_test_scalar_combined)
  4. Print the classification report:

    target_names = ['No Churn', 'Churn']
    print(classification_report(y_test, y_pred, target_names=target_names))
  5. Plot the confusion matrix.

    import matplotlib.pyplot as plt  # plotting libraries, if not already imported earlier in the chapter
    import seaborn as sns
    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm,
                         index = ['No Churn','Churn'], 
                         columns = ['No Churn','Churn'])
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()
  6. Import the functions for the ROC curve and AUC:

    from sklearn.metrics import roc_curve, auc
  7. Use the classifiers that were created in the previous activity, that is, clf_logistic, clf_svm, clf_decision, and clf_random_grid. Create a list of dictionaries, one for each of these models:

    models = [
    {
        'label': 'Logistic Regression',
        'model': clf_logistic,
    },
    {
        'label': 'SVM',
        'model': clf_svm,
    },
    {
        'label': 'Decision Tree',
        'model': clf_decision,
    },
    {
        'label': 'Random Forest Grid Search',
        'model': clf_random_grid,
    }
    ]
  8. Plot the ROC curve.

    for m in models:
        model = m['model'] 
        model.fit(X_train_scalar_combined, y_train) 
        y_pred=model.predict(X_test_scalar_combined) 
        fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=1)
        roc_auc = metrics.auc(fpr, tpr)
        plt.plot(fpr, tpr, label='%s AUC = %0.2f' % (m['label'], roc_auc))
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.ylabel('Sensitivity (True Positive Rate)')
    plt.xlabel('1 - Specificity (False Positive Rate)')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
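
Note that the loop above builds each ROC curve from hard 0/1 predictions, which yields only a single operating point per model. A common alternative is to sweep over the models' scores instead: predict_proba for the probabilistic classifiers and decision_function for the linear SVM, which was not fitted with probability=True. A minimal sketch under that assumption (the resulting AUC values may differ slightly from those reported below):

    for m in models:
        model = m['model'].fit(X_train_scalar_combined, y_train)
        if hasattr(model, 'predict_proba'):
            # Probability of the positive (churn) class
            y_score = model.predict_proba(X_test_scalar_combined)[:, 1]
        else:
            # SVC without probability=True exposes a decision function instead
            y_score = model.decision_function(X_test_scalar_combined)
        fpr, tpr, thresholds = roc_curve(y_test, y_score, pos_label=1)
        print('%s AUC = %0.2f' % (m['label'], auc(fpr, tpr)))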

Comparing the AUC results of the different algorithms (logistic regression: 0.78, SVM: 0.79, decision tree: 0.77, and random forest: 0.82), we can conclude that random forest is the best-performing model, with an AUC score of 0.82, and can be chosen by the marketing team to predict customer churn.