Data Science for Marketing Analytics

By Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Overview of this book

Data Science for Marketing Analytics covers every stage of data analytics, from working with a raw dataset to segmenting a population and modeling different parts of the population based on the segments. The book starts by teaching you how to use Python libraries, such as pandas and Matplotlib, to read data into Python, manipulate it, and create plots using both categorical and continuous variables. Then, you'll learn how to segment a population into groups and use different clustering techniques to evaluate customer segmentation. As you make your way through the chapters, you'll explore ways to evaluate and select the best segmentation approach, and go on to create a linear regression model on customer value data to predict lifetime value. In the concluding chapters, you'll gain an understanding of regression techniques and tools for evaluating regression models, and explore ways to predict customer choice using classification algorithms. Finally, you'll apply these techniques to create a churn model for modeling customer product choices. By the end of this book, you will be able to build your own marketing reporting and interactive dashboard solutions.

Chapter 9: Modeling Customer Choice


Activity 18: Performing Multiclass Classification and Evaluating Performance

  1. Import pandas, numpy, RandomForestClassifier, train_test_split, classification_report, confusion_matrix, accuracy_score, metrics, seaborn, matplotlib, and precision_recall_fscore_support:

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
    from sklearn import metrics
    from sklearn.metrics import precision_recall_fscore_support
    import matplotlib.pyplot as plt
    import seaborn as sns
  2. Load the marketing data using pandas:

    data = pd.read_csv('MarketingData.csv')
    data.head(5)
  3. Check the shape of the data, check for missing values, and show its summary report:

    data.shape

    The shape should be (20000,7). Check for missing values:

    data.isnull().values.any()

    This will return False as there are no null values in the data. See the summary report of the data using the describe function:

    data.describe()
  4. Check the target variable, Channel, for the number of transactions for each of the channels:

    data['Channel'].value_counts()
  5. Split the data into training and testing sets:

    target = 'Channel'
    X = data.drop(['Channel'],axis=1)
    y=data[target]
    X_train, X_test, y_train, y_test = train_test_split(X.values,y,test_size=0.20, random_state=123, stratify=y)
  6. Fit a random forest classifier and store the model in the clf_random variable:

    clf_random = RandomForestClassifier(n_estimators=20, max_depth=None,
        min_samples_split=7, random_state=0)
    clf_random.fit(X_train,y_train)
  7. Predict on the test data and store the predictions in y_pred:

    y_pred=clf_random.predict(X_test)
  8. Find the macro- and micro-averaged precision, recall, and F1-scores:

    precision_recall_fscore_support(y_test, y_pred, average='macro')
    precision_recall_fscore_support(y_test, y_pred, average='micro')

    Both calls return approximately (0.891, 0.891, 0.891, None), that is, a precision, recall, and F1-score of about 0.891 under both macro- and micro-averaging. (A short sketch showing how these averages are derived appears after this activity.)

  9. Print the classification report:

    target_names = ["Retail","RoadShow","SocialMedia","Televison"]
    print(classification_report(y_test, y_pred,target_names=target_names))
  10. Plot the confusion matrix:

    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm,
                         index = target_names, 
                         columns = target_names)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()

From this activity, we can conclude that our random forest model was able to predict the most effective marketing channel from customers' annual spend data with an accuracy of about 89%.
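
For intuition about the two averages in step 8, here is a minimal sketch (not part of the original activity; the toy labels below are made up purely for illustration). The macro-average is the unweighted mean of the per-class scores, while the micro-average pools true and false positives across all classes, which for a single-label multiclass problem makes it equal to overall accuracy:

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support

    # Toy multiclass labels, made up purely for illustration
    y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
    y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 2])

    # average=None returns the per-class precision, recall, and F1 arrays
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=None)

    # Macro-average: the unweighted mean of the per-class scores
    print(p.mean(), r.mean(), f.mean())

    # Micro-average: computed from counts pooled across all classes;
    # for single-label multiclass data it equals overall accuracy
    print(precision_recall_fscore_support(y_true, y_pred, average='micro'))

That the macro- and micro-averages in step 8 come out nearly identical is consistent with the channels being roughly balanced in this dataset.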

Activity 19: Dealing with Imbalanced Data

  1. Import all the necessary libraries:

    from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import StandardScaler
    from collections import Counter
  2. Read the dataset into a pandas DataFrame named bank and look at the first few rows of the data:

    bank = pd.read_csv('bank.csv', sep = ';')
    bank.head()
  3. Rename the y column as Target:

    bank = bank.rename(columns={
                            'y': 'Target'
                            })
  4. Replace the no value with 0 and yes with 1:

    bank['Target']=bank['Target'].replace({'no': 0, 'yes': 1})
  5. Check the shape and missing values in the data:

    bank.shape
    bank.isnull().values.any()
  6. Use the describe function to check the continuous columns, and describe(include=['O']) to check the categorical ones:

    bank.describe()
    bank.describe(include=['O'])
  7. Check the count of the class labels present in the target variable:

    bank['Target'].value_counts()
  8. Use the cat.codes accessor to encode the job, marital, default, housing, loan, contact, and poutcome columns:

    bank["job"] = bank["job"].astype('category').cat.codes
    bank["marital"] = bank["marital"].astype('category').cat.codes
    bank["default"] = bank["job"].astype('category').cat.codes
    bank["housing"] = bank["marital"].astype('category').cat.codes
    bank["loan"] = bank["loan"].astype('category').cat.codes
    bank["contact"] = bank["contact"].astype('category').cat.codes
    bank["poutcome"] = bank["poutcome"].astype('category').cat.codes

    Since education and month are ordinal columns, convert them as follows:

    bank['education'] = bank['education'].replace({'primary': 0, 'secondary': 1, 'tertiary': 2})
    bank['month'].replace(['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], [1,2,3,4,5,6,7,8,9,10,11,12], inplace=True)
  9. Check the bank data after conversion:

    bank.head()
  10. Split the data into training and testing sets using train_test_split, as follows:

    target = 'Target'
    X = bank.drop(['Target'], axis=1)
    y=bank[target]
    
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15, random_state=123, stratify=y)
  11. Check the number of classes in y_train and y_test:

    print(sorted(Counter(y_train).items()))
    print(sorted(Counter(y_test).items()))
  12. Use StandardScaler to transform the X_train and X_test data, storing the fitted scaler in the standard_scalar variable and the results in the X_train_sc and X_test_sc variables:

    standard_scalar = StandardScaler()
    X_train_sc = standard_scalar.fit_transform(X_train)
    X_test_sc = standard_scalar.transform(X_test)
  13. Call the random forest classifier with parameters n_estimators=20, max_depth=None, min_samples_split=7, and random_state=0:

    clf_random = RandomForestClassifier(n_estimators=20, max_depth=None,
        min_samples_split=7, random_state=0)
  14. Fit the random forest model:

    clf_random.fit(X_train_sc,y_train)
  15. Predict on the test data using the random forest model:

    y_pred=clf_random.predict(X_test_sc)
  16. Get the classification report:

    target_names = ['No', 'Yes']
    print(classification_report(y_test, y_pred,target_names=target_names))
  17. Get the confusion matrix:

    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm,
                         index = ['No', 'Yes'],
                         columns = ['No', 'Yes'])
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()
  18. Use SMOTE() on X_train and y_train, and assign the resampled data to the X_resampled and y_resampled variables, respectively (a short sketch of what SMOTE does appears after this activity):

    X_resampled, y_resampled = SMOTE().fit_resample(X_train,y_train)
  19. Fit standard_scalar on X_resampled, and transform both X_resampled and X_test. Assign the results to the X_train_sc_resampled and X_test_sc variables:

    standard_scalar = StandardScaler()
    X_train_sc_resampled = standard_scalar.fit_transform(X_resampled)
    X_test_sc = standard_scalar.transform(X_test)
  20. Fit the random forest classifier on X_train_sc_resampled and y_resampled:

    clf_random.fit(X_train_sc_resampled,y_resampled)
  21. Predict on X_test_sc:

    y_pred=clf_random.predict(X_test_sc)
  22. Generate the classification report:

    target_names = ['No', 'Yes']
    print(classification_report(y_test, y_pred,target_names=target_names))
  23. Plot the confusion matrix:

    cm = confusion_matrix(y_test, y_pred) 
    
    cm_df = pd.DataFrame(cm,
                         index = ['No', 'Yes'], 
                         columns = ['No', 'Yes'])
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()

In this activity, our bank marketing data was highly imbalanced. We observed that, without using a sampling technique, the model's accuracy was around 90%, but the recall score for the Yes (term deposit) class and the macro-average score were only 32% and 65%, respectively. This implies that the model does not generalize well: most of the time, it misses potential customers who would subscribe to the term deposit.

On the other hand, when we used SMOTE, the model's accuracy was around 87%, but the recall score for the Yes (term deposit) class and the macro-average score rose to 61% and 76%, respectively. This implies that the model generalizes better and, more than 60% of the time, detects potential customers who would subscribe to the term deposit.
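
As a quick illustration of what the SMOTE() call in step 18 does, here is a minimal sketch on a synthetic imbalanced dataset (standing in for the bank data; the dataset size and parameters below are illustrative assumptions). SMOTE synthesizes new minority-class samples by interpolating between neighboring minority points, leaving the training set balanced:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Synthetic imbalanced data; the 9:1 split roughly mimics the bank data
    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.9, 0.1], random_state=0)
    print('Before:', sorted(Counter(y).items()))

    # SMOTE interpolates between minority-class neighbors to synthesize
    # new samples until both classes have equal counts
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print('After:', sorted(Counter(y_res).items()))

As in the activity, resampling is applied only to the training split; the test set keeps its original class distribution so that the evaluation remains realistic.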
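
To see concretely why plain accuracy was misleading before resampling, consider a minimal baseline sketch (again on synthetic, illustrative data rather than the bank dataset): a "model" that always predicts the majority class scores high accuracy while recalling none of the minority class, which is essentially the failure mode described above:

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score
    from sklearn.model_selection import train_test_split

    # Synthetic data with roughly the bank dataset's class imbalance
    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.88, 0.12], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0,
                                              stratify=y)

    # A baseline that always predicts the majority class
    baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
    y_pred = baseline.predict(X_te)

    # Accuracy is high, yet recall on the minority class is zero
    print('Accuracy:', accuracy_score(y_te, y_pred))
    print('Minority-class recall:', recall_score(y_te, y_pred))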