Data Science for Marketing Analytics

By: Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Overview of this book

Data Science for Marketing Analytics covers every stage of data analytics, from working with a raw dataset to segmenting a population and modeling different parts of the population based on the segments. The book starts by teaching you how to use Python libraries, such as pandas and Matplotlib, to read data into Python, manipulate it, and create plots, using both categorical and continuous variables. Then, you'll learn how to segment a population into groups and use different clustering techniques to evaluate customer segmentation. As you make your way through the chapters, you'll explore ways to evaluate and select the best segmentation approach, and go on to create a linear regression model on customer value data to predict lifetime value. In the concluding chapters, you'll gain an understanding of regression techniques and tools for evaluating regression models, and explore ways to predict customer choice using classification algorithms. Finally, you'll apply these techniques to create a churn model for modeling customer product choices. By the end of this book, you will be able to build your own marketing reporting and interactive dashboard solutions.

Chapter 1: Data Preparation and Cleaning


Activity 1: Addressing Data Spilling

  1. Import the pandas and copy modules, as follows:

    import pandas as pd
    import copy
  2. Use the read_excel function to read the sales.xlsx file and the head function to look at the first few rows:

    sales = pd.read_excel("sales.xlsx")
    sales.head()
  3. Look at the data types of sales to see whether they make sense:

    sales.dtypes

    You should get the following output:

    Figure 1.57: Looking at the data types of sales.xlsx

  4. We can iterate through the columns of the DataFrame and print the value each column holds in one of the first rows, to see how the data currently looks and how it should look:

    for label, content in sales.iteritems():  # iterates over columns
        print(label, content[1])

    This gives the following output:

    Figure 1.58: Iterating through the first row

  5. From the preceding output, you can infer that values have spilled into the adjacent columns, and that the real column names start with a capital letter. Add the following code to get an overview of what each row should actually look like:

    d = {"Year": 2004, "Product line": "Camping Equipment", "Product type":"Cooking Gear", "Product":"TrailChef Water Bag", 
         "Order method type":"Telephone", "Retailer Country":"Canada", "Revenue":13444.68, "Planned revenue":14313.48, 
         "Product cost":6298.8, "Quantity":2172, "Unit cost":2.9, "Unit price":6.59, "Gross Profit":7145.88, "Unit sale price":6.19}

    You should get the following output:

    Figure 1.59: Looking at how the data should be structured

  6. The Year column seems to be in its right place, so we can start by creating the Product line column. Let's see how many columns its values have leaked into:

    sales.groupby(['Product','line'])['Year'].count()

    Your output will be as follows:

    Figure 1.60: Seeing how values are distributed across the columns

  7. We are not sure whether the spillage is restricted to these two columns, so let's look at the next column as well:

    sales.groupby(['Product','line', 'Product.1'])['Year'].count()

    Your output will be as follows:

    Figure 1.61: Looking at data to get the number of columns required

  8. Let's reconstruct the Product line column and check the result:

    sales['Product line'] = sales.apply(lambda x: x['Product'] + ' ' + x['line'], axis = 1)
    sales = sales.drop(['Product', 'line'], axis = 1)
    sales.head()

    The DataFrame should now look as follows:

    Figure 1.62: Collecting data from multiple columns into the correct field
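
    A note on the design choice: the row-wise apply above works, but pandas can do the same join as a vectorized string concatenation, which is much faster on large frames. A minimal equivalent sketch:

    # Vectorized equivalent of the apply + lambda above
    sales['Product line'] = sales['Product'] + ' ' + sales['line']
    sales = sales.drop(['Product', 'line'], axis = 1)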

  9. Instead of changing the sales DataFrame directly, let's create a copy of it, called tmp, and make the changes there, so that we don't lose the original DataFrame while we finalize the filtering procedure (see the sketch after this step for why a deep copy matters). Let's continue with the next column of interest, that is, Product type:

    tmp = copy.deepcopy(sales)
    tmp.groupby(['Product.1','type', 'Product.2'])['Year'].count()

    This gives the following output:

    Figure 1.63: Seeing variation in number of words required to represent product type

    We can see that some of these fields are not ones we are interested in. Let's limit our view to only those we need.
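
    Why a deep copy? Below is a minimal sketch with toy data (not the sales dataset) showing that plain assignment merely aliases a DataFrame, whereas copy.deepcopy produces an independent object; pandas' own df.copy(deep=True) would work equally well here:

    import copy
    import pandas as pd

    original = pd.DataFrame({"a": [1, 2, 3]})

    alias = original                       # same underlying object
    alias.loc[0, "a"] = 99
    print(original.loc[0, "a"])            # 99 -- the "copy" changed the original

    independent = copy.deepcopy(original)  # fully independent copy
    independent.loc[1, "a"] = -1
    print(original.loc[1, "a"])            # 2 -- the original is untouched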

  10. We are only interested in the Climbing Accessories, Cooking Gear, First Aid, Golf Accessories, Insect Repellents, and Sleeping Bags product types, so let's filter for those. Using similar logic as before, we store the matching rows in a new DataFrame, tmp1:

    tmp1 = copy.deepcopy(tmp[tmp['Product.1'].isin(['Climbing', 'Cooking', 'First', 'Golf', 'Insect', 'Sleeping'])])
    tmp1.head()

    This gives the following output:

    Figure 1.64: Filtering out the categories we need

    Then perform the following groupby operations as well:

    tmp1.groupby(['Product.1'])['Year'].count()
    tmp1.groupby(['Product.1', 'type', 'Product.2'])['Year'].count()

    This gives the following output:

    Figure 1.65: Looking for the variation in the required product types categories

    We can see that, luckily, because of our choice of fields, we only have to join two columns to get the attribute we need.

  11. Now that we know that every Product type value in tmp1 consists of exactly two words, we can join the two columns:

    tmp1['Product type'] = tmp1['Product.1'] + ' ' + tmp1['type']
    tmp1 = tmp1.drop(['Product.1', 'type'], axis = 1)
    tmp1.head()

    This gives the following output:

    Figure 1.66: Joining data with these required categories

  12. The next column we have to worry about is Product. Let's see how many columns we need for that:

    tmp1.groupby(['Product.2', 'Order'])['Year'].count()

    This gives the following output:

    Figure 1.67: Looking at the variation in the product category

    This column has some values that are one word long and some that are longer.

  13. Let's create another variable, tmp2, for values containing more than one word:

    tmp2 = copy.deepcopy(tmp1[~tmp1['Order'].isin(['E-mail', 'Fax', 'Mail', 'Sales', 'Special', 'Telephone', 'Web'])])
    tmp2.head()

    This gives the following output:

    Figure 1.68: Filtering out observations with values containing more than one word

  14. We look at the variation in tmp2 and see that, while most of the values have two words, some have more than two:

    tmp2.groupby(['Product.2','Order', 'method'])['Year'].count()

    This gives the following output:

    Figure 1.69: Distribution in the next few fields

  15. We keep repeating this procedure, splitting off the values with more leakage into a new variable each time, until we have exhausted all the columns and obtained a structured dataset in the last step (a generic sketch of this split-and-join pattern follows this step):

    tmp8 = copy.deepcopy(tmp7[tmp7['Product.3'].isin(['Kingdom', 'States'])])
    tmp8.head()

    This gives the following output:

    Figure 1.70: Distribution after structuring the longest parts

    After structuring the fields with the longest names for Products, Order method type, and Retailer country, we can see that there is no spillage in the columns containing numerical values. Let's handle the remaining cases directly.
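
    Since the same split-and-join pattern recurs at every layer, it can help to see it as a pair of helpers. This is a sketch of our own (split_layer and join_words are not names from the book), assuming the word-fragment columns hold strings:

    import copy

    def split_layer(df, col, known_words):
        # Rows whose value in `col` is in `known_words` versus rows whose
        # text spilled further into later columns
        mask = df[col].isin(known_words)
        return copy.deepcopy(df[mask]), copy.deepcopy(df[~mask])

    def join_words(df, new_col, parts):
        # Join the word fragments in `parts` into one column, then drop them
        df[new_col] = df[parts[0]].str.cat([df[p] for p in parts[1:]], sep=' ')
        return df.drop(parts, axis = 1)

    With these helpers, steps 10 and 11 together are roughly:

    tmp1, rest = split_layer(tmp, 'Product.1',
                             ['Climbing', 'Cooking', 'First', 'Golf', 'Insect', 'Sleeping'])
    tmp1 = join_words(tmp1, 'Product type', ['Product.1', 'type'])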

  16. We finish structuring the last part of the data directly and store the final structured data separately:

    tmp8['Retailer country'] = tmp8['revenue'] + ' ' + tmp8['Product.3']
    tmp8 = tmp8.drop(['revenue', 'Product.3'], axis = 1)
    tmp8["Revenue"] = tmp8['cost']
    tmp8 = tmp8.drop(['cost'], axis = 1)
    tmp8["Planned revenue"] = tmp8['Quantity'] 
    tmp8 = tmp8.drop(['Quantity'], axis = 1)
    tmp8["Product cost"] = tmp8['Unit'] 
    tmp8 = tmp8.drop(['Unit'], axis = 1)
    tmp8["Quantity"] = tmp8['cost.1'] 
    tmp8 = tmp8.drop(['cost.1'], axis = 1)
    tmp8["Unit cost"] = tmp8['Unit.1']
    tmp8 = tmp8.drop(['Unit.1'], axis = 1)
    tmp8["Unit price"] = tmp8['price']
    tmp8 = tmp8.drop(['price'], axis = 1)
    tmp8["Gross profit"] = tmp8['Gross']
    tmp8 = tmp8.drop(['Gross'], axis = 1)
    tmp8["Unit sale price"] = tmp8['profit']
    tmp8 = tmp8.drop(['profit', 'Unit.2', 'sale', 'price.1'], axis = 1)
    tmp8.head()

    This gives the following output:

    Figure 1.71: Fully structuring the data for the longest data

    We store the structured dataset separately as str1:

    str1 = tmp8
  17. Once the last layer of the data is completely structured like this, the layer just before it becomes easier to structure: removing the rows of the finished layer reduces the current layer to a problem we have already solved:

    temp = copy.deepcopy(tmp7[~tmp7.index.isin(tmp8.index.values)])
    temp.head()

    You should get the following output:

    Figure 1.72: Structuring data with the second longest data fields
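
    The index-based anti-join used in this step is a generally useful pattern. A minimal sketch with toy data (the names are ours, not the book's):

    import pandas as pd

    layer = pd.DataFrame({'v': range(5)})
    finished = layer.iloc[[0, 2]]   # rows already structured (playing the role of tmp8)

    # Rows of `layer` that are NOT in `finished`, selected by index
    remaining = layer[~layer.index.isin(finished.index)]

    # An equivalent, slightly more idiomatic form
    assert remaining.equals(layer.drop(finished.index))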

  18. Now, we keep working backward, structuring each earlier layer with the help of the correctly structured layers that follow it:

    temp1['Retailer country'] = temp1['method']
    temp1 = temp1.drop(['method'], axis = 1)
    \temp1["Planned revenue"] = temp1['Retailer'] 
    temp1 = temp1.drop(['Retailer'], axis = 1)
    temp1["Product cost"] = temp1['country'] 
    temp1 = temp1.drop(['country'], axis = 1)
    temp1["Unit cost"] = temp1['Planned'] 
    temp1 = temp1.drop(['Planned'], axis = 1)
    temp1["Unit price"] = temp1['revenue'] 
    temp1 = temp1.drop(['revenue'], axis = 1)
    temp1["Gross profit"] = temp1['Product.3']
    temp1 = temp1.drop(['Product.3'], axis = 1)
    temp1["Unit sale price"] = temp1['cost']
    temp1 = temp1.drop(['cost'], axis = 1)
    temp1["Quantity"]  = temp1['Revenue']
    temp1 = temp1.drop(['Revenue'], axis = 1)
    temp1['Revenue'] = temp1['type.1']
    temp1 = temp1.drop('type.1', axis = 1)
    temp1 = temp1[['Year', 'Product line', 'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 
    'Planned revenue', 'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']]
    
    temp1.head()

    This gives the following output:

    Figure 1.73: Iteratively going backward while stabilizing the longest fields

  19. Finally, having stored each structured layer as str1 through str22 in the same way, we combine them all and restore the column order:

    df = pd.concat([str1, str2, str3, str4, str5, str6, str7, str8, str9, str10, str11, str12, str13, str14, 
                   str15, str16, str17, str18, str19, str20, str21, str22], sort = True)[['Year', 'Product line',
    'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 'Planned revenue',
    'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']]
    df.groupby('Product type').count()

    This gives the following output:

    Figure 1.74: Combining Data Sources
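
    As a final check (not part of the original solution), it is worth verifying that no rows were lost during all the splitting and re-joining, and converting the numeric columns back to numeric types, assuming the spillage left them stored as text:

    # Sanity check: every original row should appear exactly once in df
    assert len(df) == len(sales), "rows were lost or duplicated during restructuring"

    # Convert the formerly spilled numeric fields back to numeric dtypes
    numeric_cols = ['Revenue', 'Planned revenue', 'Product cost', 'Quantity',
                    'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    df.dtypes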