The ML life cycle in practice
As Jeff Daniels's character in HBO's The Newsroom once said, the first step in solving any problem is recognizing there is one. Let's follow this advice and see if it works for us.
In this section, we'll pick a problem statement and execute the ML life cycle step by step. Once that's done, we'll look back and identify any issues. The following diagram shows the different stages of ML:
Let's take a look at our problem statement.
Problem statement (plan and create)
For this exercise, let's assume that you own a retail business and would like to improve customer experience. First and foremost, you want to find your customer segments and customer lifetime value (LTV). If you have worked in the domain, you probably know different ways to solve this problem. I will follow a Medium blog series called Know Your Metrics – Learn what and how to track with Python by Barış Karaman (https://towardsdatascience.com/data-driven-growth-with-python-part-1-know-your-metrics-812781e66a5b). You can go through the article for more details, and feel free to try it out for yourself. The dataset is available here: https://www.kaggle.com/vijayuv/onlineretail.
Data (preparation and cleaning)
First, let's install the pandas package:
!pip install pandas
Let's make the dataset available to our notebook environment. To do that, download the dataset to your local system, then perform either of the following steps, depending on your setup:
- Local Jupyter: Copy the absolute path of the .csv file and give it as input to the pd.read_csv method.
- Google Colab: Upload the dataset by clicking on the folder icon and then the upload icon from the left navigation menu.
Let's preview the dataset:
import pandas as pd
retail_data = pd.read_csv('/content/OnlineRetail.csv',
encoding= 'unicode_escape')
retail_data.sample(5)
The output of the preceding code block is as follows:
As you can see, the dataset includes customer transaction data. The dataset consists of eight columns, apart from the index column, which is unlabeled:
- InvoiceNo: A unique order ID; the data is of the integer type
- StockCode: The unique ID of the product; the data is of the string type
- Description: The product's description; the data is of the string type
- Quantity: The number of units of the product that have been ordered
- InvoiceDate: The date when the invoice was generated
- UnitPrice: The cost of the product per unit
- CustomerID: The unique ID of the customer who ordered the product
- Country: The country where the product was ordered
Once you have the dataset, it is common to perform some exploratory analysis before jumping into feature engineering and model building. The idea is to check whether the dataset is sufficient to solve the problem, identify gaps such as missing values, check for correlations in the dataset, and more.
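As a minimal sketch of such checks, the snippet below runs on a tiny hand-made DataFrame. The rows are invented for illustration and only the column names mirror the dataset. It counts missing values per column and selects rows with a non-positive quantity, which may indicate returns or cancellations:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; only the column names mirror the real dataset.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379", "536380"],
    "Quantity": [6, 2, -1, 4],
    "UnitPrice": [2.55, 1.85, 27.50, 3.39],
    "CustomerID": [17850.0, 17850.0, 14527.0, np.nan],
})

# Count missing values in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Select rows with a non-positive quantity.
non_positive_qty = sample[sample["Quantity"] <= 0]
print(non_positive_qty)
```

In the notebook, the same two checks could be run on retail_data instead of the sample frame.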
For the exercise, we'll calculate the monthly revenue and look at its seasonality. The following code block extracts year and month (yyyymm) information from the InvoiceDate column, calculates the revenue property of each transaction by multiplying the UnitPrice and Quantity columns, and aggregates the revenue based on the extracted year-month (yyyymm) column.
Let's continue from the preceding code statement:
## Convert 'InvoiceDate' to the datetime type
retail_data['InvoiceDate'] = pd.to_datetime(
retail_data['InvoiceDate'], errors = 'coerce')
##Extract year and month information from 'InvoiceDate'
retail_data['yyyymm']=retail_data['InvoiceDate'].dt.strftime('%Y%m')
##Calculate revenue generated per order
retail_data['revenue'] = retail_data['UnitPrice'] * retail_data['Quantity']
## Calculate monthly revenue by aggregating the revenue on year month column
revenue_df = retail_data.groupby(['yyyymm'])['revenue'].sum().reset_index()
revenue_df.head()
The preceding code will output the following DataFrame:
Let's visualize the revenue DataFrame. I will be using a library called plotly. The following command will install plotly in your notebook environment:
!pip install plotly
Let's plot a bar graph from the revenue DataFrame with the yyyymm column on the x axis and revenue on the y axis:
import plotly.express as px
##Sort rows on year-month column
revenue_df.sort_values( by=['yyyymm'], inplace=True)
## Plot a bar graph with year-month on the x axis and revenue on the y axis; treat the x axis as categorical.
fig = px.bar(revenue_df, x="yyyymm", y="revenue",
title="Monthly Revenue")
fig.update_xaxes(type='category')
fig.show()
The preceding code sorts the revenue DataFrame on the yyyymm column and plots a bar graph of revenue against the year-month (yyyymm) column, as shown in the following screenshot. As you can see, September, October, and November are high-revenue months. It would have been good to validate this observation against a few years of data, but unfortunately, we don't have that. Before we move on to model development, let's look at one more metric, the monthly active customers, and see if it's correlated with monthly revenue:
Continuing in the same notebook, the following commands will calculate the monthly active customers by aggregating a count of unique CustomerID values on the year-month (yyyymm) column:
active_customer_df = retail_data.groupby(['yyyymm'])['CustomerID'].nunique().reset_index()
active_customer_df.columns = ['yyyymm',
'No of Active customers']
active_customer_df.head()
The preceding code will produce the following output:
Let's plot the preceding DataFrame in the same way that we did for monthly revenue:
## Plot a bar graph from the active customers DataFrame with yyyymm on the x axis and No of Active customers on the y axis.
fig = px.bar(active_customer_df, x="yyyymm",
y="No of Active customers",
title="Monthly Active customers")
fig.update_xaxes(type='category')
fig.show()
The preceding command plots a bar graph of No of Active customers against the year-month (yyyymm) column. As shown in the following screenshot, Monthly Active customers is positively related to the monthly revenue shown in the preceding screenshot:
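To put a number on this relationship rather than relying on the two charts, one option is to compute the Pearson correlation between the two monthly series. The sketch below uses invented monthly values, and the column name active_customers is a placeholder chosen for this example:

```python
import pandas as pd

# Hypothetical monthly aggregates; the values are invented for illustration.
monthly = pd.DataFrame({
    "yyyymm": ["201101", "201102", "201103", "201104"],
    "revenue": [100.0, 150.0, 200.0, 250.0],
    "active_customers": [10, 14, 21, 24],
})

# Pearson correlation between monthly revenue and monthly active customers.
pearson_r = monthly["revenue"].corr(monthly["active_customers"])
print(f"Pearson correlation: {pearson_r:.3f}")
```

In the notebook, a comparable frame could be built by merging revenue_df and active_customer_df on the yyyymm column.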
In the next section, we'll build a customer LTV model.
Model
Now that we have finished exploring the data, let's build the LTV model. Customer lifetime value (CLV or LTV) is defined as the net profitability associated with a customer's life cycle with the company. Simply put, LTV is a projection of what each customer is worth to a business (reference: https://www.toolbox.com/marketing/customer-experience/articles/what-is-customer-lifetime-value-clv/). There are different ways to predict lifetime value. One is to predict the value of a customer directly, which is a regression problem; another is to predict the customer's value group, which is a classification problem. In this exercise, we will use the latter approach.
For this exercise, we will segment customers into the following groups:
- Low LTV: Less active or low revenue customers
- Mid-LTV: Fairly active and moderate revenue customers
- High LTV: High revenue customers – the segment that we don't want to lose
We will use 3 months' worth of data to calculate the recency (R), frequency (F), and monetary (M) metrics of the customers and generate features from them. Once we have these features, we will use 6 months' worth of data to calculate the revenue of every customer and generate LTV cluster labels (low LTV, mid-LTV, and high LTV). The generated labels and features will then be used to train an XGBoost model that can predict the group of new customers.
Feature engineering
Let's continue our work in the same notebook, calculate the R, F, and M values for the customers, and group our customers based on a value that's been calculated from the individual R, F, and M scores:
- Recency (R): The recency metric represents how many days have passed since the customer made their last purchase.
- Frequency (F): As the term suggests, F represents how many times the customer made a purchase.
- Monetary (M): How much revenue a particular customer brought in.
Since the spending and purchase patterns of customers differ by geographic location, we will only consider the data that belongs to the United Kingdom for this exercise. Let's read the OnlineRetail.csv file and filter out the data that doesn't belong to the United Kingdom:
import pandas as pd
from datetime import datetime, timedelta, date
from sklearn.cluster import KMeans
## Read the data and keep only the rows where the country is the United Kingdom
retail_data = pd.read_csv('/content/OnlineRetail.csv',
encoding= 'unicode_escape')
retail_data['InvoiceDate'] = pd.to_datetime(
retail_data['InvoiceDate'], errors = 'coerce')
uk_data = retail_data.query("Country=='United Kingdom'").reset_index(drop=True)
In the following code block, we will create two different DataFrames. The first one (uk_data_3m) covers InvoiceDate values between 2011-03-01 and 2011-06-01 and will be used to generate the RFM features. The second one (uk_data_6m) covers InvoiceDate values between 2011-06-01 and 2011-12-01 and will be used to generate the target column for model training. In this exercise, the target column is the LTV group/cluster. Since we are calculating the customer LTV group, a larger time interval would give a better grouping. Hence, we will use 6 months' worth of data to generate the LTV group labels:
## Create the 3-month and 6-month DataFrames
t1 = pd.Timestamp("2011-06-01 00:00:00.054000")
t2 = pd.Timestamp("2011-03-01 00:00:00.054000")
t3 = pd.Timestamp("2011-12-01 00:00:00.054000")
uk_data_3m = uk_data[(uk_data.InvoiceDate < t1) & (uk_data.InvoiceDate >= t2)].reset_index(drop=True)
uk_data_6m = uk_data[(uk_data.InvoiceDate >= t1) & (uk_data.InvoiceDate < t3)].reset_index(drop=True)
Now that we have two different DataFrames, let's calculate the RFM values using the uk_data_3m DataFrame. The following code block calculates the revenue column by multiplying UnitPrice by Quantity. To calculate the RFM values, the code block performs three aggregations on CustomerID:
- To calculate R, max_date in the DataFrame must be calculated and, for every customer, we must calculate R = max_date - x.max(), where x.max() is the latest InvoiceDate of a specific CustomerID.
- To calculate F, count the number of invoices for a specific CustomerID.
- To calculate M, find the sum of revenue for a specific CustomerID.
The following code snippet performs this logic:
## Calculate RFM values.
uk_data_3m['revenue'] = uk_data_3m['UnitPrice'] * uk_data_3m['Quantity']
# Calculating the max invoice date in data (Adding additional day to avoid 0 recency value)
max_date = uk_data_3m['InvoiceDate'].max() + timedelta(days=1)
rfm_data = uk_data_3m.groupby(['CustomerID']).agg({
'InvoiceDate': lambda x: (max_date - x.max()).days,
'InvoiceNo': 'count',
'revenue': 'sum'})
rfm_data.rename(columns={'InvoiceDate': 'Recency',
'InvoiceNo': 'Frequency',
'revenue': 'MonetaryValue'},
inplace=True)
Here, we have calculated the R, F, and M values for the customers. Next, we need to divide customers into R, F, and M groups. This grouping defines where a customer stands relative to the other customers in terms of the R, F, and M metrics. To calculate the groups, we will divide the customers into equal-sized groups based on the R, F, and M values calculated in the previous code block. To achieve this, we will use the pd.qcut function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html). Alternatively, you can use any clustering method to divide customers into groups. We will add the R, F, and M group values together to generate a single value called RFMScore that ranges from 0 to 9.
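To make the quantile grouping concrete, here is a small sketch of pd.qcut on invented recency values. The labels run in reverse so that customers who purchased most recently (the smallest recency) land in the highest group:

```python
import pandas as pd

# Hypothetical recency values, in days, for eight customers.
recency = pd.Series([1, 5, 10, 20, 30, 45, 60, 90])

# Four equal-sized bins; labels are reversed so that the most recent
# customers (smallest recency) receive the highest group value.
r_group = pd.qcut(recency, q=4, labels=range(3, -1, -1)).astype(int)
print(r_group.tolist())
```

With eight distinct values and q=4, each group receives two customers. Frequency and monetary value would use ascending labels instead, since larger values are better there.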
In this exercise, the customers will be divided into four groups. The elbow method (https://towardsdatascience.com/clustering-metrics-better-than-the-elbow-method-6926e1f723a6) can be used to calculate the optimal number of groups for any dataset. The preceding link also contains information about alternative methods you can use to calculate the optimal number of groups, so feel free to try it out. I will leave that as an exercise for you.
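If you group customers with K-means instead of quantiles, the elbow method can be sketched as follows. This is an illustration on synthetic one-dimensional data with three well-separated groups, not on the retail dataset. It fits K-means for several values of k and records the inertia (the within-cluster sum of squared distances); the "elbow" is the point where adding more clusters stops reducing the inertia by much:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic one-dimensional data with three well-separated groups.
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(0, 1, 50),
    rng.normal(20, 1, 50),
    rng.normal(40, 1, 50),
]).reshape(-1, 1)

# Inertia (within-cluster sum of squared distances) for each candidate k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On this synthetic data, the inertia drops sharply up to k=3 and flattens afterward, which is the pattern to look for. Real data rarely shows such a clean bend, so treat the result as a guide rather than a rule.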
The following code block calculates RFMScore:
## Calculate RFM groups of customers
r_grp = pd.qcut(rfm_data['Recency'], q=4,
labels=range(3,-1,-1))
f_grp = pd.qcut(rfm_data['Frequency'], q=4,
labels=range(0,4))
m_grp = pd.qcut(rfm_data['MonetaryValue'], q=4,
labels=range(0,4))
rfm_data = rfm_data.assign(R=r_grp.values).assign(F=f_grp.values).assign(M=m_grp.values)
rfm_data['R'] = rfm_data['R'].astype(int)
rfm_data['F'] = rfm_data['F'].astype(int)
rfm_data['M'] = rfm_data['M'].astype(int)
rfm_data['RFMScore'] = rfm_data['R'] + rfm_data['F'] + rfm_data['M']
# Select multiple columns with a list; tuple-style selection fails on recent pandas versions
rfm_data.groupby('RFMScore')[['Recency','Frequency','MonetaryValue']].mean()
The preceding code will generate the following output:
This summary gives us a rough idea of how RFMScore relates to the Recency, Frequency, and MonetaryValue metrics. For example, the group with RFMScore=0 has the highest mean recency (the last purchase of this group lies farthest in the past), the lowest mean frequency, and the lowest mean monetary value. On the other hand, the group with RFMScore=9 has the lowest mean recency, the highest mean frequency, and the highest mean monetary value.
With that, we understand that RFMScore is positively related to the value a customer brings to the business. So, let's segment customers as follows:
- 0-3 => Low value
- 4-6 => Mid value
- 7-9 => High value
The following code labels customers as having either a low, mid, or high value:
# segment customers.
rfm_data['Segment'] = 'Low-Value'
rfm_data.loc[rfm_data['RFMScore']>3,'Segment'] = 'Mid-Value'
rfm_data.loc[rfm_data['RFMScore']>6,'Segment'] = 'High-Value'
rfm_data = rfm_data.reset_index()
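As an alternative sketch, the same 0-3, 4-6, and 7-9 ranges can be expressed with pd.cut, which keeps the boundaries in one place. The frame below is a toy example that covers every possible score:

```python
import pandas as pd

# Hypothetical RFM scores covering the full 0-9 range.
scores = pd.DataFrame({"RFMScore": list(range(10))})

# Bin edges are right-inclusive: (-1, 3], (3, 6], (6, 9].
scores["Segment"] = pd.cut(
    scores["RFMScore"],
    bins=[-1, 3, 6, 9],
    labels=["Low-Value", "Mid-Value", "High-Value"],
)
print(scores)
```

Note that pd.cut returns a categorical column, whereas the .loc assignments above produce plain strings; either works as input to pd.get_dummies.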
Customer LTV
Now that we have the RFM features ready for the customers in the 3-month DataFrame, let's use the 6 months' worth of data (uk_data_6m) to calculate the revenue of the customers, as we did previously, and merge the RFM features with the newly created revenue DataFrame:
# Calculate revenue using the six month dataframe.
uk_data_6m['revenue'] = uk_data_6m['UnitPrice'] * uk_data_6m['Quantity']
revenue_6m = uk_data_6m.groupby(['CustomerID']).agg({
'revenue': 'sum'})
revenue_6m.rename(columns={'revenue': 'Revenue_6m'},
inplace=True)
revenue_6m = revenue_6m.reset_index()
revenue_6m = revenue_6m.dropna()
# Merge the 6m revenue data frame with RFM data.
merged_data = pd.merge(rfm_data, revenue_6m, how="left")
# fillna returns a new DataFrame, so assign the result back
merged_data = merged_data.fillna(0)
Feel free to plot Revenue_6m against RFMScore. You will see a positive correlation between the two.
In the following code block, we use the Revenue_6m column, which serves as the lifetime value of a customer, to create three groups called Low LTV, Mid LTV, and High LTV using K-means clustering. Again, you can verify the optimal number of clusters using the elbow method mentioned previously:
# Create LTV cluster groups
merged_data = merged_data[merged_data['Revenue_6m']<merged_data['Revenue_6m'].quantile(0.99)]
kmeans = KMeans(n_clusters=3)
kmeans.fit(merged_data[['Revenue_6m']])
merged_data['LTVCluster'] = kmeans.predict(merged_data[['Revenue_6m']])
merged_data.groupby('LTVCluster')['Revenue_6m'].describe()
The preceding code block produces the following output:
K-means assigns cluster labels arbitrarily, so the label numbers can differ between runs. In the run shown here, the cluster with label 1 contains the customers whose lifetime value is very high: the mean revenue of the group is $14,123.309, though there are only 21 such customers. The cluster with label 0 contains the customers whose lifetime value is low: the mean revenue of the group is only $828.67, and there are 1,170 such customers. This grouping gives us an idea of which customers should always be kept happy.
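Because the label numbers carry no order, it can help to relabel the clusters so that 0 is the lowest-revenue group and the largest label is the highest. The helper below is a hypothetical sketch (order_clusters is not part of pandas or scikit-learn), shown on invented data:

```python
import pandas as pd

def order_clusters(df, cluster_col, value_col):
    """Relabel clusters 0..k-1 in ascending order of mean value_col."""
    means = df.groupby(cluster_col)[value_col].mean().sort_values()
    mapping = {old: new for new, old in enumerate(means.index)}
    result = df.copy()
    result[cluster_col] = result[cluster_col].map(mapping)
    return result

# Hypothetical clustering output with an arbitrary label order.
toy = pd.DataFrame({
    "Revenue_6m": [10.0, 20.0, 5000.0, 6000.0, 800.0, 900.0],
    "LTVCluster": [2, 2, 0, 0, 1, 1],
})
ordered = order_clusters(toy, "LTVCluster", "Revenue_6m")
print(ordered)
```

After relabeling, a higher LTVCluster value always means a higher mean revenue, which makes the per-class results easier to read later on.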
The feature set and model
Let's build an XGBoost model using the features we have calculated so far so that the model can predict the LTV group of the customers, given the input features. The following is the final feature set that will be used as input for the model:
feature_data = pd.get_dummies(merged_data)
feature_data.head(5)
The preceding code block produces the following DataFrame. This includes the feature set that will be used to train the model:
Now, let's use this feature set to train the XGBoost model. The prediction label (y) is the LTVCluster column; the rest of the dataset, except for the Revenue_6m and CustomerID columns, forms the X value. Revenue_6m is dropped from the feature set because the LTVCluster column (y) is calculated from it. For a new customer, we can calculate the other features without needing 6 months' worth of data and still predict their LTVCluster (y).
The following code will train the XGBoost model:
from sklearn.metrics import classification_report, confusion_matrix
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split
#Splitting data into train and test data set.
X = feature_data.drop(['CustomerID', 'LTVCluster',
'Revenue_6m'], axis=1)
y = feature_data['LTVCluster']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
xgb_classifier = xgb.XGBClassifier(max_depth=5, objective='multi:softprob')
xgb_model = xgb_classifier.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
print(classification_report(y_test, y_pred))
The preceding code block will output the following classification results:
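In addition to the classification report, a confusion matrix shows which classes get mistaken for one another. The training code imports confusion_matrix without using it, so here is a minimal sketch on invented labels. The mapping 0=low, 1=mid, 2=high is an assumption made for this example:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted cluster labels (assumed 0=low, 1=mid, 2=high).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 0]
y_hat = [0, 0, 0, 1, 1, 1, 0, 2, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])
print(cm)

# Recall per class: correct predictions divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)
```

In the notebook, the equivalent call would be confusion_matrix(y_test, y_pred). With very few customers in the high-value class, per-class recall tends to be more informative than overall accuracy.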
Now, let's assume that we are happy with the model and want to take it to the next level – that is, to production.
Package, release, and monitor
So far, we have spent a lot of time looking at data analysis, exploration, cleaning, and model building since that is what a data scientist should concentrate on. But once all that work has been done, can the model be deployed without any additional work? The answer is no. We are still far away from deployment. We must do the following things before we can deploy the model:
- We must create a scheduled data pipeline that performs data cleaning and feature engineering.
- We need a way to fetch features during prediction. If it's an online/transactional model, features must be fetched at low latency. For example, customers' R, F, and M values change frequently, so if we want to run two different campaigns for the mid-value and high-value segments on the website, we will need to score customers in near-real time.
- Find a way to reproduce the model using the historical data.
- Perform model packaging and versioning.
- Find a way to A/B test the model.
- Find a way to monitor model and data drift.
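To give a flavor of what the last item involves, here is one possible sketch of a data drift check using the Population Stability Index (PSI), a common way to compare a feature's distribution at training time with its distribution on newer data. This is an illustration under assumptions, not part of the exercise: the function name, the bin count, and the synthetic recency values are all made up for the example.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a newer sample of one feature."""
    # Bin edges come from the reference (training-time) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip new values into the reference range so none fall outside the bins.
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # A small floor avoids division by zero and the log of zero.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    diff = actual_pct - expected_pct
    return float(np.sum(diff * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
train_recency = rng.exponential(30, 5000)  # hypothetical training-time values
same_dist = rng.exponential(30, 5000)      # new data without drift
shifted = rng.exponential(60, 5000)        # new data with drift

psi_same = population_stability_index(train_recency, same_dist)
psi_shifted = population_stability_index(train_recency, shifted)
print(round(psi_same, 4), round(psi_shifted, 4))
```

A PSI near zero suggests the feature looks like it did at training time, while a larger value suggests the distribution has moved. A check like this would need to run on a schedule for every feature, which is part of the extra work listed above.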
As we don't have any of these ready, let's stop here, look back at what we have done, and see whether there is a better way to do it and whether there are any common oversights.
In the next section, we'll look at what we think we have built (ideal world) versus what we have built (real world).