Data Science for Marketing Analytics

By Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Overview of this book

Data Science for Marketing Analytics covers every stage of data analytics, from working with a raw dataset to segmenting a population and modeling different parts of the population based on the segments. The book starts by teaching you how to use Python libraries, such as pandas and Matplotlib, to read data into Python, manipulate it, and create plots using both categorical and continuous variables. Then, you'll learn how to segment a population into groups and use different clustering techniques to evaluate customer segmentation. As you make your way through the chapters, you'll explore ways to evaluate and select the best segmentation approach, and go on to create a linear regression model on customer value data to predict lifetime value. In the concluding chapters, you'll gain an understanding of regression techniques and tools for evaluating regression models, and explore ways to predict customer choice using classification algorithms. Finally, you'll apply these techniques to create a churn model for modeling customer product choices. By the end of this book, you will be able to build your own marketing reporting and interactive dashboard solutions.

Chapter 9: Modeling Customer Choice


Activity 18: Performing Multiclass Classification and Evaluating Performance

  1. Import pandas, numpy, RandomForestClassifier, train_test_split, classification_report, confusion_matrix, accuracy_score, metrics, seaborn, matplotlib, and precision_recall_fscore_support:

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
    from sklearn import metrics
    from sklearn.metrics import precision_recall_fscore_support
    import matplotlib.pyplot as plt
    import seaborn as sns
  2. Load the marketing data using pandas:

    data = pd.read_csv('MarketingData.csv')
    data.head(5)
  3. Check the shape of the data, check for missing values, and show its summary report:

    data.shape

    The shape should be (20000,7). Check for missing values:

    data.isnull().values.any()

    This will return False as there are no null values in the data. See the summary report of the data using the describe function:

    data.describe()
  4. Check the target variable, Channel, for the number of transactions for each of the channels:

    data['Channel'].value_counts()
  5. Split the data into training and testing sets:

    target = 'Channel'
    X = data.drop(['Channel'],axis=1)
    y=data[target]
    X_train, X_test, y_train, y_test = train_test_split(X.values,y,test_size=0.20, random_state=123, stratify=y)
  6. Fit a random forest classifier and store the model in the clf_random variable:

    clf_random = RandomForestClassifier(n_estimators=20, max_depth=None,
        min_samples_split=7, random_state=0)
    clf_random.fit(X_train,y_train)
  7. Predict on the test data and store the predictions in y_pred:

    y_pred=clf_random.predict(X_test)
  8. Find the macro- and micro-averaged precision, recall, and F1-scores:

    precision_recall_fscore_support(y_test, y_pred, average='macro')
    precision_recall_fscore_support(y_test, y_pred, average='micro')

    Both calls return approximately (0.891, 0.891, 0.891, None), that is, a precision, recall, and F1-score of about 0.891 under both macro- and micro-averaging. (A short sketch showing how these averages are derived appears after this activity.)

  9. Print the classification report:

    target_names = ["Retail","RoadShow","SocialMedia","Televison"]
    print(classification_report(y_test, y_pred,target_names=target_names))
  10. Plot the confusion matrix:

    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm,
                         index = target_names, 
                         columns = target_names)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()

From this activity, we can conclude that our random forest model was able to predict the most effective marketing channel from customers' annual spend data with an accuracy of about 89%.
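
For intuition about the two averages in step 8, here is a minimal sketch (not part of the original activity; the toy labels below are made up purely for illustration). The macro-average is the unweighted mean of the per-class scores, while the micro-average pools true and false positives across all classes, which for a single-label multiclass problem makes it equal to overall accuracy:

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support

    # Toy multiclass labels, made up purely for illustration
    y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
    y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 2])

    # average=None returns the per-class precision, recall, and F1 arrays
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=None)

    # Macro-average: the unweighted mean of the per-class scores
    print(p.mean(), r.mean(), f.mean())

    # Micro-average: computed from counts pooled across all classes;
    # for single-label multiclass data it equals overall accuracy
    print(precision_recall_fscore_support(y_true, y_pred, average='micro'))

That the macro- and micro-averages in step 8 come out nearly identical is consistent with the channels being roughly balanced in this dataset.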

Activity 19: Dealing with Imbalanced Data

  1. Import all the necessary libraries:

    from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import StandardScaler
    from collections import Counter
  2. Read the dataset into a pandas DataFrame named bank and look at the first few rows of the data:

    bank = pd.read_csv('bank.csv', sep = ';')
    bank.head()
  3. Rename the y column as Target:

    bank = bank.rename(columns={
                            'y': 'Target'
                            })
  4. Replace the no value with 0 and yes with 1:

    bank['Target']=bank['Target'].replace({'no': 0, 'yes': 1})
  5. Check the shape and missing values in the data:

    bank.shape
    bank.isnull().values.any()
  6. Use the describe function to check the continuous columns, and describe(include=['O']) to check the categorical ones:

    bank.describe()
    bank.describe(include=['O'])
  7. Check the count of the class labels present in the target variable:

    bank['Target'].value_counts()
  8. Use the cat.codes accessor to encode the job, marital, default, housing, loan, contact, and poutcome columns:

    bank["job"] = bank["job"].astype('category').cat.codes
    bank["marital"] = bank["marital"].astype('category').cat.codes
    bank["default"] = bank["job"].astype('category').cat.codes
    bank["housing"] = bank["marital"].astype('category').cat.codes
    bank["loan"] = bank["loan"].astype('category').cat.codes
    bank["contact"] = bank["contact"].astype('category').cat.codes
    bank["poutcome"] = bank["poutcome"].astype('category').cat.codes

    Since education and month are ordinal columns, convert them as follows:

    bank['education'] = bank['education'].replace({'primary': 0, 'secondary': 1, 'tertiary': 2})
    bank['month'].replace(['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], [1,2,3,4,5,6,7,8,9,10,11,12], inplace=True)
  9. Check the bank data after conversion:

    bank.head()
  10. Split the data into training and testing sets using train_test_split, as follows:

    target = 'Target'
    X = bank.drop(['Target'], axis=1)
    y=bank[target]
    
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15, random_state=123, stratify=y)
  11. Check the number of classes in y_train and y_test:

    print(sorted(Counter(y_train).items()))
    print(sorted(Counter(y_test).items()))
  12. Use StandardScaler to transform the X_train and X_test data, storing the fitted scaler in the standard_scalar variable and the results in the X_train_sc and X_test_sc variables:

    standard_scalar = StandardScaler()
    X_train_sc = standard_scalar.fit_transform(X_train)
    X_test_sc = standard_scalar.transform(X_test)
  13. Call the random forest classifier with parameters n_estimators=20, max_depth=None, min_samples_split=7, and random_state=0:

    clf_random = RandomForestClassifier(n_estimators=20, max_depth=None,
        min_samples_split=7, random_state=0)
  14. Fit the random forest model:

    clf_random.fit(X_train_sc,y_train)
  15. Predict on the test data using the random forest model:

    y_pred=clf_random.predict(X_test_sc)
  16. Get the classification report:

    target_names = ['No', 'Yes']
    print(classification_report(y_test, y_pred,target_names=target_names))
  17. Get the confusion matrix:

    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm,
                         index = ['No', 'Yes'],
                         columns = ['No', 'Yes'])
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()
  18. Use SMOTE() on X_train and y_train, and assign the resampled data to the X_resampled and y_resampled variables, respectively (a short sketch of what SMOTE does appears after this activity):

    X_resampled, y_resampled = SMOTE().fit_resample(X_train,y_train)
  19. Fit standard_scalar on X_resampled, and transform both X_resampled and X_test. Assign the results to the X_train_sc_resampled and X_test_sc variables:

    standard_scalar = StandardScaler()
    X_train_sc_resampled = standard_scalar.fit_transform(X_resampled)
    X_test_sc = standard_scalar.transform(X_test)
  20. Fit the random forest classifier on X_train_sc_resampled and y_resampled:

    clf_random.fit(X_train_sc_resampled,y_resampled)
  21. Predict on X_test_sc:

    y_pred=clf_random.predict(X_test_sc)
  22. Generate the classification report:

    target_names = ['No', 'Yes']
    print(classification_report(y_test, y_pred,target_names=target_names))
  23. Plot the confusion matrix:

    cm = confusion_matrix(y_test, y_pred) 
    
    cm_df = pd.DataFrame(cm,
                         index = ['No', 'Yes'], 
                         columns = ['No', 'Yes'])
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()

In this activity, our bank marketing data was highly imbalanced. We observed that, without using a sampling technique, the model's accuracy was around 90%, but the recall score for the Yes (term deposit) class and the macro-average score were only 32% and 65%, respectively. This implies that the model does not generalize well: most of the time, it misses potential customers who would subscribe to the term deposit.

On the other hand, when we used SMOTE, the model's accuracy was around 87%, but the recall score for the Yes (term deposit) class and the macro-average score rose to 61% and 76%, respectively. This implies that the model generalizes better and, more than 60% of the time, detects potential customers who would subscribe to the term deposit.
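
As a quick illustration of what the SMOTE() call in step 18 does, here is a minimal sketch on a synthetic imbalanced dataset (standing in for the bank data; the dataset size and parameters below are illustrative assumptions). SMOTE synthesizes new minority-class samples by interpolating between neighboring minority points, leaving the training set balanced:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Synthetic imbalanced data; the 9:1 split roughly mimics the bank data
    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.9, 0.1], random_state=0)
    print('Before:', sorted(Counter(y).items()))

    # SMOTE interpolates between minority-class neighbors to synthesize
    # new samples until both classes have equal counts
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print('After:', sorted(Counter(y_res).items()))

As in the activity, resampling is applied only to the training split; the test set keeps its original class distribution so that the evaluation remains realistic.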
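
To see concretely why plain accuracy was misleading before resampling, consider a minimal baseline sketch (again on synthetic, illustrative data rather than the bank dataset): a "model" that always predicts the majority class scores high accuracy while recalling none of the minority class, which is essentially the failure mode described above:

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score
    from sklearn.model_selection import train_test_split

    # Synthetic data with roughly the bank dataset's class imbalance
    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.88, 0.12], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0,
                                              stratify=y)

    # A baseline that always predicts the majority class
    baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
    y_pred = baseline.predict(X_te)

    # Accuracy is high, yet recall on the minority class is zero
    print('Accuracy:', accuracy_score(y_te, y_pred))
    print('Minority-class recall:', recall_score(y_te, y_pred))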