Data Science for Marketing Analytics

By Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Overview of this book

Data Science for Marketing Analytics covers every stage of data analytics, from working with a raw dataset to segmenting a population and modeling different parts of the population based on the segments. The book starts by teaching you how to use Python libraries, such as pandas and Matplotlib, to read data into Python, manipulate it, and create plots, using both categorical and continuous variables. Then, you'll learn how to segment a population into groups and use different clustering techniques to evaluate customer segmentation. As you make your way through the chapters, you'll explore ways to evaluate and select the best segmentation approach, and go on to create a linear regression model on customer value data to predict lifetime value. In the concluding chapters, you'll gain an understanding of regression techniques and tools for evaluating regression models, and explore ways to predict customer choice using classification algorithms. Finally, you'll apply these techniques to create a churn model for modeling customer product choices. By the end of this book, you will be able to build your own marketing reporting and interactive dashboard solutions.

Chapter 8: Fine-Tuning Classification Algorithms


Activity 15: Implementing Different Classification Algorithms

  1. Import the logistic regression classifier:

    from sklearn.linear_model import LogisticRegression
  2. Fit the model:

    clf_logistic = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train[top7_features], y_train)
    clf_logistic
  3. Score the model:

    clf_logistic.score(X_test[top7_features], y_test)
  4. Import the svm library:

    from sklearn import svm
  5. Fit the model:

    clf_svm=svm.SVC(kernel='linear', C=1)
    clf_svm.fit(X_train[top7_features],y_train)
  6. Score the model:

    clf_svm.score(X_test[top7_features], y_test)
  7. Import the decision tree library:

    from sklearn import tree
  8. Fit the model:

    clf_decision = tree.DecisionTreeClassifier()
    clf_decision.fit(X_train[top7_features],y_train)
  9. Score the model:

    clf_decision.score(X_test[top7_features], y_test)
  10. Import the random forest classifier:

    from sklearn.ensemble import RandomForestClassifier
  11. Fit the model:

    clf_random = RandomForestClassifier(n_estimators=20, max_depth=None,
                                        min_samples_split=7, random_state=0)
    clf_random.fit(X_train[top7_features], y_train)
  12. Score the model:

    clf_random.score(X_test[top7_features], y_test)

From the results, you can conclude that the random forest outperformed the other algorithms, while the decision tree had the lowest accuracy. In a later section, you will learn why accuracy alone is not the right way to measure a model's performance.
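
To get a first sense of why, it helps to check these accuracy scores against a majority-class baseline. The following is a minimal sketch, assuming the X_train, X_test, y_train, y_test, and top7_features variables from the steps above; DummyClassifier is scikit-learn's built-in baseline classifier:

    from sklearn.dummy import DummyClassifier

    # A baseline that always predicts the most frequent class in y_train
    clf_baseline = DummyClassifier(strategy='most_frequent')
    clf_baseline.fit(X_train[top7_features], y_train)

    # If churners are rare, even this "model" can post a high accuracy,
    # which is why accuracy alone can be misleading
    print('baseline accuracy:', clf_baseline.score(X_test[top7_features], y_test))

    # Compare each fitted classifier against the baseline
    for name, clf in [('logistic', clf_logistic), ('svm', clf_svm),
                      ('decision tree', clf_decision), ('random forest', clf_random)]:
        print(name, 'accuracy:', clf.score(X_test[top7_features], y_test))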

Activity 16: Tuning and Optimizing the Model

  1. Store five of the seven features, that is, Avg_Calls_Weekdays, Current_Bill_Amt, Avg_Calls, Account_Age, and Avg_Days_Delinquent, in a variable top5_features. Store the remaining two features, Percent_Increase_MOM and Complaint_Code, in a variable top2_features:

    from sklearn import preprocessing
    ## Features to transform
    top5_features=['Avg_Calls_Weekdays', 'Current_Bill_Amt', 'Avg_Calls', 'Account_Age','Avg_Days_Delinquent']
    ## Features Left
    top2_features=['Percent_Increase_MOM','Complaint_Code']
  2. Use StandardScaler to standardize the five features:

    scaler = preprocessing.StandardScaler().fit(X_train[top5_features])
    X_train_scalar=pd.DataFrame(scaler.transform(X_train[top5_features]),columns = X_train[top5_features].columns)
  3. Create a variable X_train_scalar_combined by combining the standardized five features with the two features (Percent_Increase_MOM and Complaint_Code) that were not standardized:

    X_train_scalar_combined=pd.concat([X_train_scalar, X_train[top2_features].reset_index(drop=True)], axis=1, sort=False)
  4. Apply the same scaler standardization to the test data and combine it into X_test_scalar_combined:

    X_test_scalar=pd.DataFrame(scaler.transform(X_test[top5_features]),columns = X_test[top5_features].columns)
    X_test_scalar_combined=pd.concat([X_test_scalar, X_test[top2_features].reset_index(drop=True)], axis=1, sort=False)
  5. Fit the random forest model.

    clf_random.fit(X_train_scalar_combined, y_train)
  6. Score the random forest model.

    clf_random.score(X_test_scalar_combined, y_test)
  7. Import the library for grid search and use the given parameters:

    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    parameters = [ {'min_samples_split': [4,5,7,9,10], 'n_estimators':[10,20,30,40,50,100,150,160,200,250,300],'max_depth': [2,5,7,10]}]
  8. Use GridSearchCV with stratified k-fold cross-validation to find the best parameters:

    clf_random_grid = GridSearchCV(RandomForestClassifier(), parameters, cv = StratifiedKFold(n_splits = 10))
    clf_random_grid.fit(X_train_scalar_combined, y_train)
  9. Print the best score and best parameters.

    print('best score train:', clf_random_grid.best_score_)
    print('best parameters train: ', clf_random_grid.best_params_)
  10. Score the model using the test data.

    clf_random_grid.score(X_test_scalar_combined, y_test)
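
GridSearchCV also records the cross-validated score of every parameter combination it tried, which is useful for seeing how sensitive the forest is to each parameter rather than only reading off the winner. A minimal sketch, assuming the fitted clf_random_grid from step 8 and pandas imported as pd:

    # Collect the cross-validation results into a DataFrame
    cv_results = pd.DataFrame(clf_random_grid.cv_results_)

    # Show the five best parameter combinations by mean validation score
    cols = ['param_n_estimators', 'param_max_depth', 'param_min_samples_split',
            'mean_test_score', 'std_test_score']
    print(cv_results.sort_values('mean_test_score', ascending=False)[cols].head())

    # best_estimator_ is the refitted model that clf_random_grid.score() uses
    print(clf_random_grid.best_estimator_)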

Activity 17: Comparison of the Models

  1. Import the required libraries.

    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    from sklearn import metrics
  2. Fit the random forest classifier with the parameters obtained from grid search.

    clf_random_grid = RandomForestClassifier(n_estimators=100, max_depth=7,
                                             min_samples_split=10, random_state=0)
    clf_random_grid.fit(X_train_scalar_combined, y_train)
  3. Predict on the standardized test data X_test_scalar_combined:

    y_pred=clf_random_grid.predict(X_test_scalar_combined)
  4. Print the classification report:

    target_names = ['No Churn', 'Churn']
    print(classification_report(y_test, y_pred, target_names=target_names))
  5. Plot the confusion matrix.

    import matplotlib.pyplot as plt  # plotting libraries, if not already imported earlier in the chapter
    import seaborn as sns
    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm,
                         index = ['No Churn','Churn'], 
                         columns = ['No Churn','Churn'])
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()
  6. Import the functions for the ROC curve and AUC:

    from sklearn.metrics import roc_curve, auc
  7. Use the classifiers that were created in the previous activity, that is, clf_logistic, clf_svm, clf_decision, and clf_random_grid. Create a list of dictionaries, one for each of these models:

    models = [
    {
        'label': 'Logistic Regression',
        'model': clf_logistic,
    },
    {
        'label': 'SVM',
        'model': clf_svm,
    },
    {
        'label': 'Decision Tree',
        'model': clf_decision,
    },
    {
        'label': 'Random Forest Grid Search',
        'model': clf_random_grid,
    }
    ]
  8. Plot the ROC curve.

    for m in models:
        model = m['model'] 
        model.fit(X_train_scalar_combined, y_train) 
        y_pred=model.predict(X_test_scalar_combined) 
        fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=1)
        roc_auc = metrics.auc(fpr, tpr)
        plt.plot(fpr, tpr, label='%s AUC = %0.2f' % (m['label'], roc_auc))
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.ylabel('Sensitivity (True Positive Rate)')
    plt.xlabel('1 - Specificity (False Positive Rate)')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
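
Note that the loop above builds each ROC curve from hard 0/1 predictions, which yields only a single operating point per model. A common alternative is to sweep over the models' scores instead: predict_proba for the probabilistic classifiers and decision_function for the linear SVM, which was not fitted with probability=True. A minimal sketch under that assumption (the resulting AUC values may differ slightly from those reported below):

    for m in models:
        model = m['model'].fit(X_train_scalar_combined, y_train)
        if hasattr(model, 'predict_proba'):
            # Probability of the positive (churn) class
            y_score = model.predict_proba(X_test_scalar_combined)[:, 1]
        else:
            # SVC without probability=True exposes a decision function instead
            y_score = model.decision_function(X_test_scalar_combined)
        fpr, tpr, thresholds = roc_curve(y_test, y_score, pos_label=1)
        print('%s AUC = %0.2f' % (m['label'], auc(fpr, tpr)))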

Comparing the AUC results of the different algorithms (logistic regression: 0.78, SVM: 0.79, decision tree: 0.77, and random forest: 0.82), we can conclude that random forest is the best-performing model, with an AUC score of 0.82, and can be chosen by the marketing team to predict customer churn.