Data Science for Marketing Analytics

Data Science for Marketing Analytics

By : Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Buy this Book

Data Science for Marketing Analytics

By: Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Buy this Book

Overview of this book

Data Science for Marketing Analytics covers every stage of data analytics, from working with a raw dataset to segmenting a population and modeling different parts of the population based on the segments. The book starts by teaching you how to use Python libraries, such as pandas and Matplotlib, to read data from Python, manipulate it, and create plots, using both categorical and continuous variables. Then, you'll learn how to segment a population into groups and use different clustering techniques to evaluate customer segmentation. As you make your way through the chapters, you'll explore ways to evaluate and select the best segmentation approach, and go on to create a linear regression model on customer value data to predict lifetime value. In the concluding chapters, you'll gain an understanding of regression techniques and tools for evaluating regression models, and explore ways to predict customer choice using classification algorithms. Finally, you'll apply these techniques to create a churn model for modeling customer product choices. By the end of this book, you will be able to build your own marketing reporting and interactive dashboard solutions.

Data Science for Marketing Analytics

Preface

Free Chapter

Data Preparation and Cleaning

Introduction

Data Models and Structured Data

pandas

Data Manipulation

Summary

Data Exploration and Visualization

Introduction

Identifying the Right Attributes

Generating Targeted Insights

Visualizing Data

Summary

Unsupervised Learning: Customer Segmentation

Introduction

Customer Segmentation Methods

Similarity and Data Standardization

k-means Clustering

Summary

Choosing the Best Segmentation Approach

Introduction

Choosing the Number of Clusters

Different Methods of Clustering

Evaluating Clustering

Summary

Predicting Customer Revenue Using Linear Regression

Introduction

Understanding Regression

Feature Engineering for Regression

Performing and Interpreting Linear Regression

Summary

Other Regression Techniques and Tools for Evaluation

Introduction

Evaluating the Accuracy of a Regression Model

Using Regularization for Feature Selection

Tree-Based Regression Models

Summary

Supervised Learning: Predicting Customer Churn

Introduction

Classification Problems

Understanding Logistic Regression

Creating a Data Science Pipeline

Modeling the Data

Summary

Fine-Tuning Classification Algorithms

Introduction

Support Vector Machines

Decision Trees

Random Forest

Preprocessing Data for Machine Learning Models

Model Evaluation

Performance Metrics

Summary

Modeling Customer Choice

Introduction

Understanding Multiclass Classification

Class Imbalanced Data

Summary

Appendix

Chapter 1: Data Preparation and Cleaning

Chapter 2: Data Exploration and Visualization

Chapter 3: Unsupervised Learning: Customer Segmentation

Chapter 4: Choosing the Best Segmentation Approach

Chapter 5: Predicting Customer Revenue Using Linear Regression

Chapter 6: Other Regression Techniques and Tools for Evaluation

Chapter 7: Supervised Learning: Predicting Customer Churn

Chapter 8: Fine-Tuning Classification Algorithms

Chapter 9: Modeling Customer Choice

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Data Models and Structured Data

When we build an analytics pipeline, the first thing that we need to do is to build a data model. A data model is an overview of the data sources that we will be using, their relationships with other data sources, where exactly the data from a specific source is going to enter the pipeline, and in what form (such as an Excel file, a database, or a JSON from an internet source). The data model for the pipeline evolves over time as data sources and processes change. A data model can contain data of the following three types:

Structured Data: This is also known as completely structured or well-structured data. This is the simplest way to manage information. The data is arranged in a flat tabular form with the correct value corresponding to the correct attribute. There is a unique column, known as an index, for easy and quick access to the data, and there are no duplicate columns. Data can be queried exactly through SQL queries, for example, data in relational databases, MySQL, Amazon Redshift, and so on.
Semi-structured data: This refers to data that may be of variable lengths and that may contain different data types (such as numerical or categorical) in the same column. Such data may be arranged in a nested or hierarchical tabular structure, but it still follows a fixed schema. There are no duplicate columns (attributes), but there may be duplicate rows (observations). Also, each row might not contain values for every attribute, that is, there may be missing values. Semi-structured data can be stored accurately in NoSQL databases, Apache Parquet files, JSON files, and so on.
Unstructured data: Data that is unstructured may not be tabular, and even if it is tabular, the number of attributes or columns per observation may be completely arbitrary. The same data could be represented in different ways, and the attributes might not match each other, with values leaking into other parts. Unstructured data can be stored as text files, CSV files, Excel files, images, audio clips, and so on.

Marketing data, traditionally, comprises data of all three types. Initially, most data points originated from different (possibly manual) data sources, so the values for a field could be of different lengths, the value for one field would not match that of other fields because of different field names, some rows containing data from even the same sources could also have missing values for some of the fields, and so on. But now, because of digitization, structured and semi-structured data is also available and is increasingly being used to perform analytics. The following figure illustrates the data model of traditional marketing analytics comprising all kinds of data: structured data such as databases (top), semi-structured data such as JSONs (middle), and unstructured data such as Excel files (bottom):

Figure 1.1: Data model of traditional marketing analytics

A data model with all these different kinds of data is prone to errors and is very risky to use. If we somehow get a garbage value into one of the attributes, our entire analysis will go awry. Most of the times, the data we need is of a certain kind and if we don't get that type of data, we might run into a bug or problem that would need to be investigated. Therefore, if we can enforce some checks to ensure that the data being passed to our model is almost always of the same kind, we can easily improve the quality of data from unstructured to at least semi-structured.

This is where programming languages such as Python come into play. Python is an all-purpose general programming language that not only makes writing structure-enforcing scripts easy, but also integrates with almost every platform and automates data production, analysis, and analytics into a more reliable and predictable pipeline. Apart from understanding patterns and giving at least a basic structure to data, Python forces intelligent pipelines to accept the right value for the right attribute. The majority of analytics pipelines are exactly of this kind. The following figure illustrates how most marketing analytics today structure different kinds of data by passing it through scripts to make it at least semi-structured:

Figure 1.2: Data model of most marketing analytics that use Python

By making use of such structure-enforcing scripts, we will have a pipeline of semi-structured data coming in with expected values in the right fields; however, the data is not yet in the best possible format to perform analytics. If we can completely structure our data (that is, arrange it in flat tables, with the right value pointing to the right attribute with no nesting or hierarchy), it will be easy for us to see how every data point individually compares to other points being considered in the common fields, and would also make the pipeline scalable. We can easily get a feel of the data—that is, see in what range most values lie, identify the clear outliers, and so on—by simply scrolling through the data.

While there are a lot of tools that can be used to convert data from an unstructured/semi-structured format to a fully structured format (for example, Spark, STATA, and SAS), the tool that is most commonly used for data science, can be integrated with practically any framework, has rich functionalities, minimal costs, and is easy-to-use for our use case, is pandas. The following figure illustrates how a data model structures different kinds of data from being possibly unstructured to semi-structured (using Python), to completely structured (using pandas):

Figure 1.3: Data model to structure the different kinds of data

Note

For the purpose of this book, we will assume that you are more or less comfortable with NumPy.

Data Science for Marketing Analytics

By : Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Data Science for Marketing Analytics

By: Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Overview of this book

Related Content you might be interested in

Current Title:

Data Science for Marketing Analytics

The Data Science Workshop

The Data Science Workshop

Applied Supervised Learning with Python

Data Models and Structured Data

Note