Data Science for Marketing Analytics

By: Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Overview of this book

Data Science for Marketing Analytics covers every stage of data analytics, from working with a raw dataset to segmenting a population and modeling different parts of the population based on the segments. The book starts by teaching you how to use Python libraries, such as pandas and Matplotlib, to read data into Python, manipulate it, and create plots, using both categorical and continuous variables. Then, you'll learn how to segment a population into groups and use different clustering techniques to evaluate customer segmentation. As you make your way through the chapters, you'll explore ways to evaluate and select the best segmentation approach, and go on to create a linear regression model on customer value data to predict lifetime value. In the concluding chapters, you'll gain an understanding of regression techniques and tools for evaluating regression models, and explore ways to predict customer choice using classification algorithms. Finally, you'll apply these techniques to create a churn model for modeling customer product choices. By the end of this book, you will be able to build your own marketing reporting and interactive dashboard solutions.

Chapter 1: Data Preparation and Cleaning


Activity 1: Addressing Data Spilling

  1. Import the pandas and copy modules, as follows:

    import pandas as pd
    import copy
  2. Use the read_excel function to read the sales.xlsx file and the head function to look at the first few rows:

    sales = pd.read_excel("sales.xlsx")
    sales.head()
  3. Look at the data types of sales to see whether they make sense:

    sales.dtypes

    You should get the following output:

    Figure 1.57: Looking at the data types of sales.xlsx

  4. We can iterate through the columns of the DataFrame and print the value each column holds in one of the first rows, to see how the data currently looks and how it should look:

    for label, content in sales.iteritems():  # iterates over columns
        print(label, content[1])

    This gives the following output:

    Figure 1.58: Iterating through the first row

  5. From the preceding output, you can infer that values have spilled into the adjacent columns, and that the real column names start with a capital letter. Add the following code to get an overview of what each row should actually look like:

    d = {"Year": 2004, "Product line": "Camping Equipment", "Product type":"Cooking Gear", "Product":"TrailChef Water Bag", 
         "Order method type":"Telephone", "Retailer Country":"Canada", "Revenue":13444.68, "Planned revenue":14313.48, 
         "Product cost":6298.8, "Quantity":2172, "Unit cost":2.9, "Unit price":6.59, "Gross Profit":7145.88, "Unit sale price":6.19}

    You should get the following output:

    Figure 1.59: Looking at how the data should be structured

  6. The Year column seems to be in its right place, so we can start by creating the Product line column. Let's see how many columns its values have leaked into:

    sales.groupby(['Product','line'])['Year'].count()

    Your output will be as follows:

    Figure 1.60: Seeing how values are distributed across the columns

  7. We are not sure whether the spillage is restricted to these two columns, so let's look at the next column as well:

    sales.groupby(['Product','line', 'Product.1'])['Year'].count()

    Your output will be as follows:

    Figure 1.61: Looking at data to get the number of columns required

  8. Let's reconstruct the Product line column and check the result:

    sales['Product line'] = sales.apply(lambda x: x['Product'] + ' ' + x['line'], axis = 1)
    sales = sales.drop(['Product', 'line'], axis = 1)
    sales.head()

    The DataFrame should now look as follows:

    Figure 1.62: Collecting data from multiple columns into the correct field
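
    A note on the design choice: the row-wise apply above works, but pandas can do the same join as a vectorized string concatenation, which is much faster on large frames. A minimal equivalent sketch:

    # Vectorized equivalent of the apply + lambda above
    sales['Product line'] = sales['Product'] + ' ' + sales['line']
    sales = sales.drop(['Product', 'line'], axis = 1)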

  9. Instead of changing the sales DataFrame directly, let's create a copy of it, called tmp, and make the changes there, so that we don't lose the original DataFrame while we finalize the filtering procedure (see the sketch after this step for why a deep copy matters). Let's continue with the next column of interest, that is, Product type:

    tmp = copy.deepcopy(sales)
    tmp.groupby(['Product.1','type', 'Product.2'])['Year'].count()

    This gives the following output:

    Figure 1.63: Seeing variation in number of words required to represent product type

    We can see that some of these fields are not ones we are interested in. Let's limit our view to only those we need.
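
    Why a deep copy? Below is a minimal sketch with toy data (not the sales dataset) showing that plain assignment merely aliases a DataFrame, whereas copy.deepcopy produces an independent object; pandas' own df.copy(deep=True) would work equally well here:

    import copy
    import pandas as pd

    original = pd.DataFrame({"a": [1, 2, 3]})

    alias = original                       # same underlying object
    alias.loc[0, "a"] = 99
    print(original.loc[0, "a"])            # 99 -- the "copy" changed the original

    independent = copy.deepcopy(original)  # fully independent copy
    independent.loc[1, "a"] = -1
    print(original.loc[1, "a"])            # 2 -- the original is untouched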

  10. We are only interested in the Climbing Accessories, Cooking Gear, First Aid, Golf Accessories, Insect Repellents, and Sleeping Bags product types, so let's filter for those. Using similar logic as before, we store the matching rows in a new DataFrame, tmp1:

    tmp1 = copy.deepcopy(tmp[tmp['Product.1'].isin(['Climbing', 'Cooking', 'First', 'Golf', 'Insect', 'Sleeping'])])
    tmp1.head()

    This gives the following output:

    Figure 1.64: Filtering out the categories we need

    Then perform the following groupby operations as well:

    tmp1.groupby(['Product.1'])['Year'].count()
    tmp1.groupby(['Product.1', 'type', 'Product.2'])['Year'].count()

    This gives the following output:

    Figure 1.65: Looking for the variation in the required product types categories

    We can see that, luckily, because of our choice of fields, we only have to join two columns to get the attribute we need.

  11. Now that we know that every Product type value in tmp1 consists of exactly two words, we can join the two columns:

    tmp1['Product type'] = tmp1['Product.1'] + ' ' + tmp1['type']
    tmp1 = tmp1.drop(['Product.1', 'type'], axis = 1)
    tmp1.head()

    This gives the following output:

    Figure 1.66: Joining data with these required categories

  12. The next column we have to worry about is Product. Let's see how many columns we need for that:

    tmp1.groupby(['Product.2', 'Order'])['Year'].count()

    This gives the following output:

    Figure 1.67: Looking at the variation in the product category

    This column has some values that are one word long and some that are longer.

  13. Let's create another variable, tmp2, for values containing more than one word:

    tmp2 = copy.deepcopy(tmp1[~tmp1['Order'].isin(['E-mail', 'Fax', 'Mail', 'Sales', 'Special', 'Telephone', 'Web'])])
    tmp2.head()

    This gives the following output:

    Figure 1.68: Filtering out observations with values containing more than one word

  14. We look at the variation in tmp2 and see that, while most of the values have two words, some have more than two:

    tmp2.groupby(['Product.2','Order', 'method'])['Year'].count()

    This gives the following output:

    Figure 1.69: Distribution in the next few fields

  15. We keep repeating this procedure, splitting off the values with more leakage into a new variable each time, until we have exhausted all the columns and obtained a structured dataset in the last step (a generic sketch of this split-and-join pattern follows this step):

    tmp8 = copy.deepcopy(tmp7[tmp7['Product.3'].isin(['Kingdom', 'States'])])
    tmp8.head()

    This gives the following output:

    Figure 1.70: Distribution after structuring the longest parts

    After structuring the fields with the longest names for Products, Order method type, and Retailer country, we can see that there is no spillage in the columns containing numerical values. Let's handle the remaining cases directly.
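
    Since the same split-and-join pattern recurs at every layer, it can help to see it as a pair of helpers. This is a sketch of our own (split_layer and join_words are not names from the book), assuming the word-fragment columns hold strings:

    import copy

    def split_layer(df, col, known_words):
        # Rows whose value in `col` is in `known_words` versus rows whose
        # text spilled further into later columns
        mask = df[col].isin(known_words)
        return copy.deepcopy(df[mask]), copy.deepcopy(df[~mask])

    def join_words(df, new_col, parts):
        # Join the word fragments in `parts` into one column, then drop them
        df[new_col] = df[parts[0]].str.cat([df[p] for p in parts[1:]], sep=' ')
        return df.drop(parts, axis = 1)

    With these helpers, steps 10 and 11 together are roughly:

    tmp1, rest = split_layer(tmp, 'Product.1',
                             ['Climbing', 'Cooking', 'First', 'Golf', 'Insect', 'Sleeping'])
    tmp1 = join_words(tmp1, 'Product type', ['Product.1', 'type'])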

  16. We finish structuring the last part of the data directly and store the final structured data separately:

    tmp8['Retailer country'] = tmp8['revenue'] + ' ' + tmp8['Product.3']
    tmp8 = tmp8.drop(['revenue', 'Product.3'], axis = 1)
    tmp8["Revenue"] = tmp8['cost']
    tmp8 = tmp8.drop(['cost'], axis = 1)
    tmp8["Planned revenue"] = tmp8['Quantity'] 
    tmp8 = tmp8.drop(['Quantity'], axis = 1)
    tmp8["Product cost"] = tmp8['Unit'] 
    tmp8 = tmp8.drop(['Unit'], axis = 1)
    tmp8["Quantity"] = tmp8['cost.1'] 
    tmp8 = tmp8.drop(['cost.1'], axis = 1)
    tmp8["Unit cost"] = tmp8['Unit.1']
    tmp8 = tmp8.drop(['Unit.1'], axis = 1)
    tmp8["Unit price"] = tmp8['price']
    tmp8 = tmp8.drop(['price'], axis = 1)
    tmp8["Gross profit"] = tmp8['Gross']
    tmp8 = tmp8.drop(['Gross'], axis = 1)
    tmp8["Unit sale price"] = tmp8['profit']
    tmp8 = tmp8.drop(['profit', 'Unit.2', 'sale', 'price.1'], axis = 1)
    tmp8.head()

    This gives the following output:

    Figure 1.71: Fully structuring the data for the longest data

    We store the structured dataset separately as str1:

    str1 = tmp8
  17. Once the last layer of the data is completely structured like this, the layer just before it becomes easier to structure: removing the rows of the finished layer reduces the current layer to a problem we have already solved:

    temp = copy.deepcopy(tmp7[~tmp7.index.isin(tmp8.index.values)])
    temp.head()

    You should get the following output:

    Figure 1.72: Structuring data with the second longest data fields
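
    The index-based anti-join used in this step is a generally useful pattern. A minimal sketch with toy data (the names are ours, not the book's):

    import pandas as pd

    layer = pd.DataFrame({'v': range(5)})
    finished = layer.iloc[[0, 2]]   # rows already structured (playing the role of tmp8)

    # Rows of `layer` that are NOT in `finished`, selected by index
    remaining = layer[~layer.index.isin(finished.index)]

    # An equivalent, slightly more idiomatic form
    assert remaining.equals(layer.drop(finished.index))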

  18. Now, we keep working backward, structuring each earlier layer with the help of the correctly structured layers that follow it:

    temp1['Retailer country'] = temp1['method']
    temp1 = temp1.drop(['method'], axis = 1)
    \temp1["Planned revenue"] = temp1['Retailer'] 
    temp1 = temp1.drop(['Retailer'], axis = 1)
    temp1["Product cost"] = temp1['country'] 
    temp1 = temp1.drop(['country'], axis = 1)
    temp1["Unit cost"] = temp1['Planned'] 
    temp1 = temp1.drop(['Planned'], axis = 1)
    temp1["Unit price"] = temp1['revenue'] 
    temp1 = temp1.drop(['revenue'], axis = 1)
    temp1["Gross profit"] = temp1['Product.3']
    temp1 = temp1.drop(['Product.3'], axis = 1)
    temp1["Unit sale price"] = temp1['cost']
    temp1 = temp1.drop(['cost'], axis = 1)
    temp1["Quantity"]  = temp1['Revenue']
    temp1 = temp1.drop(['Revenue'], axis = 1)
    temp1['Revenue'] = temp1['type.1']
    temp1 = temp1.drop('type.1', axis = 1)
    temp1 = temp1[['Year', 'Product line', 'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 
    'Planned revenue', 'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']]
    
    temp1.head()

    This gives the following output:

    Figure 1.73: Iteratively going backward while stabilizing the longest fields

  19. Finally, having stored each structured layer as str1 through str22 in the same way, we combine them all and restore the column order:

    df = pd.concat([str1, str2, str3, str4, str5, str6, str7, str8, str9, str10, str11, str12, str13, str14, 
                   str15, str16, str17, str18, str19, str20, str21, str22], sort = True)[['Year', 'Product line',
    'Product type', 'Product', 'Order method type', 'Retailer country', 'Revenue', 'Planned revenue',
    'Product cost', 'Quantity', 'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']]
    df.groupby('Product type').count()

    This gives the following output:

    Figure 1.74: Combining Data Sources
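
    As a final check (not part of the original solution), it is worth verifying that no rows were lost during all the splitting and re-joining, and converting the numeric columns back to numeric types, assuming the spillage left them stored as text:

    # Sanity check: every original row should appear exactly once in df
    assert len(df) == len(sales), "rows were lost or duplicated during restructuring"

    # Convert the formerly spilled numeric fields back to numeric dtypes
    numeric_cols = ['Revenue', 'Planned revenue', 'Product cost', 'Quantity',
                    'Unit cost', 'Unit price', 'Gross profit', 'Unit sale price']
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    df.dtypes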