Applied Unsupervised Learning with Python

By Benjamin Johnston, Aaron Jones, Christopher Kruger

Overview of this book

Unsupervised learning is a useful and practical solution in situations where labeled data is not available. Applied Unsupervised Learning with Python guides you in learning the best practices for using unsupervised learning techniques in tandem with Python libraries and extracting meaningful information from unstructured data. The book begins by explaining how basic clustering works to find similar data points in a set. Once you are well-versed with the k-means algorithm and how it operates, you’ll learn what dimensionality reduction is and where to apply it. As you progress, you’ll learn various neural network techniques and how they can improve your model. While studying the applications of unsupervised learning, you will also understand how to mine topics that are trending on Twitter and Facebook and build a news recommendation engine for users. Finally, you will be able to put your knowledge to work through interesting activities such as performing a Market Basket Analysis and identifying relationships between different products. By the end of this book, you will have the skills you need to confidently build your own models using Python.

Chapter 8: Market Basket Analysis

Activity 18: Loading and Preparing Full Online Retail Data


  1. Load the online retail dataset file:

    import matplotlib.pyplot as plt
    import mlxtend.frequent_patterns
    import mlxtend.preprocessing
    import numpy
    import pandas

    # Read the transactions from the Excel workbook
    online = pandas.read_excel(
        io="Online Retail.xlsx", 
        sheet_name="Online Retail", 
        header=0
    )
  2. Clean and prep the data for modeling, including turning the cleaned data into a list of lists:

    # Cancelled invoices have a "C" in the invoice number; flag them
    online['IsCPresent'] = (
        online['InvoiceNo']
        .astype(str)
        .apply(lambda x: 1 if x.find('C') != -1 else 0)
    )
    # Keep positive quantities, drop cancellations, and retain only
    # the invoice number and item description columns
    online1 = (
        online
        .loc[online["Quantity"] > 0]
        .loc[online['IsCPresent'] != 1]
        .loc[:, ["InvoiceNo", "Description"]]
        .dropna()
    )
    # Build one list of purchased items per invoice
    invoice_item_list = []
    for num in list(set(online1.InvoiceNo.tolist())):
        tmp_df = online1.loc[online1['InvoiceNo'] == num]
        tmp_items = tmp_df.Description.tolist()
        invoice_item_list.append(tmp_items)
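
    The same list of lists can be built more concisely with a groupby; a minimal sketch, equivalent to the loop above and typically much faster on the full dataset:

    invoice_item_list = (
        online1.groupby('InvoiceNo')['Description']
        .apply(list)
        .tolist()
    )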
  3. Encode the data and recast it as a DataFrame:

    # One-hot encode the transactions: one Boolean column per unique item
    online_encoder = mlxtend.preprocessing.TransactionEncoder()
    online_encoder_array = online_encoder.fit_transform(invoice_item_list)
    online_encoder_df = pandas.DataFrame(
        online_encoder_array, 
        columns=online_encoder.columns_
    )

    The output is as follows:

    Figure 8.35: A subset of the cleaned, encoded, and recast DataFrame built from the complete online retail dataset
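
To make the encoding concrete, here is a minimal sketch of what TransactionEncoder produces on a toy transaction list (the toy data is invented purely for illustration):

    toy = [['milk', 'bread'], ['bread', 'butter'], ['milk', 'butter', 'bread']]
    toy_encoder = mlxtend.preprocessing.TransactionEncoder()
    toy_df = pandas.DataFrame(
        toy_encoder.fit_transform(toy), 
        columns=toy_encoder.columns_
    )
    print(toy_df)
    # One Boolean row per transaction, one column per unique item:
    #    bread  butter   milk
    # 0   True   False   True
    # 1   True    True  False
    # 2   True    True   True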

Activity 19: Apriori on the Complete Online Retail Dataset


  1. Run the Apriori algorithm on the full data with reasonable parameter settings:

    # A minimum support of 1% is low enough to retain the item sets
    # examined in the following steps
    mod_colnames_minsupport = mlxtend.frequent_patterns.apriori(
        online_encoder_df, 
        min_support=0.01, 
        use_colnames=True
    )

    The output is as follows:

    Figure 8.36: The Apriori algorithm results using the complete online retail dataset
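
    Because min_support controls both the runtime and the number of item sets returned, it can be worth scanning a few candidate thresholds before settling on one. A sketch (the threshold values here are arbitrary illustrations):

    for threshold in [0.005, 0.01, 0.02, 0.05]:
        itemsets = mlxtend.frequent_patterns.apriori(
            online_encoder_df, 
            min_support=threshold, 
            use_colnames=True
        )
        print(threshold, len(itemsets))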

  2. Filter the results down to the item set containing 10 COLOUR SPACEBOY PEN. Compare the support value with that found in Exercise 44, Executing the Apriori algorithm:

    mod_colnames_minsupport[
        mod_colnames_minsupport['itemsets'] == frozenset(
            {'10 COLOUR SPACEBOY PEN'}
        )
    ]

    The output is as follows:

    Figure 8.37: Result of item set containing 10 COLOUR SPACEBOY PEN

    The support value does change. When the dataset is expanded to include all transactions, the support for this item set increases from 0.015 to 0.015793. That is, in the reduced dataset used for the exercises, this item set appears in 1.5% of the transactions, while in the full dataset, it appears in approximately 1.6% of transactions.
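
    Support is simply the fraction of transactions containing the item set, so the figure can be verified directly from the encoded DataFrame:

    # Count the invoices containing the item and divide by the total
    count = online_encoder_df['10 COLOUR SPACEBOY PEN'].sum()
    print(count / len(online_encoder_df))  # approximately 0.015793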

  3. Add another column containing the item set length. Then, filter down to those item sets whose length is two and whose support is in the range [0.02, 0.021]. Are the item sets the same as those found in Exercise 44, Executing the Apriori algorithm, Step 6?

    mod_colnames_minsupport['length'] = (
        mod_colnames_minsupport['itemsets'].apply(lambda x: len(x))
    )
    mod_colnames_minsupport[
        (mod_colnames_minsupport['length'] == 2) & 
        (mod_colnames_minsupport['support'] >= 0.02) &
        (mod_colnames_minsupport['support'] < 0.021)
    ]

    Figure 8.38: A section of the results of filtering based on length and support

    The results did change. Before even looking at the particular item sets and their support values, we can see that this filtered DataFrame has fewer item sets than the one in the previous exercise. When we use the full dataset, fewer item sets match the filtering criteria: only 14 item sets contain two items and have a support value greater than or equal to 0.02 and less than 0.021. In the previous exercise, 17 item sets met these criteria.

  4. Plot the support values:

    mod_colnames_minsupport.hist("support", grid=False, bins=30)

    Figure 8.39: The distribution of support values

This plot shows the distribution of support values for the full transaction dataset. As you might have assumed, the distribution is right skewed; that is, most of the item sets have lower support values and there is a long tail of support values on the higher end of the spectrum. Given how many unique item sets exist, it is not surprising that no single item set appears in a high percentage of the transactions. With this information, we could tell management that even the most prominent item set only appears in approximately 10% of the transactions, and that the vast majority of item sets appear in less than 2% of transactions. These results may not support changes in store layout, but could very well inform pricing and discounting strategies. We would gain more information on how to build these strategies by formalizing some association rules.
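
To see which item sets stand out, sort the frequent item sets by support; a minimal sketch:

    # The most frequent item sets; per the discussion above,
    # the top support value is roughly 0.10
    print(mod_colnames_minsupport.nlargest(5, 'support'))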

Activity 20: Finding the Association Rules on the Complete Online Retail Dataset


  1. Fit the association rule model on the full dataset. Use metric confidence and a minimum threshold of 0.6:

    rules = mlxtend.frequent_patterns.association_rules(
        mod_colnames_minsupport, 
        metric="confidence", 
        min_threshold=0.6, 
        support_only=False
    )

    The output is as follows:

    Figure 8.40: The association rules based on the complete online retail dataset

  2. Count the number of association rules. Is the number different to that found in Exercise 45, Deriving Association Rules, Step 1?

    print("Number of Associations: {}".format(rules.shape[0]))

    There are 498 association rules.

  3. Plot confidence against support:

    rules.plot.scatter("support", "confidence", alpha=0.5, marker="*")
    plt.title("Association Rules")

    The output is as follows:

    Figure 8.41: The plot of confidence against support

    The plot reveals that there are some association rules featuring relatively high support and confidence values for this dataset.

  4. Look at the distributions of lift, leverage, and conviction:

    rules.hist("lift", grid=False, bins=30)

    The output is as follows:

    Figure 8.42: The distribution of lift values

    rules.hist("leverage", grid=False, bins=30)

    The output is as follows:

    Figure 8.43: The distribution of leverage values

    # Conviction can be infinite, so restrict the plot to finite values
    plt.hist(
        rules[numpy.isfinite(rules['conviction'])].conviction.values, 
        bins=30
    )
    plt.title("Conviction")

    The output is as follows:

    Figure 8.44: The distribution of conviction values
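
    For reference, all three metrics can be recomputed by hand from the columns that association_rules returns, which makes their definitions concrete. A sketch using the first rule (this assumes at least one rule exists and its confidence is below 1, so the conviction denominator is nonzero):

    r = rules.iloc[0]
    lift = r['support'] / (r['antecedent support'] * r['consequent support'])
    leverage = r['support'] - r['antecedent support'] * r['consequent support']
    conviction = (1 - r['consequent support']) / (1 - r['confidence'])
    # These should match r['lift'], r['leverage'], and r['conviction']
    print(lift, leverage, conviction)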

Having derived association rules, we can return to management with additional information, the most important piece being that there are roughly seven rules with reasonably high values for both support and confidence. In the scatterplot of confidence against support, these seven are clearly separated from all the others, and they also have high lift values, as can be seen in the lift histogram. It seems that we have identified some actionable association rules, ones that we can use to drive business decisions.
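
A minimal sketch of how those standout rules could be pulled out for a report; the cut-off values here are illustrative guesses read off Figure 8.41, not values from the text:

    # Adjust the thresholds to match the gap visible in the scatterplot
    standouts = rules[(rules['support'] > 0.025) & (rules['confidence'] > 0.8)]
    print(standouts[['antecedents', 'consequents', 'support', 'confidence', 'lift']])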