Book Image

Applied Unsupervised Learning with R

By : Alok Malik, Bradford Tuckfield
Book Image

Applied Unsupervised Learning with R

By: Alok Malik, Bradford Tuckfield

Overview of this book

Starting with the basics, Applied Unsupervised Learning with R explains clustering methods, distribution analysis, data encoders, and features of R that enable you to understand your data better and get answers to your most pressing business questions. This book begins with the most important and commonly used method for unsupervised learning - clustering - and explains the three main clustering algorithms - k-means, divisive, and agglomerative. Following this, you'll study market basket analysis, kernel density estimation, principal component analysis, and anomaly detection. You'll be introduced to these methods using code written in R, with further instructions on how to work with, edit, and improve R code. To help you gain a practical understanding, the book also features useful tips on applying these methods to real business problems, including market segmentation and fraud detection. By working through interesting activities, you'll explore data encoders and latent variable models. By the end of this book, you will have a better understanding of different anomaly detection methods, such as outlier detection, Mahalanobis distances, and contextual and collective anomaly detection.
Table of Contents (9 chapters)

Introduction to Market Segmentation


Market segmentation is dividing customers into different segments based on common characteristics. The following are the uses of customer segmentation:

  • Increasing customer conversion and retention

  • Developing new products for a particular segment by identifying it and its needs

  • Improving brand communication with a particular segment

  • Identifying gaps in marketing strategy and making new marketing strategies to increase sales

Exercise 4: Exploring the Wholesale Customer Dataset

In this exercise, we will have a look at the data in the wholesale customer dataset.

Note

For all the exercises and activities where we are importing an external CSV or image files, go to RStudio-> Session-> Set Working Directory-> To Source File Location. You can see in the console that the path is set automatically.

  1. To download the CSV file, go to https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Exercise04/wholesale_customers_data.csv. Click on wholesale_customers_data.csv.

    Note

    This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/00292/. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Exercise04/wholesale_customers_data.csv.

  2. Save it to the folder in which you have installed R. Now, to load it in R, use the following function:

    ws<-read.csv("wholesale_customers_data.csv")
  3. Now we may have a look at the different columns and rows in this dataset by using the following function in R:

    head(ws)

    The output is as follows:

    Figure 1.19: Columns of the wholesale customer dataset

These six rows show the first six rows of annual spending in monetary units by category of product.

Activity 2: Customer Segmentation with k-means

For this activity, we're going to use the wholesale customer dataset from the UCI Machine Learning Repository. It's available at: https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv. We're going to identify customers belonging to different market segments who like to spend on different types of goods with clustering. Try k-means clustering for values of k from 2 to 6.

Note

This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/00292/. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv.

These steps will help you complete the activity:

  1. Read data downloaded from the UCI Machine Learning Repository into a variable. The data can be found at: https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv.

  2. Select only two columns, Grocery and Frozen, for easy visualization of clusters.

  3. As in Step 2 of Exercise 4, Exploring the Wholesale Customer Dataset, change the value for the number of clusters to 2 and generate the cluster centers.

  4. Plot the graph as in Step 4 in Exercise 4, Exploring the Wholesale Customer Dataset.

  5. Save the graph you generate.

  6. Repeat Steps 3, 4, and 5 by changing value for the number of clusters to 3, 4, 5, and 6.

  7. Decide which value for the number of clusters best classifies the dataset.

The output will be chart of six clusters as follows:

Figure 1.20: The expected chart for six clusters

Note

The solution for this activity can be found on page 201.