Book Image

R Data Analysis Projects

Book Image

R Data Analysis Projects

Overview of this book

R offers a large variety of packages and libraries for fast and accurate data analysis and visualization. As a result, it’s one of the most popularly used languages by data scientists and analysts, or anyone who wants to perform data analysis. This book will demonstrate how you can put to use your existing knowledge of data analysis in R to build highly efficient, end-to-end data analysis pipelines without any hassle. You’ll start by building a content-based recommendation system, followed by building a project on sentiment analysis with tweets. You’ll implement time-series modeling for anomaly detection, and understand cluster analysis of streaming data. You’ll work through projects on performing efficient market data research, building recommendation systems, and analyzing networks accurately, all provided with easy to follow codes. With the help of these real-world projects, you’ll get a better understanding of the challenges faced when building data analysis pipelines, and see how you can overcome them without compromising on the efficiency or accuracy of your systems. The book covers some popularly used R packages such as dplyr, ggplot2, RShiny, and others, and includes tips on using them effectively. By the end of this book, you’ll have a better understanding of data analysis with R, and be able to put your knowledge to practical use without any hassle.
Table of Contents (15 chapters)
Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Retailer use case and data


A retailer has approached us with a problem. In the coming months, he wants to boost his sales. He is planning a marketing campaign on a large scale to promote sales. One aspect of his campaign is the cross-selling strategy. Cross-selling is the practice of selling additional products to customers. In order to do that, he wants to know what items/products tend to go together. Equipped with this information, he can now design his cross-selling strategy. He expects us to provide him with a recommendation of top N product associations so that he can pick and choose among them for inclusion in his campaign.

He has provided us with his historical transaction data. The data includes his past transactions, where each transaction is uniquely identified by an order_id integer, and the list of products present in the transaction called product_id. Remember the binary matrix representation of transactions we described in the introduction section? The dataset provided by our retailer is in exactly the same format.

Let's start by reading the data provided by the retailer. The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As we work through our example, we will introduce the arules R package that we will be using. In our description, we will be using the terms order/transaction, user/customer, and item/product interchangeably. We will not be describing the installation of R packages used through the code. It's assumed that the reader is well aware of the method for installing an R package and has installed those packages before using it.

This data can be downloaded from:

data.path = '../../data/data.csv'
 data = read.csv(data.path)
 head(data)
order_id product_id
 1 837080 Unsweetened Almondmilk
 2 837080 Fat Free Milk dairy
 3 837080 Turkey
 4 837080 Caramel Corn Rice Cakes
 5 837080 Guacamole Singles
 6 837080 HUMMUS 10OZ WHITE BEAN EAT WELL

The given data is in a tabular format. Every row is a tuple of order_id, representing the transaction, product_id, the item included in that transaction, and the department_id, that is the department to which the item belongs to. This is our binary data representation, which is absolutely tenable for the classical association rule mining algorithm. This algorithm is also called market basket analysis, as we are analyzing the customer's basket—the transactions. Given a large database of customer transactions where each transaction consists of items purchased by a customer during a visit, the association rule mining algorithm generates all significant association rules between the items in the database.

Note

What is an association rule?  An example from a grocery transaction would be that the association rule is a recommendation of the form {peanut butter, jelly} => { bread }. It says that, based on the transactions, it's expected that bread will most likely be present in a transaction that contains peanut butter and jelly. It's a recommendation to the retailer that there is enough evidence in the database to say that customers who buy peanut butter and jelly will most likely buy bread.

Let's quickly explore our data. We can count the number of unique transactions and the number of unique products:

library(dplyr)
data %>%
 group_by('order_id') %>%
 summarize(order.count = n_distinct(order_id))

 data %>%
 group_by('product_id') %>%
 summarize(product.count = n_distinct(product_id))
# A tibble: 1 <U+00D7> 2
 `"order_id"` order.count
 <chr> <int>
 1 order_id 6988
# A tibble: 1 <U+00D7> 2
 `"product_id"` product.count
 <chr> <int>
 1 product_id 16793

We have 6988 transactions and 16793 individual products. There is no information about the quantity of products purchased in a transaction. We have used the dplyr library to perform these aggregate calculations, which is a library used to perform efficient data wrangling on data frames.

Note

dplyr is part of tidyverse, a collection of R packages designed around a common philosophy. dplyr is a grammar of data manipulation, and provides a consistent set of methods to help solve the most common data manipulation challenges. To learn more about dplyr, refer to the following links:http://tidyverse.org/http://dplyr.tidyverse.org/

In the next section, we will introduce the association rule mining algorithm. We will explain how this method can be leveraged to generate the top N product recommendations requested by the retailer for his cross-selling campaign.