
Weighted association rule mining


You were able to provide recommendations to the retailer to design his cross-selling campaign. As you discuss these results with the retailer, you are faced with the following question:

Our output up until now has been great. How can I add some additional lists of products to my campaign?

You are shocked; the data scientist in you wants to do everything empirically. Now the retailer is asking for a hard list of products to be added to the campaign. How do you fit them in?

The analysis output does not include these products. None of our top rules recommend them: they are not frequently sold items, so they never bubble up into the top rules.

The retailer is insistent: "The products I want to add to the campaign are high margin items. Having them in the campaign will boost my yields."

Voila! The retailer is interested in high-margin items. Let's pull another trick of the trade—weighted association rule mining.

Jubilant, you reply, "Of course I can accommodate these items, and not just them, but your other high-value items as well. I see that you are interested in your margins; can you give me the margin for each of your transactions? I can redo the analysis with this additional information and provide you with the new results. It shouldn't take much time."

The retailer is happy. "Of course; here is my margin for the transactions."

Let's introduce weighted association rule mining with an example from the seminal paper, Weighted Association Rules: Models and Algorithms, by G. D. Ramkumar et al.:

Caviar is an expensive and hence a low support item in any supermarket basket. Vodka, on the other hand, is a high to medium support item. The association, caviar => vodka is of very high confidence but will never be derived by the existing methods as the {caviar, vodka} itemset is of low support and will not be included.
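
To see the arithmetic behind this, consider a minimal sketch with invented toy numbers (not the book's dataset): two out of twenty transactions contain {caviar, vodka}, but those two carry most of the weight:

# Toy numbers, invented purely for illustration
weights <- c(0.9, 0.8, rep(0.05, 18))             # per-transaction weights
contains.itemset <- c(TRUE, TRUE, rep(FALSE, 18)) # transactions holding {caviar, vodka}

plain.support    <- sum(contains.itemset) / length(contains.itemset)
weighted.support <- sum(weights[contains.itemset]) / sum(weights)

plain.support    # 0.1  -- falls below a typical support threshold
weighted.support # ~0.65 -- survives once transaction weights are considered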

The preceding paragraph echoes our retailer's concern. With the additional information about the margin for our transactions, we can now use weighted association rule mining to arrive at our new set of recommendations:

"transactionID","weight"
 "1001861",0.59502283788534
 "1003729",0.658379205926458
 "1003831",0.635451491097042
 "1003851",0.596453384749423
 "1004513",0.558612727312164
 "1004767",0.557096300448959
 "1004795",0.693775098285732
 "1004797",0.519395513963845
 "1004917",0.581376662057422

The complete script for the weighted analysis is as follows:

########################################################################
#
# R Data Analysis Projects
#
# Chapter 1
#
# Building Recommender System
# A step-by-step approach to building association rule mining
#
# Script:
#
# R script to explain weighted association rule mining
#
# Gopi Subramanian
#
########################################################################
library(arules)
library(igraph)

get.txn <- function(data.path, columns){
  # Get transaction object for a given data file
  #
  # Args:
  #   data.path: data file name location
  #   columns:   transaction id and item id columns
  #
  # Returns:
  #   transaction object
  transactions.obj <- read.transactions(file = data.path, format = "single",
                                        sep = ",",
                                        cols = columns,
                                        rm.duplicates = FALSE,
                                        quote = "", skip = 0,
                                        encoding = "unknown")
  return(transactions.obj)
}

plot.graph <- function(cross.sell.rules){
  # Plot the associated items as a graph
  #
  # Args:
  #   cross.sell.rules: set of final rules recommended
  #
  # Returns:
  #   None
  edges <- unlist(lapply(cross.sell.rules['rules'], strsplit, split = '=>'))
  g <- graph(edges = edges)
  plot(g)
}

columns <- c("order_id", "product_id")           ## columns of interest in data file
data.path <- '../../data/data.csv'               ## path to data file
transactions.obj <- get.txn(data.path, columns)  ## create txn object

# Update the transactions object with transaction weights
transactions.obj@itemsetInfo$weight <- NULL
# Read the weights file
weights <- read.csv('../../data/weights.csv')
transactions.obj@itemsetInfo <- weights

# Frequent itemset generation
support <- 0.01
parameters <- list(
  support = support,
  minlen = 2,   # minimal number of items per itemset
  maxlen = 10,  # maximal number of items per itemset
  target = "frequent itemsets"
)
weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
weclat.itemsets.df <- data.frame(weclat.itemsets = labels(weclat.itemsets),
                                 weclat.itemsets@quality)
head(weclat.itemsets.df)
tail(weclat.itemsets.df)

# Rule induction
weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence = 0.3)
weclat.rules.df <- data.frame(rules = labels(weclat.rules),
                              weclat.rules@quality)
head(weclat.rules.df)

weclat.rules.df$rules <- as.character(weclat.rules.df$rules)
plot.graph(weclat.rules.df)

In the arules package, the weclat method allows us to use weighted transactions to generate frequent itemsets based on these weights. We introduce the weights through the itemsetInfo data frame in the transactions object. Calling str(transactions.obj) shows its structure:

 Formal class 'transactions' [package "arules"] with 3 slots
 ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
 .. .. ..@ i : int [1:110657] 143 167 1340 2194 2250 3082 3323 3378 3630 4109 ...
 .. .. ..@ p : int [1:6989] 0 29 38 52 65 82 102 125 141 158 ...
 .. .. ..@ Dim : int [1:2] 16793 6988
 .. .. ..@ Dimnames:List of 2
 .. .. .. ..$ : NULL
 .. .. .. ..$ : NULL
 .. .. ..@ factors : list()
 ..@ itemInfo :'data.frame': 16793 obs. of 1 variable:
 .. ..$ labels: chr [1:16793] "#2 Coffee Filters" "0% Fat Black Cherry Greek Yogurt y" "0% Fat Blueberry Greek Yogurt" "0% Fat Free Organic Milk" ...
 ..@ itemsetInfo:'data.frame': 6988 obs. of 1 variable:
 .. ..$ transactionID: chr [1:6988] "1001861" "1003729" "1003831" "1003851" ...

The third slot in the transaction object is a data frame with one column, transactionID. We create a new column called weight in that data frame and push our transaction weights:

weights <- read.csv("../../data/weights.csv")
transactions.obj@itemsetInfo <- weights
str(transactions.obj)

In the preceding case, we replaced the whole data frame. You can either do that or add only the weight column, as shown in the following sketch.
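
If you choose the latter, here is a small sketch, assuming the weights.csv file shown earlier with its transactionID column; matching on the ID guards against row-order differences:

weights <- read.csv("../../data/weights.csv")
# Align weights to the existing itemsetInfo rows by transaction ID
idx <- match(as.character(transactions.obj@itemsetInfo$transactionID),
             as.character(weights$transactionID))
transactions.obj@itemsetInfo$weight <- weights$weight[idx]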

Let's now look at the transactions object in the terminal:

Formal class 'transactions' [package "arules"] with 3 slots
 ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
 .. .. ..@ i : int [1:110657] 143 167 1340 2194 2250 3082 3323 3378 3630 4109 ...
 .. .. ..@ p : int [1:6989] 0 29 38 52 65 82 102 125 141 158 ...
 .. .. ..@ Dim : int [1:2] 16793 6988
 .. .. ..@ Dimnames:List of 2
 .. .. .. ..$ : NULL
 .. .. .. ..$ : NULL
 .. .. ..@ factors : list()
 ..@ itemInfo :'data.frame': 16793 obs. of 1 variable:
 .. ..$ labels: chr [1:16793] "#2 Coffee Filters" "0% Fat Black Cherry Greek Yogurt y" "0% Fat Blueberry Greek Yogurt" "0% Fat Free Organic Milk" ...
 ..@ itemsetInfo:'data.frame': 6988 obs. of 2 variables:
 .. ..$ transactionID: int [1:6988] 1001861 1003729 1003831 1003851 1004513 1004767 1004795 1004797 1004917 1004995 ...
 .. ..$ weight : num [1:6988] 0.595 0.658 0.635 0.596 0.559 ...

We have the transactionID and the weight in the itemsetInfo data frame now. Let's run the weighted itemset generation using these transaction weights:

support <- 0.01
parameters <- list(
  support = support,
  minlen = 2,   # minimal number of items per itemset
  maxlen = 10,  # maximal number of items per itemset
  target = "frequent itemsets"
)
weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
weclat.itemsets.df <- data.frame(weclat.itemsets = labels(weclat.itemsets),
                                 weclat.itemsets@quality)

Once again, we invoke the weclat function with the parameter list and the transactions object. As the itemsetInfo data frame now has the weight column, the function calculates the support using the weights provided. The new definition of support is as follows:

For a given item A:

weighted support(A) = (sum of the weights of the transactions containing A) / (sum of all transaction weights)

The weighted support of an itemset is the sum of the weights of the transactions that contain that itemset. An itemset is frequent if its weighted support is equal to or greater than the threshold specified by support (assuming that the weights sum to one).
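
To convince yourself of this definition, you can recompute the weighted support of any reported itemset by hand from the binary incidence matrix. The following is a small sketch reusing transactions.obj and its weight column; the item labels come from the output shown below:

# Recompute the weighted support of one itemset manually
w <- transactions.obj@itemsetInfo$weight
m <- as(transactions.obj, "matrix")   # transactions x items incidence matrix

items <- c("Banana", "Seedless Red Grapes")
in.itemset <- rowSums(m[, items, drop = FALSE]) == length(items)

sum(w[in.itemset]) / sum(w)   # should match the support weclat reports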

With this new definition, you can now see that itemsets deemed low-support under the old definition will be included if they occur in high-value transactions. We have automatically taken care of our retailer's request to include high-margin items while inducing the rules. Once again, for better readability, we create a data frame where each row is a generated frequent itemset, with a column for its weighted support, and inspect it with head(weclat.itemsets.df) and tail(weclat.itemsets.df):

  weclat.itemsets support
1 {Bag of Organic Bananas,Organic Kiwi} 0.01041131
2 {Bag of Organic Bananas,Organic D'Anjou Pears} 0.01042194
3 {Bag of Organic Bananas,Organic Whole String Cheese} 0.01034432
4 {Organic Baby Spinach,Organic Small Bunch Celery} 0.01039107
5 {Bag of Organic Bananas,Organic Small Bunch Celery} 0.01109455
6 {Banana,Seedless Red Grapes} 0.01274448
tail(weclat.itemsets.df)
  weclat.itemsets support
77 {Banana,Organic Hass Avocado} 0.02008700
78 {Organic Baby Spinach,Organic Strawberries} 0.02478094
79 {Bag of Organic Bananas,Organic Baby Spinach} 0.02743582
80 {Banana,Organic Baby Spinach} 0.02967578
81 {Bag of Organic Bananas,Organic Strawberries} 0.03626149
82 {Banana,Organic Strawberries} 0.03065132

With apriori, a single function call both generated the frequent itemsets and induced the rules. In the case of weighted association rule mining, however, we need to call the ruleInduction function to generate rules. We pass it the frequent itemsets from the previous step, the transactions object, and finally the confidence threshold. Once again, for our convenience, we create a data frame with all the induced rules and their interest measures:

weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence = 0.3)
weclat.rules.df <- data.frame(rules = labels(weclat.rules),
                              weclat.rules@quality)
head(weclat.rules.df)
  rules support confidence lift itemset
1 {Organic Kiwi} => {Bag of Organic Bananas} 0.01016027 0.3879781 2.388715 1
3 {Organic D'Anjou Pears} => {Bag of Organic Bananas} 0.01001717 0.3846154 2.368011 2
5 {Organic Whole String Cheese} => {Bag of Organic Bananas} 0.00930166 0.3250000 2.000969 3
11 {Seedless Red Grapes} => {Banana} 0.01302232 0.3513514 1.686293 6
13 {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.01445335 0.3825758 2.355453 7
15 {Honeycrisp Apple} => {Banana} 0.01617058 0.4248120 2.038864 8
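
If you also want a ranked shortlist for the campaign, one simple extra step (a suggestion, not part of the script above) is to order the induced rules by lift:

# Rank the induced rules by lift to shortlist the strongest associations
weclat.rules.df <- weclat.rules.df[order(-weclat.rules.df$lift), ]
head(weclat.rules.df, 10)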

Finally, let's use the plot.graph function to view the new set of interesting item associations (these are the closing lines of the script shown earlier):
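
# Convert the rules from factor to character so plot.graph can split on '=>'
weclat.rules.df$rules <- as.character(weclat.rules.df$rules)
plot.graph(weclat.rules.df)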

Our new recommendations now include some of the rare items, and they are sensitive to the profit margin of the individual transactions. With these recommendations, the retailer is geared up to increase his profitability through the cross-selling campaign.