R Data Analysis Projects

Overview of this book

R offers a large variety of packages and libraries for fast and accurate data analysis and visualization. As a result, it’s one of the most popular languages among data scientists, analysts, and anyone else who wants to perform data analysis. This book will demonstrate how you can put your existing knowledge of data analysis in R to use to build highly efficient, end-to-end data analysis pipelines. You’ll start by building a content-based recommendation system, followed by a project on sentiment analysis with tweets. You’ll implement time-series modeling for anomaly detection, and understand cluster analysis of streaming data. You’ll work through projects on performing efficient market data research, building recommendation systems, and analyzing networks accurately, all provided with easy-to-follow code. With the help of these real-world projects, you’ll get a better understanding of the challenges faced when building data analysis pipelines, and see how you can overcome them without compromising on the efficiency or accuracy of your systems. The book covers some popular R packages such as dplyr, ggplot2, and RShiny, and includes tips on using them effectively. By the end of this book, you’ll have a better understanding of data analysis with R, and be able to put your knowledge to practical use.

The cross-selling campaign


Let's get back to our retailer. Let's use what we have built so far to provide recommendations to our retailer for his cross-selling strategy.

This can be implemented using the following code:

###########################################################################
 #
 # R Data Analysis Projects
 #
 # Chapter 1
 #
 # Building Recommender System
 # A step-by-step approach to build Association Rule Mining
 #
 #
 # Script:
 # Generating rules for cross sell campaign.
 #
 #
 # Gopi Subramanian
 ###########################################################################
library(arules)
library(igraph)
get.txn <- function(data.path, columns){
  # Get transaction object for a given data file
  #
  # Args:
  #  data.path: data file name location
  #  columns: transaction id and item id columns.
  #
  # Returns:
  #  transaction object
  transactions.obj <- read.transactions(file = data.path, format = "single",
                                        sep = ",",
                                        cols = columns,
                                        rm.duplicates = FALSE,
                                        quote = "", skip = 0,
                                        encoding = "unknown")
  return(transactions.obj)
}
get.rules <- function(support, confidence, transactions){
  # Get Apriori rules for given support and confidence values
  #
  # Args:
  #  support: minimum support threshold
  #  confidence: minimum confidence threshold
  #  transactions: transaction object to mine
  #
  # Returns:
  #  rules object
  parameters <- list(
    support = support,
    confidence = confidence,
    minlen = 2,   # Minimal number of items per item set
    maxlen = 10,  # Maximal number of items per item set
    target = "rules"
  )

  rules <- apriori(transactions, parameter = parameters)
  return(rules)
}
find.rules <- function(transactions, support, confidence, topN = 10){
  # Generate the rules for given support and confidence values and keep
  # the top N by leverage
  #
  # Args:
  #  transactions: Transaction object, list of transactions
  #  support: Minimum support threshold
  #  confidence: Minimum confidence threshold
  #  topN: Number of rules to return (default 10)
  # Returns:
  #  A data frame with the best set of rules and their support and confidence values

  # Get rules for given combination of support and confidence
  all.rules <- get.rules(support, confidence, transactions)

  # Rule labels ("{A} => {B}") alongside the quality measures stored by apriori
  rules.df <- data.frame(rules = labels(all.rules),
                         quality(all.rules),
                         stringsAsFactors = FALSE)

  # Add conviction and leverage, which are not among the stored quality measures
  other.im <- interestMeasure(all.rules, transactions = transactions)
  rules.df <- cbind(rules.df, other.im[, c('conviction', 'leverage')])

  # Keep the top N rules, ranked by leverage in decreasing order
  best.rules.df <- head(rules.df[order(-rules.df$leverage), ], topN)

  return(best.rules.df)
}
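plot.graph <- function(cross.sell.rules){
  # Plot the associated items as a directed graph
  #
  # Args:
  #  cross.sell.rules: Set of final rules recommended
  # Returns:
  #  None

  # Split each rule "{A} => {B}" into its two sides and trim the spaces
  # left around them, so that an item gets the same vertex name whether
  # it appears on the left or on the right of a rule
  sides <- strsplit(as.character(cross.sell.rules$rules), split = '=>', fixed = TRUE)
  edges <- trimws(unlist(sides))

  # Consecutive pairs in edges are read as (from, to)
  g <- make_graph(edges = edges)
  plot(g)
}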
support <- 0.01
confidence <- 0.2
columns <- c("order_id", "product_id")  ## columns of interest in data file
data.path <- '../../data/data.csv'      ## path to data file
transactions.obj <- get.txn(data.path, columns)  ## create txn object
cross.sell.rules <- find.rules(transactions.obj, support, confidence)
cross.sell.rules$rules <- as.character(cross.sell.rules$rules)
plot.graph(cross.sell.rules)
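
The data file itself is not shown in the text. As a rough illustration of what format = "single" expects, the sketch below assumes a comma-separated file with a header row and one row per (order, product) pair. The toy rows and the toy.path and baskets names are hypothetical, and only base R is used so that the grouping is visible without arules.

```r
# Hypothetical toy file: one row per (order_id, product_id) pair
toy.path <- tempfile(fileext = ".csv")
writeLines(c("order_id,product_id",
             "1,Limes",
             "1,Large Lemon",
             "2,Limes",
             "3,Organic Garlic",
             "3,Organic Yellow Onion"), toy.path)

toy <- read.csv(toy.path, stringsAsFactors = FALSE)

# Group the item ids by transaction id; this is the grouping that
# format = "single" asks read.transactions to perform
baskets <- split(toy$product_id, toy$order_id)
print(baskets)
```

Each element of baskets is one transaction, which is the unit over which support and confidence are counted.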

After exploring the dataset for support and confidence values, we set the support and confidence thresholds to 0.01 and 0.2 respectively, as in the code above.

We have written a function called find.rules, which internally calls get.rules. It returns the top N rules, ranked by leverage, for the given transactions and support/confidence thresholds. We are interested in the top 10 rules, and we report their lift values alongside for our recommendation. The following are our top 10 rules:

   rules                                               support    confidence lift     conviction leverage
59 {Organic Hass Avocado} => {Bag of Organic Bananas}  0.03219805 0.3086420  1.900256 1.211498   0.01525399
63 {Organic Strawberries} => {Bag of Organic Bananas}  0.03577562 0.2753304  1.695162 1.155808   0.01467107
64 {Bag of Organic Bananas} => {Organic Strawberries}  0.03577562 0.2202643  1.695162 1.115843   0.01467107
52 {Limes} => {Large Lemon}                            0.01846022 0.2461832  3.221588 1.225209   0.01273006
53 {Large Lemon} => {Limes}                            0.01846022 0.2415730  3.221588 1.219648   0.01273006
51 {Organic Raspberries} => {Bag of Organic Bananas}   0.02318260 0.3410526  2.099802 1.271086   0.01214223
50 {Organic Raspberries} => {Organic Strawberries}     0.02003434 0.2947368  2.268305 1.233671   0.01120205
40 {Organic Yellow Onion} => {Organic Garlic}          0.01431025 0.2525253  4.084830 1.255132   0.01080698
41 {Organic Garlic} => {Organic Yellow Onion}          0.01431025 0.2314815  4.084830 1.227467   0.01080698
58 {Organic Hass Avocado} => {Organic Strawberries}    0.02432742 0.2331962  1.794686 1.134662   0.01077217

The first entry has a lift value of 1.9: the two products appear together about 1.9 times as often as they would if they were independent. The rule has a support of about 3.2 percent and a confidence of about 31 percent, so roughly 31 percent of the transactions that contain an {Organic Hass Avocado} also contain a {Bag of Organic Bananas}. We recommend that the retailer uses these two products in his cross-selling campaign because, given the lift value, a customer who picks up an {Organic Hass Avocado} is considerably more likely than the average customer to pick up a {Bag of Organic Bananas}.
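
The order of the table follows from the sort inside find.rules, which ranks by leverage rather than lift. The sketch below is a minimal illustration of that step, using four rows copied by hand from the table above. The small rules.df built here is an assumption made for the example, not the full output of find.rules.

```r
# Four rules copied by hand from the table above, for illustration only
rules.df <- data.frame(
  rules = c("{Organic Hass Avocado} => {Bag of Organic Bananas}",
            "{Limes} => {Large Lemon}",
            "{Organic Yellow Onion} => {Organic Garlic}",
            "{Organic Raspberries} => {Organic Strawberries}"),
  lift = c(1.900256, 3.221588, 4.084830, 2.268305),
  leverage = c(0.01525399, 0.01273006, 0.01080698, 0.01120205),
  stringsAsFactors = FALSE
)

# Same idiom as find.rules: sort in decreasing order, keep the first rows
top.by.leverage <- head(rules.df[order(-rules.df$leverage), ], 2)
top.by.lift <- head(rules.df[order(-rules.df$lift), ], 2)

print(top.by.leverage$rules)
print(top.by.lift$rules)
```

On these four rows, the avocado and banana rule comes first by leverage, while the onion and garlic rule comes first by lift, so the choice of ranking measure changes which rules are recommended.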

We have also included two other interest measures: conviction and leverage.

Leverage 

Leverage answers the question: how many more transactions contain both A and B than would be expected if the two items were bought independently? With lift, we said that there is a high association between the {Bag of Organic Bananas} and {Organic Hass Avocado} products. With leverage, we can quantify that association in terms of sales. The leverage of about 0.015 for this rule means that the two products appear together in about 1.5 percentage points more transactions than independence would predict, that is, roughly 15 additional joint purchases per 1,000 transactions. For a given rule A => B:

Leverage(A => B) = Support(A => B) - Support(A)*Support(B)

Leverage measures the difference between how often A and B appear together in the dataset and how often they would be expected to appear together if A and B were statistically independent.
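
As a check on the formula, the leverage of the first rule can be recomputed from the other columns of the table. This is a sketch using only the standard definitions of confidence and lift. The variable names are chosen for the example.

```r
# Values for {Organic Hass Avocado} => {Bag of Organic Bananas}, from the table
support.ab <- 0.03219805
confidence.ab <- 0.3086420
lift.ab <- 1.900256

# Recover the individual item supports from the definitions:
#   confidence(A => B) = support(A => B) / support(A)
#   lift(A => B)       = confidence(A => B) / support(B)
support.a <- support.ab / confidence.ab
support.b <- confidence.ab / lift.ab

# Leverage(A => B) = Support(A => B) - Support(A) * Support(B)
leverage.ab <- support.ab - support.a * support.b
print(round(leverage.ab, 5))

# Extra joint purchases per 1,000 transactions, relative to independence
print(round(1000 * leverage.ab, 1))
```

The result agrees with the leverage column of the table to within the rounding of the printed values.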

Conviction

Conviction is a measure to ascertain the direction of the rule. Unlike lift, conviction is sensitive to the rule direction. Conviction (A => B) is not the same as conviction (B => A).

For a rule A => B:

conviction(A => B) = (1 - support(B)) / (1 - confidence(A => B))

Conviction, with the sense of its direction, gives us a hint that targeting the customers of Organic Hass Avocado to cross-sell will yield more sales of Bag of Organic Bananas rather than the other way round.
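
The direction sensitivity can be seen on a pair of rules that appears in the table in both directions, {Organic Strawberries} => {Bag of Organic Bananas} and its reverse. The sketch below recomputes conviction from the confidence and lift columns. The helper name conviction and its arguments are chosen for this example and are not part of arules.

```r
conviction <- function(confidence, lift) {
  # support(B) follows from lift(A => B) = confidence(A => B) / support(B)
  support.b <- confidence / lift
  (1 - support.b) / (1 - confidence)
}

# Both directions share the same lift but have different confidence values
forward <- conviction(confidence = 0.2753304, lift = 1.695162)  # Strawberries => Bananas
reverse <- conviction(confidence = 0.2202643, lift = 1.695162)  # Bananas => Strawberries

print(round(c(forward = forward, reverse = reverse), 4))
```

Lift is identical for the two rules, whereas conviction is higher in the strawberries-to-bananas direction, in line with the conviction column of the table.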

Thus, using lift, leverage, and conviction, we have provided all the empirical details to our retailer to design his cross-selling campaign. In our case, we have recommended the top 10 rules to the retailer based on leverage. To provide the results more intuitively and to indicate what items could go together in a cross-selling campaign, a graph visualization of the rules can be very appropriate.

The plot.graph function is used to visualize the rules that we have shortlisted based on their leverage values. It internally uses a package called igraph to create a graph representation of the rules:

Our suggestion to the retailer can be the largest subgraph on the left. Items in that graph can be leveraged for his cross-selling campaign. Depending on the profit margin and other factors, the retailer can now design his cross-selling campaign using the preceding output.
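
The step that turns rule labels into graph edges can be sketched on its own. The example below assumes rule strings of the form "{A} => {B}", as in the table, and uses three of them. The edge.mat and vertices names are hypothetical, and the igraph call runs only if that package is installed.

```r
rules <- c("{Organic Strawberries} => {Bag of Organic Bananas}",
           "{Bag of Organic Bananas} => {Organic Strawberries}",
           "{Limes} => {Large Lemon}")

# Split each label on the arrow and trim the spaces left on either side,
# so that an item has the same name on the left and on the right of a rule
sides <- lapply(strsplit(rules, split = "=>", fixed = TRUE), trimws)

# One row per rule: column 1 is the antecedent, column 2 the consequent
edge.mat <- do.call(rbind, sides)
colnames(edge.mat) <- c("from", "to")
print(edge.mat)

# Distinct items that become vertices of the graph
vertices <- unique(as.vector(edge.mat))
print(length(vertices))

# Optional: build and plot the directed graph when igraph is available
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_edgelist(edge.mat, directed = TRUE)
  plot(g)
}
```

Without the trimming step, the same item would get two different vertex names depending on which side of the arrow it appeared on, and related rules would not join into one subgraph.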