Let's get back to our retailer and use what we have built so far to recommend a cross-selling strategy.
The following script implements this:
###########################################################################
#
# R Data Analysis Projects
#
# Chapter 1
#
# Building Recommender System
# A step-by-step approach to build Association Rule Mining
#
#
# Script:
#   Generating rules for cross sell campaign.
#
#
# Gopi Subramanian
###########################################################################
library(arules)
library(igraph)
get.txn <- function(data.path, columns){
  # Get transaction object for a given data file
  #
  # Args:
  #  data.path: data file name location
  #  columns: transaction id and item id columns.
  #
  # Returns:
  #  transaction object
  transactions.obj <- read.transactions(file = data.path, format = "single",
                                        sep = ",",
                                        cols = columns,
                                        rm.duplicates = FALSE,
                                        quote = "", skip = 0,
                                        encoding = "unknown")
  return(transactions.obj)
}
get.rules <- function(support, confidence, transactions){
  # Get Apriori rules for given support and confidence values
  #
  # Args:
  #  support: support parameter
  #  confidence: confidence parameter
  #  transactions: transaction object to mine
  #
  # Returns:
  #  rules object
  parameters = list(
    support = support,
    confidence = confidence,
    minlen = 2,   # Minimal number of items per item set
    maxlen = 10,  # Maximal number of items per item set
    target = "rules"
  )
  rules <- apriori(transactions, parameter = parameters)
  return(rules)
}
find.rules <- function(transactions, support, confidence, topN = 10){
  # Generate and prune the rules for given support confidence value
  #
  # Args:
  #  transactions: Transaction object, list of transactions
  #  support: Minimum support threshold
  #  confidence: Minimum confidence threshold
  #  topN: Number of rules to keep (default 10)
  # Returns:
  #  A data frame with the best set of rules and their support and confidence values

  # Get rules for given combination of support and confidence
  all.rules <- get.rules(support, confidence, transactions)
  rules.df <- data.frame(rules = labels(all.rules), all.rules@quality)

  # Add conviction and leverage to the default quality measures
  other.im <- interestMeasure(all.rules, transactions = transactions)
  rules.df <- cbind(rules.df, other.im[, c('conviction', 'leverage')])

  # Keep the topN rules with the highest leverage
  best.rules.df <- head(rules.df[order(-rules.df$leverage), ], topN)
  return(best.rules.df)
}
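plot.graph <- function(cross.sell.rules){
  # Plot the associated items as graph
  #
  # Args:
  #  cross.sell.rules: Set of final rules recommended
  # Returns:
  #  None

  # Split each "lhs => rhs" label on the full " => " separator so that item
  # names carry no stray spaces and the same item maps to a single vertex
  edges <- unlist(strsplit(as.character(cross.sell.rules$rules),
                           split = " => ", fixed = TRUE))
  g <- make_graph(edges = edges)
  plot(g)
}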
support <- 0.01
confidence <- 0.2

columns <- c("order_id", "product_id") ## columns of interest in data file
data.path = '../../data/data.csv'      ## Path to data file

transactions.obj <- get.txn(data.path, columns) ## create txn object

cross.sell.rules <- find.rules( transactions.obj, support, confidence )
cross.sell.rules$rules <- as.character(cross.sell.rules$rules)

plot.graph(cross.sell.rules)
After exploring the dataset for support and confidence values, we set the support and confidence thresholds to 0.01 and 0.2 respectively, as in the preceding script.
We have written a function called find.rules, which internally calls get.rules. It returns the top N rules for the given transactions and support/confidence thresholds. We are interested in the top 10 rules. As discussed, we use lift to judge the strength of each association; the list itself is ordered by leverage, which is the measure find.rules sorts on. The following are our top 10 rules:
   rules                                                 support confidence     lift conviction   leverage
59 {Organic Hass Avocado} => {Bag of Organic Bananas} 0.03219805  0.3086420 1.900256   1.211498 0.01525399
63 {Organic Strawberries} => {Bag of Organic Bananas} 0.03577562  0.2753304 1.695162   1.155808 0.01467107
64 {Bag of Organic Bananas} => {Organic Strawberries} 0.03577562  0.2202643 1.695162   1.115843 0.01467107
52 {Limes} => {Large Lemon}                           0.01846022  0.2461832 3.221588   1.225209 0.01273006
53 {Large Lemon} => {Limes}                           0.01846022  0.2415730 3.221588   1.219648 0.01273006
51 {Organic Raspberries} => {Bag of Organic Bananas}  0.02318260  0.3410526 2.099802   1.271086 0.01214223
50 {Organic Raspberries} => {Organic Strawberries}    0.02003434  0.2947368 2.268305   1.233671 0.01120205
40 {Organic Yellow Onion} => {Organic Garlic}         0.01431025  0.2525253 4.084830   1.255132 0.01080698
41 {Organic Garlic} => {Organic Yellow Onion}         0.01431025  0.2314815 4.084830   1.227467 0.01080698
58 {Organic Hass Avocado} => {Organic Strawberries}   0.02432742  0.2331962 1.794686   1.134662 0.01077217
The first entry has a lift value of 1.9, indicating that the products are not independent: a basket containing {Organic Hass Avocado} is about 1.9 times as likely to contain a {Bag of Organic Bananas} as a basket picked at random. This rule has a support of about 3 percent and a confidence of about 31 percent. We recommend that the retailer use these two products in his cross-selling campaign because, given the lift value, a customer who picks up an {Organic Hass Avocado} is considerably more likely than average to pick up a {Bag of Organic Bananas}.
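The lift in the first row can be cross-checked from the other columns, since lift(A => B) = confidence(A => B) / support(B) = support(A => B) / (support(A) * support(B)). The following is a minimal sketch in base R using the numbers from the table; the variable names are ours and are not part of the script above:

```r
# Values copied from the first row of the rules table
# A = {Organic Hass Avocado}, B = {Bag of Organic Bananas}
support.ab    <- 0.03219805
confidence.ab <- 0.3086420
lift.ab       <- 1.900256

# confidence = support(A => B) / support(A), so support(A) can be recovered
support.a <- support.ab / confidence.ab
# lift = confidence / support(B), so support(B) can be recovered
support.b <- confidence.ab / lift.ab

# Lift recomputed from the three supports alone
lift.check <- support.ab / (support.a * support.b)

print(c(support.a = support.a, support.b = support.b, lift = lift.check))
```

The recovered supports say that roughly 10 percent of baskets contain the avocado and roughly 16 percent contain the bananas, which is the baseline the lift of 1.9 is measured against.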
You may have noticed that we have also included two other interest measures: conviction and leverage.
Leverage answers the question: how many more units of A and B are sold together than would be expected from their individual sales? With lift, we said that there is a high association between the {Bag of Organic Bananas} and {Organic Hass Avocado} products. With leverage, we are able to quantify that association in terms of sales. The leverage of about 0.015 means that the two products appear together in about 1.5 percentage points more transactions than would be expected if they sold independently, that is, roughly 15 additional joint purchases per 1,000 transactions. For a given rule A => B:
Leverage(A => B) = Support(A => B) - Support(A)*Support(B)
Leverage measures the difference between how often A and B appear together in the dataset and how often they would be expected to appear together if A and B were statistically independent.
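As a worked check of the leverage formula, the value reported for the first rule can be reproduced from its support, confidence, and lift. This is a sketch in base R with illustrative variable names of our own; the individual supports are derived from the table rather than read from the data:

```r
# Values copied from the first row of the rules table
# A = {Organic Hass Avocado}, B = {Bag of Organic Bananas}
support.ab    <- 0.03219805
confidence.ab <- 0.3086420
lift.ab       <- 1.900256

# Individual supports recovered from the definitions of confidence and lift
support.a <- support.ab / confidence.ab
support.b <- confidence.ab / lift.ab

# Joint support expected if A and B were statistically independent
expected.ab <- support.a * support.b

# Leverage(A => B) = Support(A => B) - Support(A) * Support(B)
leverage.ab <- support.ab - expected.ab

# The same quantity expressed as extra joint purchases per 1,000 transactions
extra.per.1000 <- 1000 * leverage.ab

print(c(leverage = leverage.ab, extra.per.1000 = extra.per.1000))
```

The result agrees with the leverage column of the table up to rounding of the inputs.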
Conviction is a measure to ascertain the direction of the rule. Unlike lift, conviction is sensitive to the rule direction. Conviction(A => B) is not the same as conviction(B => A).
For a rule A => B:
conviction(A => B) = (1 - support(B)) / (1 - confidence(A => B))
Conviction, with its sense of direction, gives us a hint that targeting the customers of Organic Hass Avocado to cross-sell will yield more sales of Bag of Organic Bananas rather than the other way round.
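The direction sensitivity can be illustrated by computing conviction in both directions for the avocado and banana rule. The sketch below uses base R and the numbers from the first row of the table; the values for the reverse rule are derived here from those numbers and are an assumption, since that rule does not appear in the top 10 list:

```r
# Values copied from the first row of the rules table
# A = {Organic Hass Avocado}, B = {Bag of Organic Bananas}
support.ab    <- 0.03219805
confidence.ab <- 0.3086420
lift.ab       <- 1.900256

# Individual supports recovered from the definitions of confidence and lift
support.a <- support.ab / confidence.ab
support.b <- confidence.ab / lift.ab

# conviction(A => B) = (1 - support(B)) / (1 - confidence(A => B))
conviction.ab <- (1 - support.b) / (1 - confidence.ab)

# Reverse rule B => A: same joint support, different confidence
confidence.ba <- support.ab / support.b
conviction.ba <- (1 - support.a) / (1 - confidence.ba)

print(c(conviction.ab = conviction.ab, conviction.ba = conviction.ba,
        confidence.ba = confidence.ba))
```

Under these derived values, the avocado-to-banana direction has the higher conviction, and the confidence of the reverse rule falls just below the 0.2 threshold, which would be consistent with its absence from the list.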
Thus, using lift, leverage, and conviction, we have provided all the empirical details to our retailer to design his cross-selling campaign. In our case, we have recommended the top 10 rules to the retailer based on leverage. To provide the results more intuitively and to indicate what items could go together in a cross-selling campaign, a graph visualization of the rules can be very appropriate.
The plot.graph function is used to visualize the rules that we have shortlisted based on their leverage values. It internally uses a package called igraph to create a graph representation of the rules:
Our suggestion to the retailer can be the largest subgraph on the left. Items in that graph can be leveraged for his cross-selling campaign. Depending on the profit margin and other factors, the retailer can now design his cross-selling campaign using the preceding output.
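The largest group of connected items can also be extracted programmatically instead of being read off the plot. The following is a sketch only: it assumes the igraph package is installed, hard-codes the ten rule labels from the table instead of calling find.rules, and uses weakly connected components as our interpretation of "largest subgraph". Where each subgraph is drawn depends on the plot layout.

```r
library(igraph)

# Rule labels copied from the top 10 table
rules <- c(
  "{Organic Hass Avocado} => {Bag of Organic Bananas}",
  "{Organic Strawberries} => {Bag of Organic Bananas}",
  "{Bag of Organic Bananas} => {Organic Strawberries}",
  "{Limes} => {Large Lemon}",
  "{Large Lemon} => {Limes}",
  "{Organic Raspberries} => {Bag of Organic Bananas}",
  "{Organic Raspberries} => {Organic Strawberries}",
  "{Organic Yellow Onion} => {Organic Garlic}",
  "{Organic Garlic} => {Organic Yellow Onion}",
  "{Organic Hass Avocado} => {Organic Strawberries}"
)

# Each rule contributes one directed edge: lhs -> rhs
edges <- unlist(strsplit(rules, split = " => ", fixed = TRUE))
g <- make_graph(edges = edges)

# Group items that are connected when edge direction is ignored
comp <- components(g, mode = "weak")

# Item names in the largest connected group
largest.items <- V(g)$name[comp$membership == which.max(comp$csize)]
print(largest.items)
```

For these ten rules, the largest group contains the avocado, bananas, strawberries, and raspberries, while the lime/lemon and onion/garlic pairs form two separate two-item groups.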