In the previous section, we saw that weighted association rule mining leads to recommendations that can increase the profitability of the retailer. Intuitively, weighted association rule mining is superior to vanilla association rule mining because the generated rules are sensitive to the transactions' weights. So instead of running the plain version of the algorithm, can we always run the weighted version? In the previous section we were lucky enough to be given transaction weights. What if the retailer had not provided them? Can we infer weights from the transactions themselves? When our transaction data does not come with preassigned weights, we need some way to assign importance to the transactions. For instance, we could say that a transaction with many items should be considered more important than a transaction with a single item.

The `arules` package provides a method called **HITS** to help us do exactly that: infer transaction weights. **HITS** stands for **Hyperlink-Induced Topic Search**, a link analysis algorithm developed by Jon Kleinberg to rate web pages. According to HITS, hubs are pages with a large out degree and authorities are pages with a large in degree.

### Note

According to graph theory, a graph is formed by a set of vertices and edges; the edges connect the vertices. Graphs with directed edges are called directed graphs. In a directed graph, the in degree of a vertex is the number of edges directed into it, and the out degree is the number of edges directed out of it.
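As a quick illustration of these definitions (a toy graph of our own, not from this chapter's data), we can compute in and out degrees from an adjacency matrix in base R:

```r
# Toy directed graph with three pages P1, P2, P3.
# adj[i, j] == 1 means there is an edge from vertex i to vertex j.
adj <- matrix(c(0, 1, 1,
                0, 0, 1,
                0, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("P1", "P2", "P3"), c("P1", "P2", "P3")))

out.degree <- rowSums(adj)  # edges leaving each vertex (hub-like behavior)
in.degree  <- colSums(adj)  # edges entering each vertex (authority-like behavior)

out.degree  # P1 = 2, P2 = 1, P3 = 0
in.degree   # P1 = 0, P2 = 1, P3 = 2
```

Here **P1** links out to everything (a hub), while **P3** is linked to by everything (an authority).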

The rationale is that if a lot of pages link to a particular page, then that page is considered an authority. If a page has links to a lot of other pages, it is considered a hub.

### Note

The paper, *Mining Weighted Association Rules without Preassigned Weights* by Ke Sun and Fengshan Bai, details the adaptation of the HITS algorithm for transaction databases.

The basic idea behind using the HITS algorithm for association rule mining is that frequent items may not be as important as they seem. The paper presents an algorithm that applies the HITS methodology to bipartite graphs.

### Note

In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V. Vertex sets are usually called the parts of the graph. - Wikipedia (https://en.wikipedia.org/wiki/Bipartite_graph)

In the HITS variant modified for transaction databases, the transactions and products are treated as a bipartite graph, with an arc going from a transaction to a product if the product is present in the transaction. The following diagram, reproduced from the paper *Mining Weighted Association Rules without Preassigned Weights* by Ke Sun and Fengshan Bai, illustrates how transactions in a database are converted to a bipartite graph:

In this representation, the transactions can be ranked using the HITS algorithm. Note that the support of an item is proportional to its in degree: consider item **A**, whose absolute support is 4; in the graph, the in degree of **A** is four. As you can see, by considering only support we completely ignore transaction importance, unless the importance of the transactions is explicitly provided. So how do we get the transaction weights in this scenario? The intuition is circular but powerful: a good transaction should be highly weighted and should contain many good items, and a good item should be contained in many good transactions. By treating the transactions as hubs and the products as authorities, the algorithm invokes HITS on this bipartite graph.
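To make the bipartite representation concrete, here is a small sketch with assumed toy data (four transactions, three items; not the book's dataset). The graph is stored as a transaction-by-item incidence matrix, where a 1 at `[t, i]` means there is an arc from transaction `t` to item `i`:

```r
# Toy transaction database as a bipartite incidence matrix:
# T1 = {A, B}, T2 = {A, C}, T3 = {A, B, C}, T4 = {A}
inc <- matrix(c(1, 1, 0,
                1, 0, 1,
                1, 1, 1,
                1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("T1", "T2", "T3", "T4"), c("A", "B", "C")))

# The in degree of an item node equals its absolute support:
colSums(inc)  # A = 4, B = 2, C = 2
```

Item **A** appears in all four transactions, so its in degree (and absolute support) is 4, mirroring the example above.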

The `arules` package provides the `hits` method, which implements the algorithm that we described earlier:

```
weights.vector <- hits(transactions.obj, type = "relative")
weights.df <- data.frame(transactionID = labels(weights.vector),
                         weight = weights.vector)
```

We invoke the algorithm using the `hits` method. We have described the intuition behind using the HITS algorithm on transaction databases to assign weights to our transactions; we will now briefly describe how the HITS algorithm functions. Curious readers can refer to *Authoritative Sources in a Hyperlinked Environment*, J.M. Kleinberg, *J. ACM*, vol. 46, no. 5, pp. 604-632, 1999, for a deeper understanding of the HITS algorithm.

To begin with, the HITS algorithm initializes the weights of all nodes to one; in our case, the scores of both the items and the transactions are set to one. We maintain two arrays, one for the hub weights and one for the authority weights.

It then performs the following three steps iteratively, until the hub and authority arrays stabilize, that is, until they no longer change with subsequent updates:

1. **Authority node score update**: Set the authority score of each node to the sum of the hub scores of the nodes that point to it.
2. **Hub node score update**: Set the hub score of each node to the sum of the authority scores of the nodes that it points to.
3. **Normalization**: Unit normalize the hub and authority scores.
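The three steps above can be sketched in a few lines of base R. This is a minimal illustration run on an assumed toy incidence matrix (not the book's dataset, and not the `arules` implementation): rows (transactions) act as hubs, columns (items) as authorities.

```r
# Toy transaction-by-item incidence matrix:
# T1 = {A, B}, T2 = {A, C}, T3 = {A, B, C}, T4 = {A}
inc <- matrix(c(1, 1, 0,
                1, 0, 1,
                1, 1, 1,
                1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("T1", "T2", "T3", "T4"), c("A", "B", "C")))

hub  <- rep(1, nrow(inc))  # one hub score per transaction, initialized to one
auth <- rep(1, ncol(inc))  # one authority score per item, initialized to one

for (i in 1:50) {
  auth <- as.vector(t(inc) %*% hub)  # authority update: sum of pointing hubs
  hub  <- as.vector(inc %*% auth)    # hub update: sum of pointed-to authorities
  auth <- auth / sqrt(sum(auth^2))   # unit normalize both score vectors
  hub  <- hub / sqrt(sum(hub^2))
}

names(hub)  <- rownames(inc)
names(auth) <- colnames(inc)
round(hub, 3)   # transaction weights (hub scores)
round(auth, 3)  # item authority scores
```

As expected, **T3** (which contains all three items) ends up with the highest hub score, and item **A** (present in every transaction) with the highest authority score.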

At the end of the algorithm, every item has an authority score, which is the sum of the hub scores of all the transactions that contain that item, and every transaction has a hub score, which is the sum of the authority scores of all the items in that transaction. Using the weights created by the HITS algorithm, we build a `weights.df` data frame:

```
head(weights.df)
        transactionID       weight
1000431       1000431 1.893931e-04
100082         100082 1.409053e-04
1000928       1000928 2.608214e-05
1001517       1001517 1.735461e-04
1001650       1001650 1.184581e-04
1001934       1001934 2.310465e-04
```

We pass `weights.df` to our transactions object by assigning it to the `itemsetInfo` slot. We can now generate the weighted association rules:

```
transactions.obj@itemsetInfo <- weights.df
```

```
support <- 0.01
parameters = list(
  support = support,
  minlen  = 2,  # Minimal number of items per item set
  maxlen  = 10, # Maximal number of items per item set
  target  = "frequent itemsets"
)
weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
weclat.itemsets.df <- data.frame(weclat.itemsets = labels(weclat.itemsets),
                                 weclat.itemsets@quality)
```

```
weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence = 0.1)
weclat.rules.df <- data.frame(weclat.rules = labels(weclat.rules),
                              weclat.rules@quality)
```

We can look into the output data frames created for the frequent itemsets and the rules:

```
head(weclat.itemsets.df)
                                       weclat.itemsets    support
1                               {Banana,Russet Potato} 0.01074109
2                {Banana,Total 0% Nonfat Greek Yogurt} 0.01198206
3      {100% Raw Coconut Water,Bag of Organic Bananas} 0.01024201
4 {Organic Roasted Turkey Breast,Organic Strawberries} 0.01124278
5                                 {Banana,Roma Tomato} 0.01089124
6                              {Banana,Bartlett Pears} 0.01345293
```

```
tail(weclat.itemsets.df)
                                                       weclat.itemsets    support
540 {Bag of Organic Bananas,Organic Baby Spinach,Organic Strawberries} 0.02142840
541                 {Banana,Organic Baby Spinach,Organic Strawberries} 0.02446832
542                      {Bag of Organic Bananas,Organic Baby Spinach} 0.06536606
543                                      {Banana,Organic Baby Spinach} 0.07685530
544                      {Bag of Organic Bananas,Organic Strawberries} 0.08640422
545                                      {Banana,Organic Strawberries} 0.08226264
```

```
head(weclat.rules.df)
                                              weclat.rules     support confidence     lift itemset
1                              {Russet Potato} => {Banana} 0.005580996  0.3714286 1.782653       1
3               {Total 0% Nonfat Greek Yogurt} => {Banana} 0.005580996  0.4148936 1.991261       2
6     {100% Raw Coconut Water} => {Bag of Organic Bananas} 0.004865484  0.3238095 1.993640       3
8  {Organic Roasted Turkey Breast} => {Organic Strawberries} 0.004579279  0.3440860 2.648098     4
9                                {Roma Tomato} => {Banana} 0.006010303  0.3181818 1.527098       5
11                            {Bartlett Pears} => {Banana} 0.007870635  0.4545455 2.181568       6
```

Based on the weights generated using the HITS algorithm, we can now order the items by their authority score. This is an alternative to ranking the items by their raw frequency. We can leverage the `itemFrequency` function to generate the item scores:

```
freq.weights  <- head(sort(itemFrequency(transactions.obj, weighted = TRUE),
                           decreasing = TRUE), 20)
freq.nweights <- head(sort(itemFrequency(transactions.obj, weighted = FALSE),
                           decreasing = TRUE), 20)
```

```
compare.df <- data.frame("items"    = names(freq.weights),
                         "score"    = freq.weights,
                         "items.nw" = names(freq.nweights),
                         "score.nw" = freq.nweights)
row.names(compare.df) <- NULL
```

Let's look at the `compare.df` data frame:

The **score** column gives the weighted item score derived from the HITS transaction weights, while the **score.nw** column gives the plain relative transaction frequency. You can see that **Limes** and **Large Lemon** have swapped places. **Strawberries** has moved further up the order compared to its rank under the original transaction frequency score.

The code is as follows:

```
###############################################################################
#
#   R Data Analysis Projects
#
#   Chapter 1
#
#   Building Recommender System
#   A step-by-step approach to build Association Rule Mining
#
#   Script:
#
#   RScript to explain application of hits to
#   transaction database.
#
#   Gopi Subramanian
#
###############################################################################
```

```
library(arules)
```

```
get.txn <- function(data.path, columns){
  # Get transaction object for a given data file
  #
  # Args:
  #  data.path:  data file name location
  #  columns: transaction id and item id columns.
  #
  # Returns:
  #   transaction object
  transactions.obj <- read.transactions(file = data.path, format = "single",
                                        sep = ",",
                                        cols = columns,
                                        rm.duplicates = FALSE,
                                        quote = "", skip = 0,
                                        encoding = "unknown")
  return(transactions.obj)
}
```

```
## Create txn object
columns <- c("order_id", "product_id")          ## columns of interest in data file
data.path = '../../data/data.csv'               ## Path to data file
transactions.obj <- get.txn(data.path, columns) ## create txn object
```

```
## Generate weight vector using hits
weights.vector <- hits(transactions.obj, type = "relative")
weights.df <- data.frame(transactionID = labels(weights.vector),
                         weight = weights.vector)
```

```
head(weights.df)
```

```
transactions.obj@itemsetInfo <- weights.df
```

```
## Frequent item sets generation
support <- 0.01
parameters = list(
  support = support,
  minlen  = 2,  # Minimal number of items per item set
  maxlen  = 10, # Maximal number of items per item set
  target  = "frequent itemsets"
)
weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
weclat.itemsets.df <- data.frame(weclat.itemsets = labels(weclat.itemsets),
                                 weclat.itemsets@quality)
```

```
head(weclat.itemsets.df)
tail(weclat.itemsets.df)
```

```
## Rule induction
weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence = 0.3)
weclat.rules.df <- data.frame(weclat.rules = labels(weclat.rules),
                              weclat.rules@quality)
```

```
head(weclat.rules.df)
```

```
freq.weights  <- head(sort(itemFrequency(transactions.obj, weighted = TRUE),
                           decreasing = TRUE), 20)
freq.nweights <- head(sort(itemFrequency(transactions.obj, weighted = FALSE),
                           decreasing = TRUE), 20)
```

```
compare.df <- data.frame("items"    = names(freq.weights),
                         "score"    = freq.weights,
                         "items.nw" = names(freq.nweights),
                         "score.nw" = freq.nweights)
row.names(compare.df) <- NULL
```