
Hyperlink-induced topic search (HITS)


In the previous section, we saw that weighted association rule mining produces recommendations that can increase the profitability of the retailer. Intuitively, weighted association rule mining is superior to vanilla association rule mining because the generated rules are sensitive to the transaction weights. We were lucky enough to be given transaction weights; but what if the retailer had not provided them? Could we still run the weighted version of the algorithm instead of the plain one? Can we infer the weights from the transactions themselves? When our transaction data does not come with preassigned weights, we need some way to assign importance to the transactions. For instance, we could say that a transaction containing many items should be considered more important than a transaction with a single item.

The arules package provides a method called hits to help us do exactly that—infer transaction weights. HITS stands for Hyperlink-Induced Topic Search, a link analysis algorithm developed by Jon Kleinberg to rate web pages. According to HITS, hubs are pages with a large out degree and authorities are pages with a large in degree.

Note

According to graph theory, a graph is formed by a set of vertices and a set of edges; the edges connect the vertices. Graphs with directed edges are called directed graphs. The in degree of a vertex in a directed graph is the number of edges directed into it. Similarly, the out degree is the number of edges directed out of it.

The rationale is that if a lot of pages link to a particular page, then that page is considered an authority. If a page has links to a lot of other pages, it is considered a hub.
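To make this concrete, here is a small base R illustration of in degree and out degree on a made-up directed graph of web pages (the page names and links are hypothetical, used purely for illustration):

# Toy directed graph: each row is an edge, the 'from' page links to the 'to' page
edges <- data.frame(
  from = c("pageA", "pageA", "pageB", "pageC", "pageD"),
  to   = c("pageX", "pageY", "pageX", "pageX", "pageY")
)

# Out degree: number of edges leaving a vertex (hub-like behaviour)
table(edges$from)
# pageA pageB pageC pageD
#     2     1     1     1

# In degree: number of edges entering a vertex (authority-like behaviour)
table(edges$to)
# pageX pageY
#     3     2

Here pageX, with the largest in degree, plays the role of an authority, while pageA, with the largest out degree, behaves like a hub.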

Note

The paper, Mining Weighted Association Rules without Preassigned Weights by Ke Sun and Fengshan Bai, details the adaptation of the HITS algorithm for transaction databases.

The basic idea behind using the HITS algorithm for association rule mining is that frequent items may not be as important as they seem. The paper presents an algorithm that applies the HITS methodology to bipartite graphs.

Note

In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V. Vertex sets are usually called the parts of the graph. - Wikipedia (https://en.wikipedia.org/wiki/Bipartite_graph)

In the HITS algorithm modified for transaction databases, the transactions and products are treated as a bipartite graph, with an arc going from a transaction to a product if the product is present in that transaction. The following diagram, reproduced from the paper Mining Weighted Association Rules without Preassigned Weights by Ke Sun and Fengshan Bai, illustrates how the transactions in a database are converted to a bipartite graph:

In this representation, the transactions can be ranked using the HITS algorithm. The support of an item is proportional to its in degree. Consider item A: its absolute support is 4, and in the graph the in degree of A is four. As you can see, by considering only the support we completely ignore transaction importance, unless the importance of the transactions is explicitly provided. So how do we get transaction weights in this scenario? The intuition is one of mutual reinforcement: a good transaction, one that should be highly weighted, contains many good items, and a good item is contained in many good transactions. By treating the transactions as hubs and the products as authorities, we can invoke HITS on this bipartite graph.
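As a rough sketch of this representation, the following snippet builds a tiny transaction-item incidence matrix in base R (the four transactions and four items are made up and are not the book's data set). Each row is a transaction (a hub), each column an item (an authority), and a 1 marks an arc from the transaction to the item; an item's column sum is its in degree, which equals its absolute support:

toy.txns <- list(
  t1 = c("A", "B"),
  t2 = c("A", "C"),
  t3 = c("A", "B", "C"),
  t4 = c("A", "D")
)
items <- sort(unique(unlist(toy.txns)))

# Rows: transactions (hubs); columns: items (authorities)
incidence <- t(sapply(toy.txns, function(txn) as.integer(items %in% txn)))
colnames(incidence) <- items
incidence
#    A B C D
# t1 1 1 0 0
# t2 1 0 1 0
# t3 1 1 1 0
# t4 1 0 0 1

# In degree of each item = its absolute support
colSums(incidence)
# A B C D
# 4 2 2 1

In this toy example, item A has an in degree of 4, mirroring the support argument above; note that the column sums alone cannot distinguish a single-item transaction from a large one, which is exactly the information HITS recovers.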

The arules package provides the hits method, which implements the algorithm that we described earlier:

weights.vector <- hits(transactions.obj, type = "relative")
weights.df <- data.frame(transactionID = labels(weights.vector),
                         weight = weights.vector)

We invoke the algorithm using the hits method. We have described the intuition behind using the HITS algorithm on transaction databases to assign weights to our transactions; let us now briefly describe how the HITS algorithm itself works. Curious readers can refer to Authoritative Sources in a Hyperlinked Environment, J. M. Kleinberg, Journal of the ACM, vol. 46, no. 5, pp. 604-632, 1999, for a deeper understanding of the HITS algorithm.

To begin with, the HITS algorithm initializes the weights of all nodes to one; in our case, the scores of both the items and the transactions are set to one. We maintain two arrays, one for the hub scores and the other for the authority scores.

It then repeats the following three steps until the hub and authority arrays stabilize, that is, until they no longer change with subsequent updates (a small R sketch of these updates follows the list):

  1. Authority score update: Set the authority score of each node to the sum of the hub scores of the nodes that point to it.
  2. Hub score update: Set the hub score of each node to the sum of the authority scores of the nodes that it points to.
  3. Normalization: Unit normalize the hub and authority scores, then repeat from step 1 until the scores stabilize.
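The following is a minimal base R sketch of these three steps, run on a transaction-item incidence matrix such as the toy one built earlier. It only illustrates the iteration; it is not the implementation used inside arules, and its scores may differ from those of hits() by a normalization factor:

hits.sketch <- function(incidence, max.iter = 100, tol = 1e-9) {
  hub  <- rep(1, nrow(incidence))   # one hub score per transaction
  auth <- rep(1, ncol(incidence))   # one authority score per item
  for (i in seq_len(max.iter)) {
    # 1. Authority update: sum of hub scores of transactions containing the item
    auth.new <- as.vector(t(incidence) %*% hub)
    # 2. Hub update: sum of authority scores of items in the transaction
    hub.new <- as.vector(incidence %*% auth.new)
    # 3. Unit normalize both score vectors
    auth.new <- auth.new / sqrt(sum(auth.new^2))
    hub.new  <- hub.new / sqrt(sum(hub.new^2))
    delta <- max(abs(auth.new - auth), abs(hub.new - hub))
    auth <- auth.new
    hub  <- hub.new
    if (delta < tol) break   # stop once the scores stabilize
  }
  list(hub = hub, authority = auth)
}

# Hub scores rank the toy transactions; authority scores rank the toy items
hits.sketch(incidence)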

At the end of the algorithm, every item has an authority score, which is the sum of the hub scores of all the transactions that contain that item, and every transaction has a hub score, which is the sum of the authority scores of all the items in that transaction. Using the weights produced by the HITS algorithm, we create the weights.df data frame:

head(weights.df)
        transactionID       weight
1000431       1000431 1.893931e-04
100082         100082 1.409053e-04
1000928       1000928 2.608214e-05
1001517       1001517 1.735461e-04
1001650       1001650 1.184581e-04
1001934       1001934 2.310465e-04
Next, we pass weights.df to our transactions object by assigning it to the itemsetInfo slot. We can now generate the weighted association rules:
transactions.obj@itemsetInfo <- weights.df
support <- 0.01
parameters <- list(
  support = support,
  minlen = 2,   # Minimal number of items per item set
  maxlen = 10,  # Maximal number of items per item set
  target = "frequent itemsets"
)
weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
weclat.itemsets.df <- data.frame(weclat.itemsets = labels(weclat.itemsets),
                                 weclat.itemsets@quality)
weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence = 0.1)
weclat.rules.df <- data.frame(weclat.rules = labels(weclat.rules),
                              weclat.rules@quality)
We can look into the output data frames created for frequent itemsets and rules:
head(weclat.itemsets.df)
  weclat.itemsets support
1 {Banana,Russet Potato} 0.01074109
2 {Banana,Total 0% Nonfat Greek Yogurt} 0.01198206
3 {100% Raw Coconut Water,Bag of Organic Bananas} 0.01024201
4 {Organic Roasted Turkey Breast,Organic Strawberries} 0.01124278
5 {Banana,Roma Tomato} 0.01089124
6 {Banana,Bartlett Pears} 0.01345293
tail(weclat.itemsets.df)
  weclat.itemsets support
540 {Bag of Organic Bananas,Organic Baby Spinach,Organic Strawberries} 0.02142840
541 {Banana,Organic Baby Spinach,Organic Strawberries} 0.02446832
542 {Bag of Organic Bananas,Organic Baby Spinach} 0.06536606
543 {Banana,Organic Baby Spinach} 0.07685530
544 {Bag of Organic Bananas,Organic Strawberries} 0.08640422
545 {Banana,Organic Strawberries} 0.08226264
head(weclat.rules.df)
  weclat.rules support confidence lift itemset
1 {Russet Potato} => {Banana} 0.005580996 0.3714286 1.782653 1
3 {Total 0% Nonfat Greek Yogurt} => {Banana} 0.005580996 0.4148936 1.991261 2
6 {100% Raw Coconut Water} => {Bag of Organic Bananas} 0.004865484 0.3238095 1.993640 3
8 {Organic Roasted Turkey Breast} => {Organic Strawberries} 0.004579279 0.3440860 2.648098 4
9 {Roma Tomato} => {Banana} 0.006010303 0.3181818 1.527098 5
11 {Bartlett Pears} => {Banana} 0.007870635 0.4545455 2.181568 6

Based on the weights generated using the HITS algorithm, we can now order the items by their authority score. This gives us an alternative way of ranking the items, in addition to ranking them by their raw frequency. We can leverage the itemFrequency function to generate the item scores:

freq.weights <- head(sort(itemFrequency(transactions.obj, weighted = TRUE),
                          decreasing = TRUE), 20)
freq.nweights <- head(sort(itemFrequency(transactions.obj, weighted = FALSE),
                           decreasing = TRUE), 20)
compare.df <- data.frame("items" = names(freq.weights),
                         "score" = freq.weights,
                         "items.nw" = names(freq.nweights),
                         "score.nw" = freq.nweights)
row.names(compare.df) <- NULL

Let's look at the compare.df data frame:

The score column is the authority score from the HITS algorithm (the weighted item frequency), while the score.nw column gives the plain relative item frequency. You can see that Limes and Large Lemon have interchanged places. Strawberries has moved further up the order compared to its rank by raw transaction frequency.

The code is as follows:

###############################################################################
#
# R Data Analysis Projects
#
# Chapter 1
#
# Building a Recommender System
# A step-by-step approach to build Association Rule Mining
#
# Script:
#
# R script to explain the application of HITS to
# a transaction database.
#
# Gopi Subramanian
###############################################################################
library(arules)

get.txn <- function(data.path, columns){
  # Get transaction object for a given data file
  #
  # Args:
  #  data.path: data file name location
  #  columns: transaction id and item id columns.
  #
  # Returns:
  #  transaction object
  transactions.obj <- read.transactions(file = data.path, format = "single",
                                        sep = ",",
                                        cols = columns,
                                        rm.duplicates = FALSE,
                                        quote = "", skip = 0,
                                        encoding = "unknown")
  return(transactions.obj)
}
## Create txn object
columns <- c("order_id", "product_id")           ## columns of interest in data file
data.path <- '../../data/data.csv'               ## Path to data file
transactions.obj <- get.txn(data.path, columns)  ## create txn object

## Generate weight vector using hits
weights.vector <- hits(transactions.obj, type = "relative")
weights.df <- data.frame(transactionID = labels(weights.vector),
                         weight = weights.vector)
head(weights.df)
transactions.obj@itemsetInfo <- weights.df

## Frequent item sets generation
support <- 0.01
parameters <- list(
  support = support,
  minlen = 2,   # Minimal number of items per item set
  maxlen = 10,  # Maximal number of items per item set
  target = "frequent itemsets"
)
weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
weclat.itemsets.df <- data.frame(weclat.itemsets = labels(weclat.itemsets),
                                 weclat.itemsets@quality)
head(weclat.itemsets.df)
tail(weclat.itemsets.df)

## Rule induction
weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence = 0.3)
weclat.rules.df <- data.frame(weclat.rules = labels(weclat.rules),
                              weclat.rules@quality)
head(weclat.rules.df)

freq.weights <- head(sort(itemFrequency(transactions.obj, weighted = TRUE),
                          decreasing = TRUE), 20)
freq.nweights <- head(sort(itemFrequency(transactions.obj, weighted = FALSE),
                           decreasing = TRUE), 20)
compare.df <- data.frame("items" = names(freq.weights),
                         "score" = freq.weights,
                         "items.nw" = names(freq.nweights),
                         "score.nw" = freq.nweights)
row.names(compare.df) <- NULL