Book Image

R for Data Science

By : Dan Toomey
Book Image

R for Data Science

By: Dan Toomey

Overview of this book

Table of Contents (19 chapters)

Association rules


Association rules describe associations between two datasets. This is most commonly used in market basket analysis. Given a set of transactions with multiple, different items per transaction (shopping bag), how can the item sales be associated? The most common associations are as follows:

  • Support: This is the percentage of transactions that contain A and B.

  • Confidence: This is the percentage (of time that rule is correct) of cases containing A that also contain B.

  • Lift: This is the ratio of confidence to the percentage of cases containing B. Please note that if lift is 1, then A and B are independent.

Mine for associations

The most widely used tool in R from association rules is apriori.

Usage

The apriori rules library can be called as follows:

apriori(data, parameter = NULL, appearance = NULL, control = NULL)

The various parameters of the apriori library are explained in the following table:

Parameter

Description

data

This is the transaction data.

parameter

This stores the default behavior to mine, with support as 0.1, confidence as 0.8, and maxlen as 10. You can change parameter values accordingly.

appearance

This is used to restrict items that appear in rules.

control

This is used to adjust the performance of the algorithm used.

Example

You will need to load the apriori rules library as follows:

> install.packages("arules")
> library(arules)

The market basket data can be loaded as follows:

> data <- read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/03/groceries.csv")

Then, we can generate rules from the data as follows:

> rules <- apriori(data) 

parameter specification:
confidenceminvalsmaxaremavaloriginalSupport support minlenmaxlen target
        0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules
   ext
 FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[655 item(s), 15295 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [5 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

There are several points to highlight in the results:

  • As you can see from the display, we are using the default settings (confidence 0.8, and so on)

  • We found 15,000 transactions for three items (picked from the 655 total items available)

  • We generated five rules

We can examine the rules that were generated as follows:

> rules

set of 5 rules 
> inspect(rules)

lhsrhs              support confidence     lift
1 {semi.finished.bread=} => {margarine=}   0.2278522          1 2.501226
2 {semi.finished.bread=} => {ready.soups=} 0.2278522          1 1.861385
3 {margarine=}           => {ready.soups=} 0.3998039          1 1.861385
4 {semi.finished.bread=,                                                
   margarine=}           => {ready.soups=} 0.2278522          1 1.861385
5 {semi.finished.bread=,                                                
   ready.soups=}         => {margarine=}   0.2278522          1 2.501226

The code has been slightly reformatted for readability.

Looking over the rules, there is a clear connection between buying bread, soup, and margarine—at least in the market where and when the data was gathered.

If we change the parameters (thresholds) used in the calculation, we get a different set of rules. For example, check the following code:

> rules <- apriori(data, parameter = list(supp = 0.001, conf = 0.8))

This code generates over 500 rules, but they have questionable meaning as we now have the rules with 0.001 confidence.