In this chapter, we will learn how to mine frequent patterns, association rules, and correlation rules when working with R programs. Then, we will evaluate all these methods with benchmark data to determine the interestingness of the frequent patterns and rules. We will cover the following topics in this chapter:
- Introduction to associations and patterns
- Market basket analysis
- Hybrid association rules mining
- Mining sequence datasets
- High-performance algorithms
The algorithms that find frequent items can be applied to numeric or categorical data. Most of them share one common algorithmic basis, A-Priori; the other basic algorithm is FP-Growth, which solves the same task with a different strategy. Most pattern-related mining algorithms derive from these two.
With frequent patterns found as one input, many algorithms are designed to find association and correlation rules; each is essentially a variation on the basic algorithms.
Along with the growth in the size and variety of datasets from various domains, new algorithms have been designed, such as the multistage algorithm, the multihash algorithm, and the limited-pass algorithm.
One popular task for data mining is to find relations within the source dataset; this is based on searching for frequent patterns in various data sources, such as market baskets, graphs, and streams.
All the algorithms illustrated in this chapter are written from scratch in the R language to explain association analysis, and their output is verified against standard R packages for these algorithms, such as arules.
With applications across a broad range of fields, frequent pattern mining is often used to solve various problems, such as market investigation for a shopping mall based on its transaction data.
Frequent patterns are the ones that often occur in the source dataset. The dataset types for frequent pattern mining can be itemset, subsequence, or substructure. As a result, the frequent patterns found are known as:
- Frequent itemset
- Frequent subsequence
- Frequent substructures
These three frequent patterns will be discussed in detail in the upcoming sections.
These newly found frequent patterns serve as an important platform when searching for recurring, interesting rules or relationships in the given dataset.
Various patterns are proposed to improve the efficiency of mining on a dataset. Some of them are as follows; they will be defined in detail later:
- Closed patterns
- Maximal patterns
- Approximate patterns
- Condensed patterns
- Discriminative frequent patterns
The frequent itemset originated from market basket analysis. In a store such as Amazon, there are many orders or transactions; a certain customer performs a transaction in which their Amazon shopping cart contains some items. The mass of all customers' transactions can be used by the store owner to find out which items are purchased together by customers. As a simple definition, an itemset denotes a collection of zero or more items.
We call a transaction a basket, and a set of items can belong to any basket. We set the variable s as the support threshold, which is compared with the count of a certain set of items that appears across all the baskets. If this count is not less than s, we call the itemset a frequent itemset.
An itemset is called a k-itemset if it contains k items, where k is a positive integer. The support count of an itemset X, support_count(X), is the number of itemsets (transactions) in the dataset that contain X.
For a predefined minimum support threshold s, the itemset X is a frequent itemset if support_count(X) >= s. The minimum support threshold s is a customizable parameter, which can be adjusted by domain experts or from experience.
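To make the definition concrete, here is a minimal base-R sketch; the `support_count` helper and the toy transactions are illustrative, not from the chapter's datasets:

```r
# Toy transactions: each one is a character vector of item names.
transactions <- list(
  c("beer", "chips"),
  c("beer", "chips", "wine"),
  c("beer", "wine"),
  c("chips")
)

# Support count of itemset X: how many transactions contain every item of X.
support_count <- function(X, D) {
  sum(vapply(D, function(t) all(X %in% t), logical(1)))
}

support_count(c("beer"), transactions)           # 3
support_count(c("beer", "chips"), transactions)  # 2: frequent if s <= 2
```

With a threshold s = 2, both {beer} and {beer, chips} are frequent itemsets here.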
The frequent itemset is also used in many domains. Some of them are shown in the following table:
| | Items | Baskets | Comments |
|---|---|---|---|
| Related concepts | Words | Documents | |
| Plagiarism | Documents | Sentences | |
| Biomarkers | Biomarkers and diseases | The set of data about a patient | |
If an itemset is frequent, then any of its subsets must be frequent. This is known as the A-Priori principle, the foundation of the A-Priori algorithm. The direct application of the A-Priori principle is to prune the huge number of candidate itemsets.
One important factor that affects the number of frequent itemsets is the minimum support count: the lower the minimum support count, the larger the number of frequent itemsets.
For the purpose of optimizing the frequent itemset-generation algorithm, some more concepts are proposed:
- An itemset X is closed in dataset S if there exists no proper superset of X with the same support count; X is then called a closed itemset. If X is also frequent, it is a closed frequent itemset.
- An itemset X is a maximal frequent itemset if X is frequent and there is no frequent itemset Y with X ⊂ Y; in other words, X has no frequent supersets.
- An itemset X is a constrained frequent itemset if it is frequent and satisfies the user-specified constraints.
- An itemset X is an approximate frequent itemset if only an approximate support count is derived for it.
- An itemset X is a top-k frequent itemset in dataset S if X is among the k most frequent itemsets, given a user-defined value k.
The following example is a transaction dataset over a small set of items. Let's assume that the minimum support count is 3.
| tid (transaction ID) | List of items in the itemset or transaction |
|---|---|
| T001 | |
| T002 | |
| T003 | |
| T004 | |
| T005 | |
| T006 | |
| T007 | |
| T008 | |
| T009 | |
| T010 | |
Scanning this dataset with that threshold then yields the frequent itemsets.
A frequent sequence is an ordered list of elements where each element contains at least one event. An example of this is the page-visit sequence on a site; more concretely, the order in which a certain user visits web pages. Here are two examples of the frequent subsequence:
- Customer: The successive shopping records of a certain customer in a shopping mart serve as the sequence, each item bought serves as an event item, and all the items bought by a customer in one shopping trip are treated as an element or transaction
- Web usage data: A user's visit history on the WWW is treated as a sequence, each UI/page serves as an event or item, and an element or transaction can be defined as the pages visited by a user with one click of the mouse
The length of a sequence is the number of items contained in the sequence; a sequence of length k is called a k-sequence. The size of a sequence is the number of itemsets in the sequence. We call a sequence s1 a subsequence of a sequence s2, and s2 a supersequence of s1, when each element of s1 is contained, in order, in some element of s2.
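As a sketch of the subsequence definition, the following base-R function (hypothetical, not part of any package) tests whether one sequence is a subsequence of another; each sequence is a list of elements, and each element is a set of events:

```r
# s is a subsequence of t if each element of s is contained, in order,
# in a distinct element of t (greedy left-to-right matching suffices).
is_subsequence <- function(s, t) {
  j <- 1
  for (el in s) {
    matched <- FALSE
    while (j <= length(t)) {
      j <- j + 1
      if (all(el %in% t[[j - 1]])) { matched <- TRUE; break }
    }
    if (!matched) return(FALSE)
  }
  TRUE
}

s <- list(c("a"), c("b", "c"))
t <- list(c("a", "d"), c("e"), c("b", "c", "f"))
is_subsequence(s, t)                      # TRUE
is_subsequence(list(c("e"), c("a")), t)   # FALSE: order is not preserved
```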
In some domains, the tasks under research can be modeled with a graph theory. As a result, there are requirements for mining common subgraphs (subtrees or sublattices); some examples are as follows:
- Web mining: Web pages are treated as the vertices of a graph, links between pages serve as edges, and a user's page-visiting records construct the graph.
- Network computing: Any device with computation ability on the network serves as the vertex, and the interconnection between these devices serves as the edge. The whole network that is made up of these devices and interconnections is treated as a graph.
- Semantic web: XML elements serve as the vertices, and the parent/child relations between them are edges; all these XML files are treated as graphs.
A graph G is represented by G = (V, E), where V represents a group of vertices and E represents a group of edges. A graph G' = (V', E') is called a subgraph of graph G = (V, E) once V' ⊆ V and E' ⊆ E. Here is an example of a subgraph: on the left-hand side of the following figure is the original graph with its vertices and edges, and on the right-hand side is the subgraph with some edges (or, in other circumstances, vertices) omitted:
Mining of association rules is based on the frequent patterns found. The different emphases on the interestingness of relations give rise to two types of relations for further research: association rules and correlation rules.
In a later section, a method to show association analysis is illustrated; this is a useful method to discover interesting relationships within a huge dataset. The relations can be represented in the form of association rules or frequent itemsets.
Association rule mining finds the set of result rules on a given dataset (a transaction dataset or another sequence-pattern-type dataset) that satisfy a predefined minimum support count s and a predefined minimum confidence c. X → Y is an association rule where X ∩ Y = ∅; that is, X and Y are disjoint. The interestingness of the rule is measured by its support and confidence: support means the frequency with which this rule appears in the dataset, and confidence means the probability of the appearance of Y when X is present.
For association rules, the key measures of rule interestingness are rule support and confidence. Their relationship is given as follows:
support(X → Y) = support_count(X ∪ Y) / |D|
confidence(X → Y) = support_count(X ∪ Y) / support_count(X)
Here, support_count(X) is the number of itemsets in the dataset that contain X. As a convention, the confidence and support values are represented as percentages between 0 and 100.
The association rule X → Y is strong once support(X → Y) ≥ s and confidence(X → Y) ≥ c, where s is the predefined minimum support threshold and c is the predefined minimum confidence threshold.
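A minimal base-R sketch of these two measures (the toy transactions and helper names are illustrative):

```r
transactions <- list(
  c("beer", "chips"),
  c("beer", "chips", "wine"),
  c("beer", "wine"),
  c("chips")
)
support_count <- function(X, D) sum(vapply(D, function(t) all(X %in% t), logical(1)))

# support(X => Y) = support_count(X union Y) / |D|
rule_support <- function(X, Y, D) support_count(union(X, Y), D) / length(D)
# confidence(X => Y) = support_count(X union Y) / support_count(X)
rule_confidence <- function(X, Y, D) support_count(union(X, Y), D) / support_count(X, D)

rule_support("beer", "chips", transactions)     # 0.5
rule_confidence("beer", "chips", transactions)  # 2/3
```

With s = 0.4 and c = 0.6, the rule beer → chips would count as strong on this toy data.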
The meaning of the found association rules should be interpreted with caution, especially when there is not enough evidence to judge whether a rule implies causality; a rule only shows the co-occurrence of its prefix and postfix. The following are the different kinds of rules you can come across:
- A rule is a Boolean association rule if it contains associations between the presence of items
- A rule is a single-dimensional association rule if there is, at most, only one dimension referred to in the rule
- A rule is a multidimensional association rule if there are at least two dimensions referred to in the rule
- A rule is a correlation-association rule if the relations are measured by statistical correlation; rules that pass the correlation test are correlation rules
- A rule is a quantitative-association rule if at least one item or attribute contained in it is quantitative
In some situations, the support and confidence pairs are not sufficient to filter uninteresting association rules. In such a case, we will use support count, confidence, and correlations to filter association rules.
There are many methods to calculate the correlation of an association rule, such as all-confidence analysis and cosine. For a k-itemset X = {i1, i2, ..., ik}, the all-confidence value of X is defined as:
all_confidence(X) = support_count(X) / max{support_count({i}) : i ∈ X}
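Under that definition, all-confidence can be sketched in base R as follows (toy data; the helper names are illustrative):

```r
transactions <- list(
  c("beer", "chips"),
  c("beer", "chips", "wine"),
  c("beer", "wine"),
  c("chips")
)
support_count <- function(X, D) sum(vapply(D, function(t) all(X %in% t), logical(1)))

# all_confidence(X) = support_count(X) / max over items i in X of support_count({i})
all_confidence <- function(X, D) {
  support_count(X, D) / max(vapply(X, function(i) support_count(i, D), numeric(1)))
}

all_confidence(c("beer", "chips"), transactions)  # 2/3
```

All-confidence is the minimum confidence over all rules derivable from X, which makes it a stricter filter than the confidence of any single rule.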
Market basket analysis is the methodology used to mine a shopping cart of items bought or just those kept in the cart by customers. The concept is applicable to a variety of applications, especially for store operations. The source dataset is a massive data record. The aim of market basket analysis is to find the association rules between the items within the source dataset.
The market basket model is a model that illustrates the relation between a basket and its associated items. Many tasks from different areas of research have this relation in common. To summarize them all, the market basket model is suggested as the most typical example to be researched.
The basket is also known as the transaction set; it contains itemsets, that is, sets of items that belong to the same transaction.
The A-Priori algorithm is a level-wise itemset mining algorithm. The Eclat algorithm is an itemset mining algorithm based on tidset intersection, in contrast to A-Priori. FP-growth is a frequent-pattern-tree algorithm. A tidset denotes the collection of IDs of the transaction records that contain a given itemset.
As a common strategy to design algorithms, the problem is divided into two subproblems:
- The frequent itemset generation
- Rule generation
The strategy dramatically decreases the search space for association mining algorithms.
As the input of the A-Priori algorithm, the original input itemset is binarized, that is, 1 represents the presence of a certain item in the itemset; otherwise, it is 0. As a default assumption, the average size of the itemset is small. The popular preprocessing method is to map each unique available item in the input dataset to a unique integer ID.
The itemsets are usually stored within databases or files and will go through several passes. To control the efficiency of the algorithm, we need to control the number of passes. During each pass, a representation format for every itemset of interest is required so that its count can be stored for further use by the algorithm.
There is a monotonicity feature in the itemsets under research; this implies that every subset of a frequent itemset is frequent. This characteristic is used to prune the search space for the frequent itemset in the process of the A-Priori algorithm. It also helps compact the information related to the frequent itemset. This feature gives us an intrinsic view that focuses on smaller-sized frequent itemsets. For example, there are three frequent 2-itemsets contained by one certain frequent 3-itemset.
The basket is in a format called the horizontal format and contains a basket or transaction ID and a number of items; it is used as the basic input format for the A-Priori algorithm. In contrast, there is another format known as the vertical format; this uses an item ID and a series of the transaction IDs. The algorithm that works on vertical data format is left as an exercise for you.
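The conversion between the two formats is mechanical; here is a small base-R sketch using the beer/chips/pizza/wine dataset that appears later in this chapter:

```r
# Horizontal format: one transaction per entry, keyed by transaction ID.
horizontal <- list(
  "1" = c("beer", "chips", "wine"),
  "2" = c("beer", "chips"),
  "3" = c("pizza", "wine"),
  "4" = c("chips", "pizza")
)

# Vertical format: for each item, the set of transaction IDs containing it.
to_vertical <- function(h) {
  items <- sort(unique(unlist(h)))
  setNames(
    lapply(items, function(i) names(h)[vapply(h, function(t) i %in% t, logical(1))]),
    items
  )
}

vertical <- to_vertical(horizontal)
vertical$chips  # "1" "2" "4"
```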
Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.
- Join action: Given that L_k is the set of frequent k-itemsets, a set of candidate (k+1)-itemsets is generated by joining L_k with itself. Let's call it C_(k+1).
- Prune action: The size of C_(k+1), the candidate itemset, is usually much bigger than that of L_(k+1); to save computation cost, the monotonicity characteristic of frequent itemsets is used here to prune the size of C_(k+1).
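The join and prune actions can be sketched together in base R; this hypothetical `apriori_gen` (a simplified stand-in, not the helper used in the pseudocode) represents itemsets as sorted character vectors:

```r
apriori_gen <- function(Lk) {
  k <- length(Lk[[1]])
  candidates <- list()
  for (i in seq_along(Lk)) {
    for (j in seq_along(Lk)) {
      if (j <= i) next
      a <- Lk[[i]]; b <- Lk[[j]]
      # join: merge two k-itemsets that share their first k-1 items
      if (k == 1 || identical(a[seq_len(k - 1)], b[seq_len(k - 1)])) {
        cand <- sort(union(a, b))
        if (length(cand) != k + 1) next
        # prune: by monotonicity, every k-subset of the candidate must be frequent
        ok <- all(vapply(seq_len(k + 1), function(d)
          any(vapply(Lk, function(l) identical(sort(l), cand[-d]), logical(1))),
          logical(1)))
        if (ok) candidates[[length(candidates) + 1]] <- cand
      }
    }
  }
  candidates
}

L2 <- list(c("I1", "I2"), c("I1", "I3"), c("I2", "I3"))
apriori_gen(L2)  # one candidate: c("I1", "I2", "I3")
```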
Here is the pseudocode to find all the frequent itemsets:
The R code of the A-Priori frequent itemset generation algorithm follows. D is a transaction dataset; suppose MIN_SUP is the minimum support count threshold. The output of the algorithm is L, the set of frequent itemsets in D.
The output of the A-Priori function can be verified with the R add-on package arules, which is a pattern-mining and association-rules-mining package that includes the A-Priori and Eclat algorithms. Here is the R code:
```r
Apriori <- function(data, I, MIN_SUP, parameter = NULL) {
  f <- CreateItemsets()
  c <- FindFrequentItemset(data, I, 1, MIN_SUP)
  k <- 2
  len4data <- GetDatasetSize(data)
  while (!IsEmpty(c[[k - 1]])) {
    f[[k]] <- AprioriGen(c[[k - 1]])
    for (idx in 1:len4data) {
      ft <- GetSubSet(f[[k]], data[[idx]])
      len4ft <- GetDatasetSize(ft)
      for (jdx in 1:len4ft) {
        IncreaseSupportCount(f[[k]], ft[jdx])
      }
    }
    c[[k]] <- FindFrequentItemset(f[[k]], I, k, MIN_SUP)
    k <- k + 1
  }
  c
}
```
To verify the R code, the arules package is applied and the two outputs are compared.
Given the dataset D, we first sort the items of each transaction into an ordered list, using either a predefined order or simply the natural order of characters, which is used here.
Let's assume that the minimum support count is 5; the following table is an input dataset:
| tid (transaction ID) | List of items in the itemset or transaction |
|---|---|
| T001 | |
| T002 | |
| T003 | |
| T004 | |
| T005 | |
| T006 | |
| T007 | |
| T008 | |
| T009 | |
| T010 | |
In the first scan or pass of the dataset D, we get the count of each candidate 1-itemset. The candidate itemsets and their related counts are:
| Itemset | Support count |
|---|---|
| | 6 |
| | 8 |
| | 2 |
| | 5 |
| | 2 |
| | 3 |
| | 3 |
We will get the frequent 1-itemsets after comparing each support count with the minimum support count:
| Itemset | Support count |
|---|---|
| | 6 |
| | 8 |
| | 5 |
We will then generate the candidate 2-itemsets by joining the frequent 1-itemsets and counting their support:
| Itemset | Support count |
|---|---|
| | 4 |
| | 3 |
| | 4 |
After comparing these support counts with the minimum support count, no candidate 2-itemset remains frequent. The algorithm then terminates.
The A-Priori algorithm scans the dataset as many times as the maximal length of the pattern. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, for example, using <item id, tid set> instead of <tid, item id set>; that is, with the input data in the vertical format, it discovers frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1)-itemsets from k-itemsets.
The candidate itemset is generated by set intersection. The vertical format structure is called a tidset, as defined earlier. If all the transaction IDs related to an item are stored in a vertical-format set, then that set is the tidset of the specific item.
The support count is computed by the intersection between tidsets. Given two itemsets X and Y with tidsets t(X) and t(Y), the support count of X ∪ Y is the cardinality of t(X) ∩ t(Y); in other words, t(X ∪ Y) = t(X) ∩ t(Y).
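The tidset intersection at the heart of Eclat is essentially one line of base R; the tidsets below are those of this chapter's beer/chips/pizza/wine example:

```r
# Vertical format: each item mapped to its tidset.
tidsets <- list(
  beer  = c(1, 2),
  chips = c(1, 2, 4),
  pizza = c(3, 4),
  wine  = c(1, 3)
)

# tidset of {beer, chips} = t(beer) intersect t(chips); support = its cardinality
beer_chips <- intersect(tidsets$beer, tidsets$chips)
beer_chips          # 1 2
length(beer_chips)  # support count 2, frequent when MIN_SUP = 2
```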
Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty and p is the set of frequent 1-itemsets:
```r
Eclat <- function(p, f, MIN_SUP) {
  len4tidsets <- length(p)
  for (idx in 1:len4tidsets) {
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for (jdx in idx:len4tidsets) {
      if (ItemCompare(p[[jdx]], p[[idx]]) > 0) {
        xab <- MergeTidSets(p[[idx]], p[[jdx]])
        if (GetSupport(xab) >= MIN_SUP) {
          AddFrequentItemset(pa, xab, GetSupport(xab))
        }
      }
    }
    if (!IsEmptyTidSets(pa)) {
      Eclat(pa, f, MIN_SUP)
    }
  }
}
```
Here is the running result of one example with I = {beer, chips, pizza, wine}. The transaction dataset in horizontal and vertical formats, respectively, is shown in the following tables:
| tid | X |
|---|---|
| 1 | {beer, chips, wine} |
| 2 | {beer, chips} |
| 3 | {pizza, wine} |
| 4 | {chips, pizza} |

| x | tidset |
|---|---|
| beer | {1, 2} |
| chips | {1, 2, 4} |
| pizza | {3, 4} |
| wine | {1, 3} |
The binary format of this information is in the following table.
| tid | beer | chips | pizza | wine |
|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 |
| 4 | 0 | 1 | 1 | 0 |
Before calling the Eclat algorithm, we will set MIN_SUP=2 and initialize p with the tidsets of the frequent 1-itemsets.
The running process is illustrated in the following figure. After two iterations, we will get the frequent tidsets: <beer, {1, 2}>, <chips, {1, 2, 4}>, <pizza, {3, 4}>, <wine, {1, 3}>, and <{beer, chips}, {1, 2}>:
The output of the Eclat function can be verified with the R add-on package, arules.
The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here; the pattern-growth strategy is used instead, with the FP-tree as its data structure.
The data structure used is a hybrid of the vertical and horizontal formats; all the transaction itemsets are stored within a tree structure, called a frequent pattern tree (FP-tree). Here is an example of the generation of the structure with I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the upcoming image. Each node in the FP-tree represents an item, and the path from the root to that node represents an itemset; the support information of this itemset is stored in the node along with the item itself.
| tid | X |
|---|---|
| 1 | {A, B, C, D, E} |
| 2 | {A, B, C, E} |
| 3 | {A, D, E} |
| 4 | {B, E, D} |
| 5 | {B, E, C} |
| 6 | {E, C, D} |
| 7 | {E, D} |
The sorted item order is listed in the following table:
| item | E | D | C | B | A |
|---|---|---|---|---|---|
| support_count | 7 | 5 | 4 | 4 | 3 |
Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:
| tid | X |
|---|---|
| 1 | {E, D, C, B, A} |
| 2 | {E, C, B, A} |
| 3 | {E, D, A} |
| 4 | {E, D, B} |
| 5 | {E, C, B} |
| 6 | {E, D, C} |
| 7 | {E, D} |
The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.
The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.
Given MIN_SUP=3, the following itemsets can be processed according to this logic:
The results after inserting the fourth and the seventh itemsets are listed here; the process of the algorithm is simple and straightforward:
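The preprocessing described above (count, filter by MIN_SUP, reorder) can be sketched in base R with this chapter's seven-transaction dataset; note that ties in the support count (here B and C, both 4) may be broken either way:

```r
D <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)
MIN_SUP <- 3

# Count each item's support and sort in decreasing order of support.
item_counts <- sort(table(unlist(D)), decreasing = TRUE)
freq_items <- names(item_counts)[item_counts >= MIN_SUP]

# Reorder every transaction by the global item order, dropping infrequent items.
D_sorted <- lapply(D, function(t) freq_items[freq_items %in% t])
D_sorted[[3]]  # "E" "D" "A"
```

The sorted transactions are then inserted into the FP-tree one by one, incrementing the counts along each shared prefix path.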
A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.
The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.
Here is the R source code of the main FP-growth algorithm:
```r
FPGrowth <- function(r, p, f, MIN_SUP) {
  RemoveInfrequentItems(r)
  if (IsPath(r)) {
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for (idx in 1:len4y) {
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, support_count(x))
    }
  } else {
    len4r <- GetLength(r)
    for (idx in 1:len4r) {
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for (jdx in 1:len4path) {
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      if (!IsEmpty(rx)) {
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}
```
The GenMax algorithm is used to mine maximal frequent itemsets (MFI), to which the maximality properties are applied; that is, extra steps are added to check for maximal frequent itemsets instead of only frequent itemsets. It is based partially on the tidset intersection from the Eclat algorithm. The diffset, or differential set as it is also known, is used for fast frequency testing; it is the difference between the tidsets of two corresponding items.
The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.
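That maintenance rule can be sketched directly in base R (a hypothetical helper, greatly simplified from the full GenMax bookkeeping):

```r
is_subset <- function(a, b) all(a %in% b)

# Update the MFI set M with a newly found frequent itemset Y:
# discard Y if some X in M is a superset of Y; drop any X that is a subset of Y.
update_mfi <- function(M, Y) {
  if (any(vapply(M, function(X) is_subset(Y, X), logical(1)))) return(M)
  M <- M[!vapply(M, function(X) is_subset(X, Y), logical(1))]
  c(M, list(Y))
}

M <- list(c("A", "B", "C"))
length(update_mfi(M, c("A", "B")))         # 1: {A, B} is subsumed, M unchanged
update_mfi(M, c("A", "B", "C", "D"))[[1]]  # {A, B, C, D} replaces {A, B, C}
```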
Before calling the GenMax algorithm, p is initialized with the frequent 1-itemsets and their tidsets, and m with the empty set, where D is the input transaction dataset.
Here is the R source code of the main GenMax algorithm:
```r
GenMax <- function(p, m, MIN_SUP) {
  y <- GetItemsetUnion(p)
  if (SuperSetExists(m, y)) {
    return()
  }
  len4p <- GetLength(p)
  for (idx in 1:len4p) {
    q <- GenerateFrequentTidSet()
    for (jdx in (idx + 1):len4p) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        AddFrequentItemset(q, xij, GetSupport(xij))
      }
    }
    if (!IsEmpty(q)) {
      GenMax(q, m, MIN_SUP)
    } else if (!SuperSetExists(m, p[[idx]])) {
      Add2MFI(m, p[[idx]])
    }
  }
}
```
Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support, which in turn allows us to prune redundant frequent patterns. The Charm algorithm also uses vertical tidset intersection for efficient closure checks.
Before calling the Charm algorithm, p is initialized with the frequent 1-itemsets and their tidsets, and c, the set of closed frequent itemsets, with the empty set, where D is the input transaction dataset.
Here is the R source code of the main algorithm:
```r
Charm <- function(p, c, MIN_SUP) {
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for (idx in 1:len4p) {
    q <- GenerateFrequentTidSet()
    for (jdx in (idx + 1):len4p) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        if (IsSameTidSets(p, idx, jdx)) {
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        } else {
          if (IsSuperSet(p[[idx]], p[[jdx]])) {
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          } else {
            Add2CFI(q, xij)
          }
        }
      }
    }
    if (!IsEmpty(q)) {
      Charm(q, c, MIN_SUP)
    }
    if (!IsSuperSetExists(c, p[[idx]])) {
      Add2CFI(c, p[[idx]])
    }
  }
}
```
During the A-Priori frequent itemset generation process, the support count of each frequent itemset is calculated and recorded for the subsequent association rules mining.
To generate an association rule X → Y from a frequent itemset l, two steps are needed:
- First, get all the nonempty subsets of l.
- Then, for each nonempty proper subset X of l, with Y = l − X, the rule X → Y is a strong association rule only if confidence(X → Y) ≥ c. The support count of any rule derived from a frequent itemset is not less than the minimum support count.
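These two steps can be sketched in base R (toy data; `combn` enumerates the nonempty proper subsets used as rule antecedents):

```r
transactions <- list(
  c("beer", "chips"),
  c("beer", "chips", "wine"),
  c("beer", "wine"),
  c("chips")
)
support_count <- function(X, D) sum(vapply(D, function(t) all(X %in% t), logical(1)))

# For a frequent itemset l, emit every rule X => (l - X) with confidence >= MIN_CONF.
generate_rules <- function(l, D, MIN_CONF) {
  rules <- list()
  for (m in 1:(length(l) - 1)) {
    for (X in combn(l, m, simplify = FALSE)) {
      conf <- support_count(l, D) / support_count(X, D)
      if (conf >= MIN_CONF) {
        rules[[length(rules) + 1]] <- list(lhs = X, rhs = setdiff(l, X), confidence = conf)
      }
    }
  }
  rules
}

rules <- generate_rules(c("beer", "chips"), transactions, MIN_CONF = 0.6)
length(rules)  # 2: beer => chips and chips => beer, each with confidence 2/3
```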
The R code of the algorithm that generates the A-Priori association rules is as follows:

```r
AprioriGenerateRules <- function(D, F, MIN_SUP, MIN_CONF) {
  # create an empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for (idx in 1:len4f) {
    # add the rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while (!IsEmptyItemSets(c[[k]])) {
      # get the heads of the confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k + 1]] <- CreateItemSets()
      # get the candidate heads
      len4hk <- length(h[[k]])
      for (jdx in 1:(len4hk - 1)) {
        if (Match4Itemsets(h[[k]][jdx], h[[k]][jdx + 1])) {
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx + 1][k])
          if (IsSubset2Itemsets(h[[k]], tempItemset)) {
            Append2ItemSets(c[[k + 1]], tempItemset)
          }
        }
      }
      k <- k + 1
    }
    # append all new association rules to the rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}
```
To verify the R code, the arules and Rattle packages are applied and the outputs compared.
7 |
5 |
4 |
4 |
3 |
Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:
tid |
X |
---|---|
1 |
{E, D, C, B, A} |
2 |
{E, C, B, A} |
3 |
{E, D, A} |
4 |
{E, D, B} |
5 |
{E, C, B} |
6 |
{E, D, C} |
7 |
{E, D} |
The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.
The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.
Given MIN_SUP=3, the following itemsets can be processed according to this logic:
The result after performing steps 4 and 7 are listed here, and the process of the algorithm is very simple and straight forward:
A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.
The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.
Here is the R source code of the main FP-growth algorithm:
FPGrowth <- function (r,p,f,MIN_SUP){ RemoveInfrequentItems(r) if(IsPath(r)){ y <- GetSubset(r) len4y <- GetLength(y) for(idx in 1:len4y){ x <- MergeSet(p,y[idx]) SetSupportCount(x, GetMinCnt(x)) Add2Set(f,x,support_count(x)) } }else{ len4r <- GetLength(r) for(idx in 1:len4r){ x <- MergeSet(p,r[idx]) SetSupportCount(x, GetSupportCount(r[idx])) rx <- CreateProjectedFPTree() path4idx <- GetAllSubPath(PathFromRoot(r,idx)) len4path <- GetLength(path4idx) for( jdx in 1:len4path ){ CountCntOnPath(r, idx, path4idx, jdx) InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx)) } if( !IsEmpty(rx) ){ FPGrowth(rx,x,f,MIN_SUP) } } } }
The GenMax algorithm is used to mine maximal frequent itemset (MFI) to which the maximality properties are applied, that is, more steps are added to check the maximal frequent itemsets instead of only frequent itemsets. This is based partially on the tidset intersection from the Eclat algorithm. The diffset, or the differential set as it is also known, is used for fast frequency testing. It is the difference between two tidsets of the corresponding items.
The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.
Here is the pseudocode before calling the GenMax algorithm, , where D is the input transaction dataset.
Here is the R source code of the main GenMax algorithm:
GenMax <- function (p,m,MIN_SUP){ y <- GetItemsetUnion(p) if( SuperSetExists(m,y) ){ return } len4p <- GetLenght(p) for(idx in 1:len4p){ q <- GenerateFrequentTidSet() for(jdx in (idx+1):len4p){ xij <- MergeTidSets(p[[idx]],p[[jdx]]) if(GetSupport(xij)>=MIN_SUP){ AddFrequentItemset(q,xij,GetSupport(xij)) } } if( !IsEmpty(q) ){ GenMax(q,m,MIN_SUP) }else if( !SuperSetExists(m,p[[idx]]) ){ Add2MFI(m,p[[idx]]) } } }
Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
Here is the pseudocode before calling the Charm algorithm, , where D is the input transaction dataset.
Here is the R source code of the main algorithm:
Charm <- function (p,c,MIN_SUP){ SortBySupportCount(p) len4p <- GetLength(p) for(idx in 1:len4p){ q <- GenerateFrequentTidSet() for(jdx in (idx+1):len4p){ xij <- MergeTidSets(p[[idx]],p[[jdx]]) if(GetSupport(xij)>=MIN_SUP){ if( IsSameTidSets(p,idx,jdx) ){ ReplaceTidSetBy(p,idx,xij) ReplaceTidSetBy(q,idx,xij) RemoveTidSet(p,jdx) }else{ if( IsSuperSet(p[[idx]],p[[jdx]]) ){ ReplaceTidSetBy(p,idx,xij) ReplaceTidSetBy(q,idx,xij) }else{ Add2CFI(q,xij) } } } } if( !IsEmpty(q) ){ Charm(q,c,MIN_SUP) } if( !IsSuperSetExists(c,p[[idx]]) ){ Add2CFI(m,p[[idx]]) } } }
During the process of generating an algorithm for A-Priori frequent itemsets, the support count of each frequent itemset is calculated and recorded for further association rules mining processing, that is, for association rules mining.
To generate an association rule , l
is a frequent itemset. Two steps are needed:
- First to get all the nonempty subsets of l.
- Then, for subset X of l, , the rule is a strong association rule only if The support count of any rule of a frequent itemset is not less than the minimum support count.
R code of the algorithm to generate A-Priori association is as follows:
Here is the R source code of the main algorithm:AprioriGenerateRules <- function (D,F,MIN_SUP,MIN_CONF){ #create empty rule set r <- CreateRuleSets() len4f <- length(F) for(idx in 1:len4f){ #add rule F[[idx]] => {} AddRule2RuleSets(r,F[[idx]],NULL) c <- list() c[[1]] <- CreateItemSets(F[[idx]]) h <- list() k <-1 while( !IsEmptyItemSets(c[[k]]) ){ #get heads of confident association rule in c[[k]] h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]],D,MIN_CONF) c[[k+1]] <- CreateItemSets() #get candidate heads len4hk <- length(h[[k]]) for(jdx in 1:(len4hk-1)){ if( Match4Itemsets(h[[k]][jdx], h[[k]][jdx+1]) ){ tempItemset <- CreateItemset (h[[k]][jdx],h[[k]][jdx+1][k]) if( IsSubset2Itemsets(h[[k]], tempItemset) ){ Append2ItemSets(c[[k+1]], tempItemset) } } } } #append all new association rules to rule set AddRule2RuleSets(r,F[[idx]],h) } r }
To verify the R code, Arules and Rattle packages are applied while verifying the output.
A-Priori algorithms
As a common strategy to design algorithms, the problem is divided into two subproblems:
- The frequent itemset generation
- Rule generation
The strategy dramatically decreases the search space for association mining algorithms.
Input data characteristics and data structure
As the input of the A-Priori algorithm, the original input itemsets are binarized: 1 represents the presence of a certain item in a transaction, and 0 its absence. By default, the average size of an itemset is assumed to be small. A popular preprocessing step is to map each unique item in the input dataset to a unique integer ID.
The itemsets are usually stored in databases or files, and the algorithm makes several passes over them. The number of passes must be controlled to keep the algorithm efficient, and during each pass a compact representation of every candidate itemset of interest is needed so that its support can be counted and stored for the later stages of the algorithm.
The itemsets under consideration have a monotonicity property: every subset of a frequent itemset is itself frequent. This characteristic is used to prune the search space of the A-Priori algorithm, and it helps compact the information related to the frequent itemsets. It also focuses attention on smaller frequent itemsets; for example, any frequent 3-itemset contains three frequent 2-itemsets.
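The counting behind that last observation can be checked with a couple of lines of base R; the function name here is our own, not part of the chapter's code:

```r
# A frequent 3-itemset implies three frequent 2-subsets (A-Priori property).
subsets_of_size <- function(itemset, k) {
  # combn() enumerates all k-element subsets of the itemset
  combn(itemset, k, simplify = FALSE)
}

two_subsets <- subsets_of_size(c("A", "B", "E"), 2)
length(two_subsets)   # 3
```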
A basket, in the so-called horizontal format, contains a basket (transaction) ID together with a number of items; this is the basic input format for the A-Priori algorithm. In contrast, the vertical format stores an item ID together with the series of IDs of the transactions that contain it. An algorithm that works on the vertical data format is left as an exercise for you.
Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.
- Join action: given Lk, the set of frequent k-itemsets, a set of candidate (k+1)-itemsets, Ck+1, is generated by joining Lk with itself.
- Prune action: the size of Ck+1, the candidate itemset, is usually much bigger than that of Lk+1; to save computation cost, the monotonicity characteristic of frequent itemsets is used here to prune from Ck+1 every candidate that has an infrequent k-subset.
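The two actions can be sketched in runnable base R as follows; the function name `apriori_gen` and the list-of-sorted-character-vectors representation are our own assumptions, not the chapter's helpers:

```r
# Sketch of the join and prune actions over frequent k-itemsets.
# Lk: list of frequent k-itemsets, each a sorted character vector.
apriori_gen <- function(Lk) {
  k <- length(Lk[[1]])
  candidates <- list()
  for (i in seq_along(Lk)) {
    for (j in seq_along(Lk)) {
      if (j <= i) next
      a <- Lk[[i]]; b <- Lk[[j]]
      # join: merge itemsets that share their first k-1 items
      if (k == 1 || identical(a[1:(k - 1)], b[1:(k - 1)])) {
        cand <- sort(union(a, b))
        # prune: every k-subset of the candidate must itself be in Lk
        subs <- combn(cand, k, simplify = FALSE)
        if (all(sapply(subs, function(s) any(sapply(Lk, identical, s))))) {
          candidates[[length(candidates) + 1]] <- cand
        }
      }
    }
  }
  candidates
}

apriori_gen(list("A", "B", "E"))   # candidate 2-itemsets {A,B}, {A,E}, {B,E}
```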
Here is the R code of the A-Priori frequent itemset generation algorithm. D is a transaction dataset, MIN_SUP is the minimum support count threshold, and the output of the algorithm is L, the set of frequent itemsets in D. The output of the Apriori function can be verified with the R add-on package arules, a pattern-mining and association-rules-mining package that includes both the A-Priori and Eclat algorithms:
Apriori <- function(data, I, MIN_SUP, parameter = NULL) {
  f <- CreateItemsets()                # candidate itemsets, per level
  c <- list()                          # frequent itemsets, per level
  c[[1]] <- FindFrequentItemset(data, I, 1, MIN_SUP)
  k <- 2
  len4data <- GetDatasetSize(data)
  while (!IsEmpty(c[[k - 1]])) {
    f[[k]] <- AprioriGen(c[[k - 1]])   # join and prune
    for (idx in 1:len4data) {
      # candidates contained in this transaction
      ft <- GetSubSet(f[[k]], data[[idx]])
      len4ft <- GetDatasetSize(ft)
      for (jdx in 1:len4ft) {
        IncreaseSupportCount(f[[k]], ft[jdx])
      }
    }
    c[[k]] <- FindFrequentItemset(f[[k]], I, k, MIN_SUP)
    k <- k + 1
  }
  c
}
To verify the R code, the arules package can be applied and the outputs compared.
Given:
First, we sort the items of each transaction in D by a predefined ordering algorithm, or simply the natural order of the characters, which is used here. Then:
Let's assume that the minimum support count is 5; the following table is an input dataset:
tid (transaction id) | List of items in the itemset or transaction
---|---
T001 |
T002 |
T003 |
T004 |
T005 |
T006 |
T007 |
T008 |
T009 |
T010 |
In the first scan (pass) of the dataset D, count each candidate 1-itemset. The candidate 1-itemset set C1 and the related support counts are:
Itemset | Support count
---|---
 | 6
 | 8
 | 2
 | 5
 | 2
 | 3
 | 3
We will get the frequent 1-itemset set L1 after comparing each support count with the minimum support count:
Itemset | Support count
---|---
 | 6
 | 8
 | 5
We will then generate the candidate 2-itemset set C2 by joining L1 with itself and counting the support of each candidate:
Itemset | Support count
---|---
 | 4
 | 3
 | 4
After comparing the support counts with the minimum support count, we will get an empty L2. The algorithm then terminates.
The A-Priori algorithm scans the dataset as many times as the length of the longest frequent pattern. This is the motivation for the Equivalence CLAss Transformation (Eclat) algorithm, which explores the vertical data format, for example using <item id, tid set> instead of <tid, item id set>, to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm, to get frequent (k+1)-itemsets from k-itemsets.
The candidate itemsets are generated by set intersection. The vertical-format structure is called a tidset: the tidset of a specific item I is the set of IDs of all the transactions that contain I.
The support count of a candidate is computed by the intersection between tidsets: given two itemsets X and Y with tidsets t(X) and t(Y), the support count of X ∪ Y is the cardinality of t(X) ∩ t(Y).
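A minimal base-R sketch of this support computation, with the tidsets of beer and chips borrowed from the example that follows:

```r
# Support of an itemset in the vertical format: cardinality of the
# intersection of the member items' tidsets.
tidset <- list(beer = c(1, 2), chips = c(1, 2, 4))

support <- function(x, y) length(intersect(x, y))

support(tidset$beer, tidset$chips)   # |{1,2} ∩ {1,2,4}| = 2
```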
Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f
is set to empty, and p
is the set of frequent 1-itemsets:
Eclat <- function(p, f, MIN_SUP) {
  len4tidsets <- length(p)
  for (idx in 1:len4tidsets) {
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for (jdx in idx:len4tidsets) {
      if (ItemCompare(p[[jdx]], p[[idx]]) > 0) {
        # intersect the two tidsets to form a candidate
        xab <- MergeTidSets(p[[idx]], p[[jdx]])
        if (GetSupport(xab) >= MIN_SUP) {
          AddFrequentItemset(pa, xab, GetSupport(xab))
        }
      }
    }
    if (!IsEmptyTidSets(pa)) {
      Eclat(pa, f, MIN_SUP)
    }
  }
}
Here is the running result of one example with I = {beer, chips, pizza, wine}. The transaction dataset in horizontal and vertical formats, respectively, is shown in the following tables:
tid | X
---|---
1 | {beer, chips, wine}
2 | {beer, chips}
3 | {pizza, wine}
4 | {chips, pizza}

x | tidset
---|---
beer | {1, 2}
chips | {1, 2, 4}
pizza | {3, 4}
wine | {1, 3}
The binary format of this information is in the following table.
tid | beer | chips | pizza | wine
---|---|---|---|---
1 | 1 | 1 | 0 | 1
2 | 1 | 1 | 0 | 0
3 | 0 | 0 | 1 | 1
4 | 0 | 1 | 1 | 0
Before calling the Eclat algorithm, we will set MIN_SUP=2; f starts empty, and p holds the frequent 1-itemsets with their tidsets. The running process is illustrated in the following figure. After two iterations, we will get all the frequent tidsets, {<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>, <{beer, chips}, 12>}:
The output of the Eclat function can be verified with the R add-on package, arules.
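The same result can also be cross-checked with a few lines of base R, using the tidsets from the tables above:

```r
# Tidsets taken from the vertical-format and binary tables above
tidsets <- list(beer = c(1, 2), chips = c(1, 2, 4),
                pizza = c(3, 4), wine = c(1, 3))
MIN_SUP <- 2

# Every single item is frequent: each tidset has at least MIN_SUP entries
sapply(tidsets, length)

# Among all 2-itemsets, only {beer, chips} reaches MIN_SUP
pairs <- combn(names(tidsets), 2, simplify = FALSE)
freq2 <- Filter(function(p) {
  length(intersect(tidsets[[p[1]]], tidsets[[p[2]]])) >= MIN_SUP
}, pairs)
freq2   # list(c("beer", "chips")), with tidset {1, 2}
```

This matches the five frequent tidsets reported above: the four single items plus {beer, chips}.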
The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.
The data structure used is a hybrid of the vertical and horizontal formats; all the transaction itemsets are stored within a tree structure, called a frequent pattern tree (FP-tree). Here is an example of the generation of this structure with I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the following image. Each node in the FP-tree represents an item, and the path from the root to that node represents an itemset; the support information of this itemset is stored in the node along with the item.
tid |
X |
---|---|
1 |
{A, B, C, D, E} |
2 |
{A, B, C, E} |
3 |
{A, D, E} |
4 |
{B, E, D} |
5 |
{B, E, C} |
6 |
{E, C, D} |
7 |
{E, D} |
The sorted item order is listed in the following table:
item | E | D | C | B | A
---|---|---|---|---|---
support_count | 7 | 5 | 4 | 4 | 3
Reorder each transaction in the dataset by this new decreasing order to get the new sorted transaction dataset, as shown in this table:
tid |
X |
---|---|
1 |
{E, D, C, B, A} |
2 |
{E, C, B, A} |
3 |
{E, D, A} |
4 |
{E, D, B} |
5 |
{E, C, B} |
6 |
{E, D, C} |
7 |
{E, D} |
The FP-tree building process is illustrated in the following images, with each itemset added to the FP-tree in turn. The support information is updated at the same time; that is, the support count of every item on the path to the new node is incremented.
The most frequent items are placed near the top of the tree, which keeps the tree as compact as possible. To start building the FP-tree, the items are ordered by decreasing support count; then the infrequent items are removed from the sorted list, and each itemset in the original transaction dataset is reordered by this order.
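The reordering step can be sketched in base R (the function name is ours), with the item order taken from the support_count table above:

```r
# Decreasing-support item order from the table above
item_order <- c("E", "D", "C", "B", "A")

# Reorder a transaction so its items follow item_order
reorder_itemset <- function(x) x[order(match(x, item_order))]

reorder_itemset(c("B", "E", "D"))   # "E" "D" "B", as in row 4 of the table
```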
Given MIN_SUP=3, the following itemsets can be processed according to this logic:
The results after performing steps 4 and 7 are listed here; the process of the algorithm is simple and straightforward:
A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.
The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.
Here is the R source code of the main FP-growth algorithm:
FPGrowth <- function(r, p, f, MIN_SUP) {
  RemoveInfrequentItems(r)
  if (IsPath(r)) {
    # the tree is a single path: every subset of it is a frequent pattern
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for (idx in 1:len4y) {
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, GetSupportCount(x))
    }
  } else {
    len4r <- GetLength(r)
    for (idx in 1:len4r) {
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      # build the projected FP-tree for this item
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for (jdx in 1:len4path) {
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      if (!IsEmpty(rx)) {
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}
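Because the helper functions above are pseudocode, here is a small self-contained sketch (our own naming, not the chapter's) of just the FP-tree insertion step, using environments so that the counts can be updated in place; the seven sorted transactions are the ones from the table above:

```r
# Minimal FP-tree: each node is an environment holding a count and a
# named list of children; items are assumed pre-sorted by decreasing support.
new_node <- function() {
  node <- new.env(parent = emptyenv())
  node$count <- 0
  node$children <- list()
  node
}

insert_itemset <- function(node, items) {
  for (item in items) {
    if (is.null(node$children[[item]])) {
      node$children[[item]] <- new_node()
    }
    node <- node$children[[item]]
    node$count <- node$count + 1   # increment support along the path
  }
}

root <- new_node()
sortedD <- list(c("E", "D", "C", "B", "A"), c("E", "C", "B", "A"),
                c("E", "D", "A"), c("E", "D", "B"), c("E", "C", "B"),
                c("E", "D", "C"), c("E", "D"))
for (t in sortedD) insert_itemset(root, t)

root$children$E$count              # 7: E lies on every path
root$children$E$children$D$count   # 5
```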
The GenMax algorithm is used to mine maximal frequent itemsets (MFIs); maximality checks are applied, that is, extra steps are added to check for maximal frequent itemsets instead of only frequent itemsets. It is based partially on the tidset intersection from the Eclat algorithm. The diffset, or differential set as it is also known, is used for fast frequency testing; it is the difference between the tidsets of two corresponding items.
The candidate MFI is determined by its definition: with M as the current set of MFIs, if some X in M is a superset of a newly found frequent itemset Y, then Y is discarded; conversely, if X is a subset of Y, then X is removed from M.
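This bookkeeping rule can be sketched in a few lines of base R (the helper names and the example itemsets are ours):

```r
# x is a subset of y when every element of x occurs in y
is_subset <- function(x, y) all(x %in% y)

# Maintain the MFI set M when a new frequent itemset Y is found:
# discard Y if some X in M is its superset; drop subsets of Y before adding it.
update_mfi <- function(M, Y) {
  if (any(sapply(M, function(X) is_subset(Y, X)))) return(M)  # Y discarded
  M <- Filter(function(X) !is_subset(X, Y), M)                # remove subsets
  c(M, list(Y))
}

M <- list(c("A", "B"))
M <- update_mfi(M, c("A", "B", "C"))  # {A,B} removed, {A,B,C} added
M <- update_mfi(M, c("B", "C"))       # discarded: subset of {A,B,C}
M
```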
Before calling the GenMax algorithm, m is set to empty, and p is initialized with the frequent 1-itemsets and their tidsets computed from the input transaction dataset D.
Here is the R source code of the main GenMax algorithm:
GenMax <- function(p, m, MIN_SUP) {
  y <- GetItemsetUnion(p)
  if (SuperSetExists(m, y)) {
    return()   # the whole branch is covered by a known MFI
  }
  len4p <- GetLength(p)
  for (idx in 1:len4p) {
    q <- GenerateFrequentTidSet()
    for (jdx in (idx + 1):len4p) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        AddFrequentItemset(q, xij, GetSupport(xij))
      }
    }
    if (!IsEmpty(q)) {
      GenMax(q, m, MIN_SUP)
    } else if (!SuperSetExists(m, p[[idx]])) {
      Add2MFI(m, p[[idx]])
    }
  }
}
Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets give the longest frequent patterns with a given support, which allows us to prune redundant frequent patterns. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
Before calling the Charm algorithm, c is set to empty, and p is initialized with the frequent 1-itemsets and their tidsets computed from the input transaction dataset D.
Here is the R source code of the main algorithm:
Charm <- function(p, c, MIN_SUP) {
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for (idx in 1:len4p) {
    q <- GenerateFrequentTidSet()
    for (jdx in (idx + 1):len4p) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        if (IsSameTidSets(p, idx, jdx)) {
          # identical tidsets: the two itemsets always occur together
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        } else {
          if (IsSuperSet(p[[idx]], p[[jdx]])) {
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          } else {
            Add2CFI(q, xij)
          }
        }
      }
    }
    if (!IsEmpty(q)) {
      Charm(q, c, MIN_SUP)
    }
    if (!IsSuperSetExists(c, p[[idx]])) {
      Add2CFI(c, p[[idx]])
    }
  }
}
During the A-Priori frequent itemset generation process, the support count of each frequent itemset is calculated and recorded for the subsequent association rules mining.
To generate an association rule X => (l − X), where l is a frequent itemset, two steps are needed:
- First, get all the nonempty proper subsets of l.
- Then, for each such subset X of l, the rule X => (l − X) is a strong association rule only if its confidence, support_count(l) / support_count(X), is not less than the minimum confidence. Because l itself is frequent, any such rule automatically satisfies the minimum support requirement.
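These two steps can be sketched in base R as follows; the itemset l = {A, B} and its support counts are illustrative assumptions, not values from the chapter:

```r
# Assumed example support counts, keyed by a sorted, comma-joined itemset
support_count <- function(itemset) {
  counts <- c("A" = 6, "B" = 7, "A,B" = 4)
  counts[[paste(sort(itemset), collapse = ",")]]
}

# Generate strong rules X => (l - X) from one frequent itemset l
generate_rules <- function(l, MIN_CONF) {
  rules <- list()
  for (k in 1:(length(l) - 1)) {
    for (X in combn(l, k, simplify = FALSE)) {       # nonempty proper subsets
      conf <- support_count(l) / support_count(X)    # confidence of the rule
      if (conf >= MIN_CONF) {
        rules[[length(rules) + 1]] <-
          list(lhs = X, rhs = setdiff(l, X), confidence = conf)
      }
    }
  }
  rules
}

rules <- generate_rules(c("A", "B"), MIN_CONF = 0.6)
length(rules)   # only A => B (confidence 4/6) passes; B => A (4/7) does not
```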
Here is the R source code of the algorithm that generates the A-Priori association rules:
AprioriGenerateRules <- function(D, F, MIN_SUP, MIN_CONF) {
  # create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for (idx in 1:len4f) {
    # add rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while (!IsEmptyItemSets(c[[k]])) {
      # get heads of confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k + 1]] <- CreateItemSets()
      # get candidate heads
      len4hk <- length(h[[k]])
      for (jdx in 1:(len4hk - 1)) {
        if (Match4Itemsets(h[[k]][jdx], h[[k]][jdx + 1])) {
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx + 1][k])
          if (IsSubset2Itemsets(h[[k]], tempItemset)) {
            Append2ItemSets(c[[k + 1]], tempItemset)
          }
        }
      }
      k <- k + 1   # advance to the next rule-head level
    }
    # append all new association rules to the rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}
To verify the R code, the arules and Rattle packages can be applied to check the output.
During the process of generating an algorithm for A-Priori frequent itemsets, the support count of each frequent itemset is calculated and recorded for further association rules mining processing, that is, for association rules mining.
To generate an association rule , l
is a frequent itemset. Two steps are needed:
- First to get all the nonempty subsets of l.
- Then, for subset X of l, , the rule is a strong association rule only if The support count of any rule of a frequent itemset is not less than the minimum support count.
R code of the algorithm to generate A-Priori association is as follows:
Here is the R source code of the main algorithm:AprioriGenerateRules <- function (D,F,MIN_SUP,MIN_CONF){ #create empty rule set r <- CreateRuleSets() len4f <- length(F) for(idx in 1:len4f){ #add rule F[[idx]] => {} AddRule2RuleSets(r,F[[idx]],NULL) c <- list() c[[1]] <- CreateItemSets(F[[idx]]) h <- list() k <-1 while( !IsEmptyItemSets(c[[k]]) ){ #get heads of confident association rule in c[[k]] h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]],D,MIN_CONF) c[[k+1]] <- CreateItemSets() #get candidate heads len4hk <- length(h[[k]]) for(jdx in 1:(len4hk-1)){ if( Match4Itemsets(h[[k]][jdx], h[[k]][jdx+1]) ){ tempItemset <- CreateItemset (h[[k]][jdx],h[[k]][jdx+1][k]) if( IsSubset2Itemsets(h[[k]], tempItemset) ){ Append2ItemSets(c[[k+1]], tempItemset) } } } } #append all new association rules to rule set AddRule2RuleSets(r,F[[idx]],h) } r }
To verify the R code, Arules and Rattle packages are applied while verifying the output.
The A-Priori algorithm
Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.
- Join action: given L_k, the set of frequent k-itemsets, a set of candidate (k+1)-itemsets, C_k+1, is generated by joining L_k with itself.
- Prune action: the size of C_k+1, the candidate itemset, is usually much bigger than that of L_k+1; to save computation cost, the monotonicity characteristic of frequent itemsets is used here to prune the size of C_k+1: every subset of a frequent itemset must itself be frequent, so any candidate containing an infrequent k-subset is removed.
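As an illustration of the two actions, here is a minimal base-R sketch of candidate generation with pruning. The function name and the representation of itemsets as sorted character vectors are assumptions for this sketch, not the book's implementation:

```r
apriori_gen <- function(Lk) {
  candidates <- list()
  n <- length(Lk)
  if (n < 2) return(candidates)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      u <- sort(union(Lk[[i]], Lk[[j]]))
      # join step: two frequent k-itemsets merge into one (k+1)-itemset
      # only if they share all but one item
      if (length(u) == length(Lk[[i]]) + 1) {
        # prune step: every k-subset of u must itself be frequent
        all_subsets_frequent <- all(vapply(seq_along(u), function(d) {
          any(vapply(Lk, function(l) setequal(l, u[-d]), logical(1)))
        }, logical(1)))
        already_there <- any(vapply(candidates,
                                    function(cnd) setequal(cnd, u),
                                    logical(1)))
        if (all_subsets_frequent && !already_there) {
          candidates[[length(candidates) + 1]] <- u
        }
      }
    }
  }
  candidates
}
```

For example, given L2 = {{A, B}, {A, C}, {B, C}}, the only surviving candidate is {A, B, C}; if {B, C} were missing from L2, the prune step would discard it.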
Here is the pseudocode to find all the frequent itemsets:
The R implementation
R code of the A-Priori frequent itemset generation algorithm goes here. D is a transaction dataset, and MIN_SUP is the minimum support count threshold. The output of the algorithm is L, the set of frequent itemsets in D.
The output of the A-Priori function can be verified with the R add-on package, arules, a pattern-mining and association-rules-mining package that includes the Apriori and Eclat algorithms. Here is the R code:
Apriori <- function (data, I, MIN_SUP, parameter = NULL){
  f <- CreateItemsets()
  # frequent 1-itemsets found in the first scan
  c <- FindFrequentItemset(data, I, 1, MIN_SUP)
  k <- 2
  len4data <- GetDatasetSize(data)
  while( !IsEmpty(c[[k-1]]) ){
    # join and prune: generate candidate k-itemsets
    f[[k]] <- AprioriGen(c[[k-1]])
    # scan the dataset to count the support of each candidate
    for( idx in 1:len4data ){
      ft <- GetSubSet(f[[k]], data[[idx]])
      len4ft <- GetDatasetSize(ft)
      for( jdx in 1:len4ft ){
        IncreaseSupportCount(f[[k]], ft[jdx])
      }
    }
    # keep only the candidates that reach MIN_SUP
    c[[k]] <- FindFrequentItemset(f[[k]], I, k, MIN_SUP)
    k <- k + 1
  }
  c
}
To verify the R code, apply the arules package and compare the outputs.
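Such a check against arules might look like the following sketch; the toy items and the support threshold are arbitrary choices for illustration:

```r
library(arules)

# A toy transaction list; arules coerces it to a transactions object
trans <- as(list(
  c("beer", "chips"),
  c("beer", "chips", "wine"),
  c("chips", "pizza")
), "transactions")

# Mine frequent itemsets with the built-in Apriori implementation;
# support is given as a fraction, so 2/3 means a count of at least 2
freq <- apriori(trans,
                parameter = list(supp = 2/3, target = "frequent itemsets"))
inspect(freq)
```

The itemsets and support values printed by inspect() can then be compared against the output of the hand-written function.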
Given:
At first, we will sort D into an ordered list, using either a predefined order or simply the natural order of characters; the latter is used here. Then:
Let's assume that the minimum support count is 5; the following table is an input dataset:
tid (transaction id) | List of items in the itemset or transaction
---|---
T001 |
T002 |
T003 |
T004 |
T005 |
T006 |
T007 |
T008 |
T009 |
T010 |
In the first scan or pass of the dataset D, get the count of each candidate 1-itemset. The candidate itemsets and their related counts are:
Itemset | Support count
---|---
 | 6
 | 8
 | 2
 | 5
 | 2
 | 3
 | 3
We will get the frequent 1-itemsets, L1, after comparing each support count with the minimum support count:
Itemset | Support count
---|---
 | 6
 | 8
 | 5
We will then generate the candidate 2-itemsets from L1 and count their support in another scan of D:
Itemset | Support count
---|---
 | 4
 | 3
 | 4
After comparing the support counts with the minimum support count, we will get the frequent 2-itemsets; here no candidate reaches the minimum support count of 5, so the algorithm then terminates.
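The scan-and-count steps of this walkthrough can be reproduced in base R for any transaction list; the item labels below are toy placeholders, not the (elided) items of the table above:

```r
# A toy transaction dataset
transactions <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)

# First scan: count the support of every single item (C1)
C1 <- table(unlist(transactions))
MIN_SUP <- 2

# Keep only the frequent 1-itemsets (L1)
L1 <- C1[C1 >= MIN_SUP]
print(L1)
```

The same table/filter pattern generalizes to later passes once candidate k-itemsets are generated.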
The A-Priori algorithm scans the dataset as many times as the length of the longest frequent pattern. This is the motivation for the Equivalence Class Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, for example, using <item id, tid set> instead of <tid, item id set>, that is, with the input data in the vertical format, such as the sample market basket file, to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1)-itemsets from k-itemsets.
The candidate itemsets are generated by set intersection. The vertical format structure is called a tidset, as defined earlier: if all the transaction IDs related to an item are stored in a vertical-format set, then that set is the tidset of the specific item.
The support count is computed by the intersection between tidsets. Given two itemsets X and Y with tidsets t(X) and t(Y), the support count of X ∪ Y is the cardinality of t(X) ∩ t(Y).
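The intersection-based support count can be computed directly in base R; the tidsets below are assumed for illustration:

```r
# Tidsets: the transaction IDs containing each item
t_beer  <- c(1, 2)
t_chips <- c(1, 2, 4)

# The support count of {beer, chips} is the cardinality of the
# intersection of the two tidsets
support_count <- length(intersect(t_beer, t_chips))
print(support_count)  # 2
```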
Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:
Eclat <- function (p, f, MIN_SUP){
  len4tidsets <- length(p)
  for(idx in 1:len4tidsets){
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for(jdx in idx:len4tidsets){
      if(ItemCompare(p[[jdx]], p[[idx]]) > 0){
        # intersect the two tidsets; the result supports the union itemset
        xab <- MergeTidSets(p[[idx]], p[[jdx]])
        if(GetSupport(xab) >= MIN_SUP){
          AddFrequentItemset(pa, xab, GetSupport(xab))
        }
      }
    }
    # recurse on the new equivalence class if it is not empty
    if(!IsEmptyTidSets(pa)){
      Eclat(pa, f, MIN_SUP)
    }
  }
}
Here is the running result of one example, I = {beer, chips, pizza, wine}. The transaction dataset with horizontal and vertical formats, respectively, is shown in the following tables:
tid | X
---|---
1 | {beer, chips, wine}
2 | {beer, chips}
3 | {pizza, wine}
4 | {chips, pizza}

x | tidset
---|---
beer | {1, 2}
chips | {1, 2, 4}
pizza | {3, 4}
wine | {1, 3}
The binary format of this information is in the following table.
tid | beer | chips | pizza | wine
---|---|---|---|---
1 | 1 | 1 | 0 | 1
2 | 1 | 1 | 0 | 0
3 | 0 | 0 | 1 | 1
4 | 0 | 1 | 1 | 0
Before calling the Eclat algorithm, we will set MIN_SUP=2.
The running process is illustrated in the following figure. After two iterations, we will get the frequent tidsets {<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>, <{beer, chips}, 12>}:
The output of the Eclat function can be verified with the R add-on package, arules.
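A verification sketch with the eclat function from arules, assuming the four transactions of the example above:

```r
library(arules)

trans <- as(list(
  c("beer", "chips", "wine"),
  c("beer", "chips"),
  c("pizza", "wine"),
  c("chips", "pizza")
), "transactions")

# MIN_SUP = 2 out of 4 transactions, i.e., a support fraction of 0.5
freq <- eclat(trans, parameter = list(supp = 0.5))
inspect(freq)
```

The reported itemsets should match the frequent tidsets listed above, with supports expressed as fractions of the four transactions.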
The FP-growth algorithm is an efficient method for mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of candidate itemsets is not needed here; a pattern-growth strategy is used instead, built on a data structure called the FP-tree.
The data structure used is a hybrid of the vertical and horizontal formats; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree (FP-tree). Here is an example of the generation of the structure with I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the upcoming images. Each node in the FP-tree represents an item, and the path from the root to that node represents an itemset; each node also stores the item and the support information of that itemset.
tid | X
---|---
1 | {A, B, C, D, E}
2 | {A, B, C, E}
3 | {A, D, E}
4 | {B, E, D}
5 | {B, E, C}
6 | {E, C, D}
7 | {E, D}
The sorted item order is listed in the following table:
item | E | D | C | B | A
---|---|---|---|---|---
support_count | 7 | 5 | 4 | 4 | 3
Reorder each transaction in the dataset by this new decreasing order to get the new sorted transaction dataset, as shown in this table:
tid | X
---|---
1 | {E, D, C, B, A}
2 | {E, C, B, A}
3 | {E, D, A}
4 | {E, D, B}
5 | {E, C, B}
6 | {E, D, C}
7 | {E, D}
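The counting and reordering steps can be reproduced in base R for the dataset above:

```r
transactions <- list(
  c("A","B","C","D","E"), c("A","B","C","E"), c("A","D","E"),
  c("B","E","D"), c("B","E","C"), c("E","C","D"), c("E","D")
)

# Support count of each item across all transactions
support <- table(unlist(transactions))

# Decreasing support order: E(7), D(5), then C, B (tied at 4), then A(3);
# the tie between C and B may resolve in either order here
item_order <- names(sort(support, decreasing = TRUE))

# Reorder every transaction by this global item order
sorted_tx <- lapply(transactions, function(tx) {
  tx[order(match(tx, item_order))]
})
print(sorted_tx[[3]])  # "E" "D" "A"
```

With MIN_SUP=3 every item here is frequent, so no items need to be removed; in general, infrequent items would be dropped from item_order (and hence from each transaction) before reordering.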
The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.
The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.
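The insertion step can be sketched in base R with nested lists, shown here for the first three sorted transactions. This is a simplification: a real implementation would also maintain the node links needed by the header table:

```r
# Each node holds an item label, a support count, and named children
new_node <- function(item) list(item = item, count = 0, children = list())

insert_tx <- function(node, tx) {
  if (length(tx) == 0) return(node)
  head <- tx[[1]]
  if (!(head %in% names(node$children))) {
    node$children[[head]] <- new_node(head)
  }
  # increment the support count on the path to this node
  node$children[[head]]$count <- node$children[[head]]$count + 1
  node$children[[head]] <- insert_tx(node$children[[head]], tx[-1])
  node
}

root <- new_node("root")
for (tx in list(c("E","D","C","B","A"), c("E","C","B","A"), c("E","D","A"))) {
  root <- insert_tx(root, tx)
}
print(root$children[["E"]]$count)  # 3
```

Because all three transactions start with E, they share the E node, whose count reaches 3; the paths diverge below it, which is what keeps the tree compact.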
Given MIN_SUP=3, the following itemsets can be processed according to this logic:
The results after performing steps 4 and 7 are listed here; the process of the algorithm is simple and straightforward:
A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.
The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here, the items are removed from the frequent pattern tree in reverse order, that is, starting from the leaves; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.
Here is the R source code of the main FP-growth algorithm:
FPGrowth <- function (r, p, f, MIN_SUP){
  RemoveInfrequentItems(r)
  if(IsPath(r)){
    # the tree is a single path: every subset of it is a frequent pattern
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for(idx in 1:len4y){
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, GetSupportCount(x))
    }
  }else{
    len4r <- GetLength(r)
    for(idx in 1:len4r){
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      # build the projected FP-tree for item r[idx]
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for( jdx in 1:len4path ){
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      # recurse on the non-empty projected tree
      if( !IsEmpty(rx) ){
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}
The GenMax algorithm is used to mine maximal frequent itemsets (MFI), to which the maximality properties are applied; that is, extra steps are added to check for maximal frequent itemsets instead of only frequent itemsets. It is based partially on the tidset intersection from the Eclat algorithm. The diffset, or differential set as it is also known, is used for fast frequency testing; it is the difference between the tidsets of two corresponding items.
The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.
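That maximality check can be sketched in base R; the function name and list representation are assumptions for this sketch:

```r
# Maintain a set M of maximal frequent itemsets as new frequent
# itemsets Y arrive; itemsets are character vectors
update_mfi <- function(M, Y) {
  if (length(M) > 0) {
    # if some X in M is a superset of Y, Y is not maximal: discard it
    if (any(vapply(M, function(X) all(Y %in% X), logical(1)))) return(M)
    # otherwise drop every X in M that is a subset of Y
    M <- M[!vapply(M, function(X) all(X %in% Y), logical(1))]
  }
  c(M, list(Y))
}

m <- list()
m <- update_mfi(m, c("A", "B"))
m <- update_mfi(m, c("A", "B", "C"))  # {A, B} is no longer maximal
print(length(m))  # 1
```

After the second call, {A, B} is replaced by its superset {A, B, C}, and a later candidate such as {A} would be discarded immediately.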
Here is the pseudocode of the initialization performed before calling the GenMax algorithm, where D is the input transaction dataset.
Here is the R source code of the main GenMax algorithm:
GenMax <- function (p, m, MIN_SUP){
  y <- GetItemsetUnion(p)
  # if the union of the whole class is already covered, prune this branch
  if( SuperSetExists(m, y) ){
    return
  }
  len4p <- GetLength(p)
  for(idx in 1:len4p){
    q <- GenerateFrequentTidSet()
    for(jdx in (idx+1):len4p){
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if(GetSupport(xij) >= MIN_SUP){
        AddFrequentItemset(q, xij, GetSupport(xij))
      }
    }
    if( !IsEmpty(q) ){
      GenMax(q, m, MIN_SUP)
    }else if( !SuperSetExists(m, p[[idx]]) ){
      # no extension is frequent: p[[idx]] is maximal
      Add2MFI(m, p[[idx]])
    }
  }
}
Closure checks are performed during the mining of closed frequent itemsets. A closed frequent itemset is the longest frequent pattern among those with the same support, which allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
Here is the pseudocode of the initialization performed before calling the Charm algorithm, where D is the input transaction dataset.
Here is the R source code of the main algorithm:
Charm <- function (p, c, MIN_SUP){
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in 1:len4p){
    q <- GenerateFrequentTidSet()
    for(jdx in (idx+1):len4p){
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if(GetSupport(xij) >= MIN_SUP){
        if( IsSameTidSets(p, idx, jdx) ){
          # identical tidsets: merge the itemsets and drop the second one
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        }else{
          if( IsSuperSet(p[[idx]], p[[jdx]]) ){
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          }else{
            Add2CFI(q, xij)
          }
        }
      }
    }
    if( !IsEmpty(q) ){
      Charm(q, c, MIN_SUP)
    }
    if( !SuperSetExists(c, p[[idx]]) ){
      # no closed superset with the same support: p[[idx]] is closed
      Add2CFI(c, p[[idx]])
    }
  }
}
During the A-Priori frequent itemset generation, the support count of each frequent itemset is calculated and recorded for the subsequent association rules mining.
To generate an association rule X => (l - X), where l is a frequent itemset, two steps are needed:
- First, get all the nonempty subsets of l.
- Then, for each nonempty subset X of l, the rule X => (l - X) is a strong association rule only if support_count(l) / support_count(X) >= MIN_CONF. The support count of any rule of a frequent itemset is not less than the minimum support count.
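The confidence test in the second step can be computed directly from the recorded support counts; a base-R sketch with assumed counts:

```r
# Support counts recorded during frequent itemset mining (assumed values)
support_count <- c("{A}" = 6, "{A,B}" = 4)
MIN_CONF <- 0.6

# confidence(A => B) = support_count({A,B}) / support_count({A})
conf <- support_count[["{A,B}"]] / support_count[["{A}"]]
is_strong <- conf >= MIN_CONF
print(conf)       # 0.6666667
print(is_strong)  # TRUE
```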
Here is the R source code of the main algorithm to generate A-Priori association rules:
AprioriGenerateRules <- function (D, F, MIN_SUP, MIN_CONF){
  # create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in 1:len4f){
    # add rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while( !IsEmptyItemSets(c[[k]]) ){
      # get heads of confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k+1]] <- CreateItemSets()
      # get candidate heads for the next level
      len4hk <- length(h[[k]])
      for(jdx in 1:(len4hk-1)){
        if( Match4Itemsets(h[[k]][jdx], h[[k]][jdx+1]) ){
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx+1][k])
          if( IsSubset2Itemsets(h[[k]], tempItemset) ){
            Append2ItemSets(c[[k+1]], tempItemset)
          }
        }
      }
      # move on to the next level of rule heads
      k <- k + 1
    }
    # append all new association rules to the rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}
To verify the R code, the arules and Rattle packages can be applied to check the output.
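With arules, rule generation can be checked end to end; the toy transactions and thresholds here are arbitrary choices for the sketch:

```r
library(arules)

trans <- as(list(
  c("beer", "chips"),
  c("beer", "chips", "wine"),
  c("chips", "pizza")
), "transactions")

# Mine strong association rules directly; supp and conf play the roles
# of MIN_SUP and MIN_CONF above, with support given as a fraction
rules <- apriori(trans,
                 parameter = list(supp = 0.5, conf = 0.6, target = "rules"))
inspect(rules)
```

Each reported rule carries its support, confidence, and lift, which can be compared against the hand-computed values.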
The R implementation
R code of the A-Priori frequent itemset generation algorithm goes here. D is a transaction dataset. Suppose MIN_SUP
is the minimum support count threshold. The output of the algorithm is L, which is a frequent itemsets in D.
The output of the A-Priori function can be verified with the R add-on package, arules, which is a pattern-mining and association-rules-mining package that includes A-Priori and éclat algorithms. Here is the R code:
Apriori <- function (data, I, MIN_SUP, parameter = NULL){ f <- CreateItemsets() c <- FindFrequentItemset(data,I,1, MIN_SUP) k <- 2 len4data <- GetDatasetSize(data) while( !IsEmpty(c[[k-1]]) ){ f[[k]] <- AprioriGen(c[k-1]) for( idx in 1: len4data ){ ft <- GetSubSet(f[[k]],data[[idx]]) len4ft <- GetDatasetSize(ft) for( jdx in 1:len4ft ){ IncreaseSupportCount(f[[k]],ft[jdx]) } } c[[k]] <- FindFrequentItemset(f[[k]],I,k,MIN_SUP) k <- k+1 } c }
To verify the R code, the arules package is applied while verifying the output.
Given:
At first, we will sort D into an ordered list in a predefined order algorithm or simply the natural order of characters, which is used here. Then:
Let's assume that the minimum support count is 5; the following table is an input dataset:
tid (transaction id) |
List of items in the itemset or transaction |
---|---|
T001 | |
T002 | |
T003 | |
T004 | |
T005 | |
T006 | |
T007 | |
T008 | |
T009 | |
T010 |
In the first scan or pass of the dataset D, get the count of each candidate itemset . The candidate itemset and its related count:
Itemset |
Support count |
---|---|
6 | |
8 | |
2 | |
5 | |
2 | |
3 | |
3 |
We will get the after comparing the support count with minimum support count.
Itemset |
Support count |
---|---|
6 | |
8 | |
5 |
We will generate by , .
Itemset |
Support count |
---|---|
4 | |
3 | |
4 |
After comparing the support count with the minimum support count, we will get . The algorithm then terminates.
The A-Priori algorithm loops as many times as the maximum length of the pattern somewhere. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, for example, using <item id
, tid set>
instead of <tid
, item id set>
that is, with the input data in the vertical format in the sample market basket file, or to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1) itemsets from k itemsets.
The candidate itemset is generated by set intersection. The vertical format structure is called a tidset as defined earlier. If all the transaction IDs related to the item I are stored in a vertical format transaction itemset, then the itemset is the tidset of the specific item.
The support count is computed by the intersection between tidsets. Given two tidsets, X
and Y
, is the cardinality of . The pseudocode is , .
Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f
is set to empty, and p
is the set of frequent 1-itemsets:
Eclat <- function (p,f,MIN_SUP){ len4tidsets <- length(p) for(idx in 1:len4tidsets){ AddFrequentItemset(f,p[[idx]],GetSupport(p[[idx]])) Pa <- GetFrequentTidSets(NULL,MIN_SUP) for(jdx in idx:len4tidsets){ if(ItemCompare(p[[jdx]],p[[idx]]) > 0){ xab <- MergeTidSets(p[[idx]],p[[jdx]]) if(GetSupport(xab)>=MIN_SUP){ AddFrequentItemset(pa,xab, GetSupport(xab)) } } } if(!IsEmptyTidSets(pa)){ Eclat(pa,f,MIN_SUP) } } }
Here is the running result of one example, I = {beer, chips, pizza, wine}
. The transaction dataset with horizontal and vertical formats, respectively, are shown in the following table:
tid |
X |
---|---|
1 |
|
2 |
|
3 |
|
4 |
|
x |
tidset |
---|---|
|
|
|
|
|
|
|
|
The binary format of this information is in the following table.
tid |
beer |
chips |
pizza |
wine |
---|---|---|---|---|
1 |
1 |
1 |
0 |
1 |
2 |
1 |
1 |
0 |
0 |
3 |
0 |
0 |
1 |
1 |
4 |
0 |
1 |
1 |
0 |
Before calling the Eclat algorithm, we will set MIN_SUP=2, ,
The running process is illustrated in the following figure. After two iterations, we will get frequent tidsets, {beer, 12 >, < chips, 124>, <pizza, 34>, <wine, 13>, < {beer, chips}, 12>}
:
The output of the Eclat function can be verified with the R add-on package, arules.
The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.
The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.
tid |
X |
---|---|
1 |
{A, B, C, D, E} |
2 |
{A, B, C, E} |
3 |
{A, D, E} |
4 |
{B, E, D} |
5 |
{B, E, C} |
6 |
{E, C, D} |
7 |
{E, D} |
The sorted item order is listed in the following table:
item |
E |
D |
C |
B |
A |
support_count |
7 |
5 |
4 |
4 |
3 |
Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:
tid |
X |
---|---|
1 |
{E, D, C, B, A} |
2 |
{E, C, B, A} |
3 |
{E, D, A} |
4 |
{E, D, B} |
5 |
{E, C, B} |
6 |
{E, D, C} |
7 |
{E, D} |
The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.
The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.
Given MIN_SUP=3, the following itemsets can be processed according to this logic:
The result after performing steps 4 and 7 are listed here, and the process of the algorithm is very simple and straight forward:
A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.
The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.
Here is the R source code of the main FP-growth algorithm:
FPGrowth <- function (r,p,f,MIN_SUP){ RemoveInfrequentItems(r) if(IsPath(r)){ y <- GetSubset(r) len4y <- GetLength(y) for(idx in 1:len4y){ x <- MergeSet(p,y[idx]) SetSupportCount(x, GetMinCnt(x)) Add2Set(f,x,support_count(x)) } }else{ len4r <- GetLength(r) for(idx in 1:len4r){ x <- MergeSet(p,r[idx]) SetSupportCount(x, GetSupportCount(r[idx])) rx <- CreateProjectedFPTree() path4idx <- GetAllSubPath(PathFromRoot(r,idx)) len4path <- GetLength(path4idx) for( jdx in 1:len4path ){ CountCntOnPath(r, idx, path4idx, jdx) InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx)) } if( !IsEmpty(rx) ){ FPGrowth(rx,x,f,MIN_SUP) } } } }
The GenMax algorithm is used to mine maximal frequent itemset (MFI) to which the maximality properties are applied, that is, more steps are added to check the maximal frequent itemsets instead of only frequent itemsets. This is based partially on the tidset intersection from the Eclat algorithm. The diffset, or the differential set as it is also known, is used for fast frequency testing. It is the difference between two tidsets of the corresponding items.
The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.
Here is the pseudocode before calling the GenMax algorithm, , where D is the input transaction dataset.
Here is the R source code of the main GenMax algorithm:
GenMax <- function (p,m,MIN_SUP){ y <- GetItemsetUnion(p) if( SuperSetExists(m,y) ){ return } len4p <- GetLenght(p) for(idx in 1:len4p){ q <- GenerateFrequentTidSet() for(jdx in (idx+1):len4p){ xij <- MergeTidSets(p[[idx]],p[[jdx]]) if(GetSupport(xij)>=MIN_SUP){ AddFrequentItemset(q,xij,GetSupport(xij)) } } if( !IsEmpty(q) ){ GenMax(q,m,MIN_SUP) }else if( !SuperSetExists(m,p[[idx]]) ){ Add2MFI(m,p[[idx]]) } } }
Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
Here is the pseudocode before calling the Charm algorithm, , where D is the input transaction dataset.
Here is the R source code of the main algorithm:
Charm <- function (p,c,MIN_SUP){ SortBySupportCount(p) len4p <- GetLength(p) for(idx in 1:len4p){ q <- GenerateFrequentTidSet() for(jdx in (idx+1):len4p){ xij <- MergeTidSets(p[[idx]],p[[jdx]]) if(GetSupport(xij)>=MIN_SUP){ if( IsSameTidSets(p,idx,jdx) ){ ReplaceTidSetBy(p,idx,xij) ReplaceTidSetBy(q,idx,xij) RemoveTidSet(p,jdx) }else{ if( IsSuperSet(p[[idx]],p[[jdx]]) ){ ReplaceTidSetBy(p,idx,xij) ReplaceTidSetBy(q,idx,xij) }else{ Add2CFI(q,xij) } } } } if( !IsEmpty(q) ){ Charm(q,c,MIN_SUP) } if( !IsSuperSetExists(c,p[[idx]]) ){ Add2CFI(m,p[[idx]]) } } }
During the process of generating an algorithm for A-Priori frequent itemsets, the support count of each frequent itemset is calculated and recorded for further association rules mining processing, that is, for association rules mining.
To generate an association rule , l
is a frequent itemset. Two steps are needed:
- First to get all the nonempty subsets of l.
- Then, for subset X of l, , the rule is a strong association rule only if The support count of any rule of a frequent itemset is not less than the minimum support count.
R code of the algorithm to generate A-Priori association is as follows:
Here is the R source code of the main algorithm:AprioriGenerateRules <- function (D,F,MIN_SUP,MIN_CONF){ #create empty rule set r <- CreateRuleSets() len4f <- length(F) for(idx in 1:len4f){ #add rule F[[idx]] => {} AddRule2RuleSets(r,F[[idx]],NULL) c <- list() c[[1]] <- CreateItemSets(F[[idx]]) h <- list() k <-1 while( !IsEmptyItemSets(c[[k]]) ){ #get heads of confident association rule in c[[k]] h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]],D,MIN_CONF) c[[k+1]] <- CreateItemSets() #get candidate heads len4hk <- length(h[[k]]) for(jdx in 1:(len4hk-1)){ if( Match4Itemsets(h[[k]][jdx], h[[k]][jdx+1]) ){ tempItemset <- CreateItemset (h[[k]][jdx],h[[k]][jdx+1][k]) if( IsSubset2Itemsets(h[[k]], tempItemset) ){ Append2ItemSets(c[[k+1]], tempItemset) } } } } #append all new association rules to rule set AddRule2RuleSets(r,F[[idx]],h) } r }
To verify the R code, Arules and Rattle packages are applied while verifying the output.
The A-Priori algorithm loops as many times as the maximum length of the pattern somewhere. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, for example, using <item id
, tid set>
instead of <tid
, item id set>
that is, with the input data in the vertical format in the sample market basket file, or to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1) itemsets from k itemsets.
The candidate itemset is generated by set intersection. The vertical format structure is called a tidset as defined earlier. If all the transaction IDs related to the item I are stored in a vertical format transaction itemset, then the itemset is the tidset of the specific item.
The support count is computed by the intersection between tidsets. Given two tidsets, X
and Y
, is the cardinality of . The pseudocode is , .
Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f
is set to empty, and p
is the set of frequent 1-itemsets:
Eclat <- function (p,f,MIN_SUP){ len4tidsets <- length(p) for(idx in 1:len4tidsets){ AddFrequentItemset(f,p[[idx]],GetSupport(p[[idx]])) Pa <- GetFrequentTidSets(NULL,MIN_SUP) for(jdx in idx:len4tidsets){ if(ItemCompare(p[[jdx]],p[[idx]]) > 0){ xab <- MergeTidSets(p[[idx]],p[[jdx]]) if(GetSupport(xab)>=MIN_SUP){ AddFrequentItemset(pa,xab, GetSupport(xab)) } } } if(!IsEmptyTidSets(pa)){ Eclat(pa,f,MIN_SUP) } } }
Here is the running result of one example, I = {beer, chips, pizza, wine}
. The transaction dataset with horizontal and vertical formats, respectively, are shown in the following table:
tid |
X |
---|---|
1 |
|
2 |
|
3 |
|
4 |
|
x |
tidset |
---|---|
|
|
|
|
|
|
|
|
The binary format of this information is in the following table.
tid |
beer |
chips |
pizza |
wine |
---|---|---|---|---|
1 |
1 |
1 |
0 |
1 |
2 |
1 |
1 |
0 |
0 |
3 |
0 |
0 |
1 |
1 |
4 |
0 |
1 |
1 |
0 |
Before calling the Eclat algorithm, we will set MIN_SUP=2, ,
The running process is illustrated in the following figure. After two iterations, we will get frequent tidsets, {beer, 12 >, < chips, 124>, <pizza, 34>, <wine, 13>, < {beer, chips}, 12>}
:
The output of the Eclat function can be verified with the R add-on package, arules.
The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.
The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.
tid |
X |
---|---|
1 |
{A, B, C, D, E} |
2 |
{A, B, C, E} |
3 |
{A, D, E} |
4 |
{B, E, D} |
5 |
{B, E, C} |
6 |
{E, C, D} |
7 |
{E, D} |
The sorted item order is listed in the following table:
item |
E |
D |
C |
B |
A |
support_count |
7 |
5 |
4 |
4 |
3 |
Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:
tid |
X |
---|---|
1 |
{E, D, C, B, A} |
2 |
{E, C, B, A} |
3 |
{E, D, A} |
4 |
{E, D, B} |
5 |
{E, C, B} |
6 |
{E, D, C} |
7 |
{E, D} |
The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.
The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.
Given MIN_SUP=3, the following itemsets can be processed according to this logic:
The results after performing steps 4 and 7 are listed here; the process of the algorithm is simple and straightforward:
A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.
The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent patterns or itemsets. Here, items are removed from the frequent pattern tree in reverse order, that is, starting from the leaves; therefore, the order is A, B, C, D, and E. Using this order, we then build the projected FP-tree for each item.
Here is the R source code of the main FP-growth algorithm:
FPGrowth <- function(r, p, f, MIN_SUP){
  RemoveInfrequentItems(r)
  if( IsPath(r) ){
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for(idx in 1:len4y){
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, GetSupportCount(x))
    }
  }else{
    len4r <- GetLength(r)
    for(idx in 1:len4r){
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for(jdx in 1:len4path){
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      if( !IsEmpty(rx) ){
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}
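Since the helpers above are pseudocode, here is a self-contained, runnable base-R sketch of the same pattern-growth idea. It recursively projects the plain transaction list instead of maintaining a real FP-tree, and fixes the item order inside a pattern lexicographically to avoid duplicates, so it is an illustrative stand-in rather than an efficient implementation:

```r
# Mine all frequent itemsets by recursive dataset projection (pattern growth).
# tx: list of character vectors; returns a named list, e.g. "A,E" -> support.
pattern_growth <- function(tx, prefix = character(0), min_sup = 3) {
  items <- unlist(lapply(tx, unique))
  if (length(items) == 0) return(list())      # nothing left to project
  counts <- table(items)
  freq <- names(counts[counts >= min_sup])
  result <- list()
  for (it in freq) {
    pat <- sort(c(prefix, it))
    result[[paste(pat, collapse = ",")]] <- as.integer(counts[[it]])
    # project on transactions containing `it`, keeping only items after `it`
    proj <- lapply(tx[vapply(tx, function(t) it %in% t, logical(1))],
                   function(t) t[t > it])
    result <- c(result, pattern_growth(proj, pat, min_sup))
  }
  result
}

# the dataset D from the tables above, mined with MIN_SUP = 3
D <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)
res <- pattern_growth(D, min_sup = 3)
```

For this dataset the miner returns eleven frequent itemsets, including {E} with support 7 and {B, C, E} with support 3.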
The GenMax algorithm is used to mine maximal frequent itemsets (MFIs), to which the maximality property is applied; that is, extra steps are added to check for maximal frequent itemsets rather than only frequent itemsets. It is based partially on the tidset intersection from the Eclat algorithm. The diffset, or differential set, is the difference between the tidsets of two corresponding items and is used for fast frequency testing.
A candidate MFI is checked against the definition: assuming M is the set of MFIs found so far, if there is an X in M that is a superset of the newly found frequent itemset Y, then Y is discarded; conversely, if X is a subset of Y, then X is removed from M.
Here is the pseudocode before calling the GenMax algorithm, where D is the input transaction dataset.
Here is the R source code of the main GenMax algorithm:
GenMax <- function(p, m, MIN_SUP){
  y <- GetItemsetUnion(p)
  if( SuperSetExists(m, y) ){
    return
  }
  len4p <- GetLength(p)
  for(idx in 1:len4p){
    q <- GenerateFrequentTidSet()
    for(jdx in (idx+1):len4p){
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if( GetSupport(xij) >= MIN_SUP ){
        AddFrequentItemset(q, xij, GetSupport(xij))
      }
    }
    if( !IsEmpty(q) ){
      GenMax(q, m, MIN_SUP)
    }else if( !SuperSetExists(m, p[[idx]]) ){
      Add2MFI(m, p[[idx]])
    }
  }
}
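GenMax interleaves this maximality check with the search itself. As an illustration of the check alone, the following base-R snippet filters a set of frequent itemsets down to the maximal ones by brute force; the itemsets listed were derived from the MIN_SUP=3 dataset D used in the FP-growth example above:

```r
# frequent itemsets of the MIN_SUP = 3 running example
fis <- list(c("A"), c("A","E"), c("B"), c("B","C"), c("B","C","E"),
            c("B","E"), c("C"), c("C","E"), c("D"), c("D","E"), c("E"))

is_subset <- function(a, b) all(a %in% b)

# an itemset is maximal if no other frequent itemset is a proper superset
is_maximal <- vapply(seq_along(fis), function(i)
  !any(vapply(seq_along(fis), function(j)
    i != j && is_subset(fis[[i]], fis[[j]]), logical(1))),
  logical(1))

mfi <- fis[is_maximal]
```

For this example the maximal frequent itemsets are {A, E}, {B, C, E}, and {D, E}; every other frequent itemset is contained in one of them.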
Closure checks are performed during the mining of closed frequent itemsets. Mining closed frequent itemsets yields the longest frequent patterns that have the same support as their subsets, which allows us to prune the frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
Here is the pseudocode before calling the Charm algorithm, where D is the input transaction dataset.
Here is the R source code of the main algorithm:
Charm <- function(p, c, MIN_SUP){
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in 1:len4p){
    q <- GenerateFrequentTidSet()
    for(jdx in (idx+1):len4p){
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if( GetSupport(xij) >= MIN_SUP ){
        if( IsSameTidSets(p, idx, jdx) ){
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        }else{
          if( IsSuperSet(p[[idx]], p[[jdx]]) ){
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          }else{
            Add2CFI(q, xij)
          }
        }
      }
    }
    if( !IsEmpty(q) ){
      Charm(q, c, MIN_SUP)
    }
    if( !SuperSetExists(c, p[[idx]]) ){
      Add2CFI(c, p[[idx]])
    }
  }
}
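As with maximality, the closure check can be illustrated on its own: an itemset is closed if no proper superset has exactly the same support. The brute-force base-R sketch below (a post-filter for illustration, not the Charm search itself) uses the frequent itemsets and support counts derived from the MIN_SUP=3 dataset D above:

```r
# frequent itemsets of the running example and their support counts
fis <- list(c("A"), c("A","E"), c("B"), c("B","C"), c("B","C","E"),
            c("B","E"), c("C"), c("C","E"), c("D"), c("D","E"), c("E"))
sup <- c(3, 3, 4, 3, 3, 4, 4, 4, 5, 5, 7)

# closed: no proper superset with exactly the same support
is_closed <- vapply(seq_along(fis), function(i)
  !any(vapply(seq_along(fis), function(j)
    i != j && sup[j] == sup[i] && all(fis[[i]] %in% fis[[j]]),
    logical(1))),
  logical(1))

cfi <- fis[is_closed]
```

Here {D} is pruned because {D, E} has the same support of 5, while {B, E} stays closed because its only frequent superset {B, C, E} has a lower support.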
While the A-Priori algorithm generates the frequent itemsets, the support count of each frequent itemset is calculated and recorded for the subsequent association rules mining.
To generate an association rule from a frequent itemset l, two steps are needed:
- First, get all the nonempty proper subsets of l.
- Then, for each such subset X of l, the rule X => (l - X) is a strong association rule only if confidence(X => (l - X)) = support_count(l) / support_count(X) >= MIN_CONF. Note that the support count of any rule derived from a frequent itemset is not less than the minimum support count.
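These two steps can be made concrete with numbers taken from the Eclat market basket example later in the chapter, where support_count({beer}) = 2, support_count({chips}) = 3, and support_count({beer, chips}) = 2. The snippet below only illustrates the confidence test, with MIN_CONF = 0.75 chosen arbitrarily:

```r
# support counts from the {beer, chips} example
sup <- c("beer" = 2, "chips" = 3, "beer,chips" = 2)
l <- c("beer", "chips")
MIN_CONF <- 0.75

rules <- list()
for (X in list("beer", "chips")) {          # nonempty proper subsets of l
  conf <- sup[["beer,chips"]] / sup[[paste(X, collapse = ",")]]
  if (conf >= MIN_CONF) {                   # keep only the strong rules
    rules[[length(rules) + 1]] <- list(lhs = X, rhs = setdiff(l, X), conf = conf)
  }
}
```

Only beer => chips survives (confidence 2/2 = 1); chips => beer has confidence 2/3 and is dropped.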
Here is the R source code of the algorithm to generate the A-Priori association rules:
AprioriGenerateRules <- function(D, F, MIN_SUP, MIN_CONF){
  #create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in 1:len4f){
    #add rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while( !IsEmptyItemSets(c[[k]]) ){
      #get heads of confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k+1]] <- CreateItemSets()
      #get candidate heads
      len4hk <- length(h[[k]])
      for(jdx in 1:(len4hk-1)){
        if( Match4Itemsets(h[[k]][jdx], h[[k]][jdx+1]) ){
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx+1][k])
          if( IsSubset2Itemsets(h[[k]], tempItemset) ){
            Append2ItemSets(c[[k+1]], tempItemset)
          }
        }
      }
      #move on to the next size of rule heads
      k <- k + 1
    }
    #append all new association rules to the rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}
To verify the R code, the arules and Rattle packages can be applied to check the output.
The Eclat algorithm
The A-Priori algorithm scans the dataset as many times as the length of the longest pattern. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, for example, using <item id, tid set> instead of <tid, item id set>; that is, with the input data in the vertical format of the sample market basket file, it discovers frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1)-itemsets from k-itemsets.
The candidate itemsets are generated by set intersection. The vertical format structure is called a tidset, as defined earlier: the tidset of an item I is the set of the IDs of all transactions that contain I.
The support count is computed by the intersection of tidsets: given two itemsets X and Y with tidsets t(X) and t(Y), support_count(X ∪ Y) is the cardinality of t(X) ∩ t(Y).
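Using the market basket example that follows, the tidset intersection looks like this in base R (the item names and tidsets are taken from that example):

```r
t_beer  <- c(1, 2)       # tidset of {beer}
t_chips <- c(1, 2, 4)    # tidset of {chips}

# support_count({beer, chips}) is the cardinality of the intersection
t_both <- intersect(t_beer, t_chips)
support_count <- length(t_both)
```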
The R implementation
Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:
Eclat <- function(p, f, MIN_SUP){
  len4tidsets <- length(p)
  for(idx in 1:len4tidsets){
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for(jdx in idx:len4tidsets){
      if( ItemCompare(p[[jdx]], p[[idx]]) > 0 ){
        xab <- MergeTidSets(p[[idx]], p[[jdx]])
        if( GetSupport(xab) >= MIN_SUP ){
          AddFrequentItemset(pa, xab, GetSupport(xab))
        }
      }
    }
    if( !IsEmptyTidSets(pa) ){
      Eclat(pa, f, MIN_SUP)
    }
  }
}
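Because the helpers above are pseudocode, here is a self-contained base-R version of the same recursion. It operates on a named list mapping each itemset label (items joined by commas) to its tidset; it is a sketch for illustration, not tuned code, and it assumes the input list already contains only frequent 1-itemsets:

```r
# p: named list, itemset label -> tidset vector; returns all frequent itemsets
eclat <- function(p, min_sup) {
  f <- list()
  labs <- names(p)
  for (i in seq_along(p)) {
    f[[labs[i]]] <- p[[i]]                     # record the frequent itemset
    q <- list()
    for (j in seq_along(p)) {
      if (j > i) {
        x <- intersect(p[[i]], p[[j]])         # tidset intersection
        if (length(x) >= min_sup) {
          items <- union(strsplit(labs[i], ",")[[1]],
                         strsplit(labs[j], ",")[[1]])
          q[[paste(sort(items), collapse = ",")]] <- x
        }
      }
    }
    if (length(q) > 0) f <- c(f, eclat(q, min_sup))
  }
  f
}

# frequent 1-itemsets of the market basket example, in vertical format
p1 <- list(beer = c(1, 2), chips = c(1, 2, 4),
           pizza = c(3, 4), wine = c(1, 3))
res <- eclat(p1, min_sup = 2)
```

On this input the function returns five frequent itemsets, matching the example that follows: the four single items plus {beer, chips} with tidset {1, 2}.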
Here is the running result of one example, I = {beer, chips, pizza, wine}. The transaction dataset in horizontal and vertical formats, respectively, is shown in the following tables:
| tid | X |
|---|---|
| 1 | {beer, chips, wine} |
| 2 | {beer, chips} |
| 3 | {pizza, wine} |
| 4 | {chips, pizza} |

| x | tidset |
|---|---|
| beer | {1, 2} |
| chips | {1, 2, 4} |
| pizza | {3, 4} |
| wine | {1, 3} |
The binary format of this information is in the following table.
| tid | beer | chips | pizza | wine |
|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 |
| 4 | 0 | 1 | 1 | 0 |
Before calling the Eclat algorithm, we will set MIN_SUP=2.
The running process is illustrated in the following figure. After two iterations, we will get the frequent tidsets {<beer, {1, 2}>, <chips, {1, 2, 4}>, <pizza, {3, 4}>, <wine, {1, 3}>, <{beer, chips}, {1, 2}>}:
The output of the Eclat function can be verified with the R add-on package, arules.
Input data characteristics and data structure
The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.
| tid | X |
|---|---|
| 1 | {A, B, C, D, E} |
| 2 | {A, B, C, E} |
| 3 | {A, D, E} |
| 4 | {B, E, D} |
| 5 | {B, E, C} |
| 6 | {E, C, D} |
| 7 | {E, D} |
The sorted item order is listed in the following table:
| item | E | D | C | B | A |
|---|---|---|---|---|---|
| support_count | 7 | 5 | 4 | 4 | 3 |
Reorder each transaction according to this decreasing order to get the new sorted transaction dataset, as shown in this table:
| tid | X |
|---|---|
| 1 | {E, D, C, B, A} |
| 2 | {E, C, B, A} |
| 3 | {E, D, A} |
| 4 | {E, D, B} |
| 5 | {E, C, B} |
| 6 | {E, D, C} |
| 7 | {E, D} |
The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.
The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.
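The ordering and reordering steps can be sketched in a few lines of base R, using the transactions from the tables above (MIN_SUP = 3; the variable names are invented). Note that ties in support count (here C and B, both 4) may be broken arbitrarily:

```r
# the transaction dataset from the first table
D <- list(c("A","B","C","D","E"), c("A","B","C","E"), c("A","D","E"),
          c("B","E","D"), c("B","E","C"), c("E","C","D"), c("E","D"))
MIN_SUP <- 3

# count the support of each item, keep the frequent ones,
# and sort them by decreasing support count
cnt <- sort(table(unlist(D)), decreasing = TRUE)
cnt <- cnt[cnt >= MIN_SUP]

# reorder every transaction by this decreasing support order
order_items <- names(cnt)
D_sorted <- lapply(D, function(t) order_items[order_items %in% t])

cnt            # E:7, D:5, then C and B (4 each), A:3
D_sorted[[4]]  # transaction {B, E, D} becomes {E, D, B}
```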
Given MIN_SUP=3, the following itemsets can be processed according to this logic:
The results after performing steps 4 and 7 are listed here; the process of the algorithm is simple and straightforward:
A header table is usually bound together with the frequent pattern tree; each record of the header table stores a link to the nodes of one specific item.
The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent patterns or itemsets. Here is an example of removing items from the frequent pattern tree in reverse support order, that is, starting from the leaves; the order is therefore A, B, C, D, and E. Using this order, we then build the projected FP-tree for each item.
The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth <- function(r, p, f, MIN_SUP) {
  RemoveInfrequentItems(r)
  if (IsPath(r)) {
    # r is a single path: every subset of it is frequent
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for (idx in seq_len(len4y)) {
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, support_count(x))
    }
  } else {
    len4r <- GetLength(r)
    for (idx in seq_len(len4r)) {
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      # build the projected (conditional) FP-tree for item r[idx]
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for (jdx in seq_len(len4path)) {
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      if (!IsEmpty(rx)) {
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}
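When the (projected) tree degenerates into a single path, the IsPath branch of the algorithm simply emits every nonempty subset of the path, with support equal to the smallest count along it. A small base-R illustration, with invented path counts:

```r
# a single path E:7 -> D:5 -> C:3; every nonempty subset of {E, D, C}
# is frequent with support equal to the minimum count among its items
path_items <- c("E", "D", "C")
path_cnt   <- c(E = 7, D = 5, C = 3)

# enumerate all nonempty subsets of the path
subsets <- unlist(lapply(seq_along(path_items),
                         function(k) combn(path_items, k, simplify = FALSE)),
                  recursive = FALSE)

# support of each subset = minimum count along the path
patterns <- sapply(subsets, function(s) min(path_cnt[s]))
names(patterns) <- sapply(subsets, paste, collapse = ",")
patterns  # seven patterns, e.g. "E,D" with support 5 and "E,D,C" with support 3
```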
The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemsets (MFI), to which the maximality properties are applied; that is, extra steps are added to check for maximal frequent itemsets instead of only frequent itemsets. It is based partially on the tidset intersection from the Eclat algorithm. The diffset, or differential set as it is also known, is used for fast frequency testing; it is the difference between the tidsets of two corresponding items.
The candidate MFI is determined by its definition: assuming M is the set of MFIs found so far, if there is an X in M that is a superset of the newly found frequent itemset Y, then Y is discarded; conversely, if X is a subset of Y, then X should be removed from M.
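The diffset trick can be sketched in base R (the tidsets below are invented): the support of XY follows from the support of X and the size of the diffset, without materializing the intersection.

```r
# hypothetical tidsets for items X and Y
tX <- c(1, 2, 3, 5, 6)
tY <- c(1, 3, 4, 6, 7)

# diffset d(XY): transactions containing X but not Y
dXY <- setdiff(tX, tY)

# support(XY) = support(X) - |d(XY)|
support_XY <- length(tX) - length(dXY)
support_XY  # same value as length(intersect(tX, tY))
```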
Here is the pseudocode before calling the GenMax algorithm, where D is the input transaction dataset.
The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax <- function(p, m, MIN_SUP) {
  y <- GetItemsetUnion(p)
  if (SuperSetExists(m, y)) {
    # the union is already covered by a known MFI: prune this branch
    return(invisible(NULL))
  }
  len4p <- GetLength(p)
  for (idx in seq_len(len4p)) {
    q <- GenerateFrequentTidSet()
    for (jdx in seq_len(len4p - idx) + idx) {  # jdx = idx+1, ..., len4p (empty when idx == len4p)
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        AddFrequentItemset(q, xij, GetSupport(xij))
      }
    }
    if (!IsEmpty(q)) {
      GenMax(q, m, MIN_SUP)
    } else if (!SuperSetExists(m, p[[idx]])) {
      Add2MFI(m, p[[idx]])
    }
  }
}
The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support, which lets us prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
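The core tidset check can be illustrated in base R (the tidsets below are invented): when two itemsets have identical tidsets, they always occur together with the same support, so Charm merges them into a single closed candidate.

```r
# hypothetical itemsets with their tidsets
tidsets <- list("A" = c(1, 2, 4, 5), "C" = c(1, 2, 4, 5), "D" = c(1, 3, 4))

same_tidset <- function(x, y) identical(sort(x), sort(y))

if (same_tidset(tidsets[["A"]], tidsets[["C"]])) {
  # A and C always occur together: replace both with the merged candidate AC,
  # which has exactly the same tidset (and so the same support)
  tidsets[["AC"]] <- tidsets[["A"]]
  tidsets[["A"]] <- NULL
  tidsets[["C"]] <- NULL
}
names(tidsets)  # only D and the merged candidate AC remain
```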
Here is the pseudocode before calling the Charm algorithm, where D is the input transaction dataset.
The R implementation

Here is the R source code of the main algorithm:

Charm <- function(p, c, MIN_SUP) {
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for (idx in seq_len(len4p)) {
    q <- GenerateFrequentTidSet()
    for (jdx in seq_len(len4p - idx) + idx) {  # jdx = idx+1, ..., len4p (empty when idx == len4p)
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        if (IsSameTidSets(p, idx, jdx)) {
          # identical tidsets: merge the itemsets and drop the duplicate
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        } else {
          if (IsSuperSet(p[[idx]], p[[jdx]])) {
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          } else {
            Add2CFI(q, xij)
          }
        }
      }
    }
    if (!IsEmpty(q)) {
      Charm(q, c, MIN_SUP)
    }
    if (!IsSuperSetExists(c, p[[idx]])) {
      Add2CFI(c, p[[idx]])
    }
  }
}
There are two interesting applications of association rules mining: one is multilevel and multidimensional association rules mining, while the other is constraint-based mining.
Mining multilevel and multidimensional association rules

For a given transactional dataset, if a conceptual hierarchy exists for some dimensions of the dataset, then we can apply multilevel association rules mining to it. Any association rules mining algorithm applicable to transaction datasets can be used for this task. The following table shows an example from the Amazon store:
| TID | Item purchased |
|---|---|
| 1 | Dell Venue 7 16 GB Tablet, HP Pavilion 17-e140us 17.3-Inch Laptop... |
| 2 | Samsung Galaxy Tab 3 Lite, Razer Edge Pro 256GB Tablet... |
| 3 | Acer C720P-2666 Chromebook, Logitech Wireless Combo MK270 with Keyboard and Mouse... |
| 4 | Toshiba CB35-A3120 13.3-Inch Chromebook, Samsung Galaxy Tab 3 (7-Inch, White)... |
Have a look at the following flowchart that explains multilevel pattern mining:
Based on the conceptual hierarchy, lower-level concepts can be projected to higher-level concepts, and the new dataset with higher-level concepts can replace the original lower-level concepts.
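The projection step can be sketched in base R; the hierarchy mapping and the transactions below are invented for illustration:

```r
# map each lower-level item to its higher-level concept (illustrative only)
hierarchy <- c("Dell Venue 7"         = "Tablet",
               "Samsung Galaxy Tab 3" = "Tablet",
               "HP Pavilion 17"       = "Laptop",
               "Acer C720P"           = "Chromebook",
               "Toshiba CB35"         = "Chromebook")

D <- list(c("Dell Venue 7", "HP Pavilion 17"),
          c("Samsung Galaxy Tab 3", "Acer C720P"),
          c("Toshiba CB35", "Samsung Galaxy Tab 3"))

# replace each lower-level item with its higher-level concept;
# the projected dataset can then be mined by any standard algorithm
D_high <- lapply(D, function(t) unique(unname(hierarchy[t])))
D_high[[3]]  # {Toshiba CB35, Samsung Galaxy Tab 3} becomes {Chromebook, Tablet}
```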
The support counts are calculated at each conceptual level. Many A-Priori-like algorithms are designed with slightly different treatments of the support count; here are the common options:
- A uniform minimum support threshold is used across all the levels
- Reduced minimum support threshold is used for lower levels
- Group-based minimum support threshold
Constraint-based frequent pattern mining

Constraint-based frequent pattern mining is a heuristic method that uses user-specified constraints to prune the search space.
Common constraints include, but are not limited to, the following:
- Knowledge-type constraint (specifies what we are going to mine)
- Data constraint (limits to the original dataset)
- Dimension-level constraints
- Interestingness constraints
- Rule constraints
Mining multilevel and multidimensional association rules
For a given transactional dataset, if a conceptual hierarchy exists for some of the dataset's dimensions, then we can apply multilevel association rules mining to it. Any association rules mining algorithm applicable to transaction datasets can be used for this task. The following table shows an example from the Amazon store:
| TID | Items purchased |
|---|---|
| 1 | Dell Venue 7 16 GB Tablet, HP Pavilion 17-e140us 17.3-Inch Laptop... |
| 2 | Samsung Galaxy Tab 3 Lite, Razer Edge Pro 256GB Tablet... |
| 3 | Acer C720P-2666 Chromebook, Logitech Wireless Combo MK270 with Keyboard and Mouse... |
| 4 | Toshiba CB35-A3120 13.3-Inch Chromebook, Samsung Galaxy Tab 3 (7-Inch, White)... |
Have a look at the following flowchart that explains multilevel pattern mining:
Based on the conceptual hierarchy, lower-level concepts can be projected to higher-level concepts, and the new dataset with higher-level concepts can replace the original lower-level concepts.
The support counts are calculated at each conceptual level. Many A-Priori-like algorithms are designed with slightly different treatments of the support count; here are the common options:
- A uniform minimum support threshold is used across all the levels
- Reduced minimum support threshold is used for lower levels
- Group-based minimum support threshold
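The projection step above can be sketched in a few lines of base R. This is a minimal illustration, not the book's code: the concept hierarchy, item names, and category names are all hypothetical, and each transaction's items are simply replaced by their higher-level concepts before support counting.

```r
# Hypothetical concept hierarchy: low-level item -> higher-level concept
hierarchy <- c(
  "Dell Venue 7"         = "Tablet",
  "Samsung Galaxy Tab 3" = "Tablet",
  "HP Pavilion 17"       = "Laptop",
  "Acer C720P"           = "Chromebook",
  "Toshiba CB35"         = "Chromebook"
)

# Toy transaction dataset at the lower conceptual level
transactions <- list(
  c("Dell Venue 7", "HP Pavilion 17"),
  c("Samsung Galaxy Tab 3", "Acer C720P"),
  c("Toshiba CB35", "Samsung Galaxy Tab 3")
)

# Project each transaction onto the higher-level concepts
high_level <- lapply(transactions, function(t) unique(unname(hierarchy[t])))

# Support count of an itemset in a transaction database
support_count <- function(itemset, db) {
  sum(vapply(db, function(t) all(itemset %in% t), logical(1)))
}

support_count("Tablet", high_level)                  # 3
support_count(c("Tablet", "Chromebook"), high_level) # 2
```

At the higher level, {Tablet, Chromebook} becomes frequent even though no pair of the original low-level items co-occurs more than once; this is exactly why mining at multiple conceptual levels can surface rules that are invisible at the leaf level.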
Constraint-based frequent pattern mining
Constraint-based frequent pattern mining is a heuristic method that uses user-specified constraints to prune the search space.
Common constraints include, but are not limited to, the following:
- Knowledge-type constraints (specify what we are going to mine)
- Data constraints (limit the portion of the original dataset to be mined)
- Dimension-level constraints
- Interestingness constraints
- Rule constraints
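A rule constraint can be sketched as a predicate that candidates must satisfy before their support is even counted. The following base R fragment is illustrative only (the items and prices are invented); it uses an anti-monotone constraint, which is the kind that prunes most aggressively, because once a set violates it, every superset does too.

```r
# Hypothetical item prices used to state the constraint
prices <- c(bread = 2, milk = 3, laptop = 900, tablet = 400)
items  <- names(prices)

# All candidate 2-itemsets
candidates <- combn(items, 2, simplify = FALSE)

# Anti-monotone rule constraint: total price of the itemset is at most 10.
# A violating set can be discarded along with all of its supersets.
satisfies <- function(itemset) sum(prices[itemset]) <= 10

pruned <- Filter(satisfies, candidates)
pruned  # only the set {bread, milk} survives
```

In an A-Priori-style loop, `satisfies` would be checked right after candidate generation, so the expensive support-counting pass only ever sees candidates that can still lead to acceptable rules.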
Sequential pattern mining is the major task in sequence dataset mining. A-Priori-like algorithms are used to mine sequential patterns; they apply a breadth-first strategy. The pattern-growth method, in contrast, uses a depth-first strategy. Either approach can be integrated with constraints for various reasons.
The common purchase patterns of a store's customers can be mined from sequential patterns. Sequential patterns also play an important role in other areas, especially advertisement and marketing campaigns. An individual customer's behavior can be predicted from sequential patterns in domains such as web log mining, web page recommendation systems, bioinformatics analysis, medical treatment tracking and analysis, and disaster prevention and safety management.
The rules mined from sequence patterns in this chapter are of many types. Some of them are listed as follows:
- A sequential rule is of the form X → Y, where X and Y are sequences and X is a subsequence of Y
- A label sequential rule (LSR) is of the form X → Y, where Y is a sequence and X is a sequence generated from Y by replacing some of its items with wildcards
- A class sequential rule (CSR) is of the form X → y, where X is a sequence and y is a class label
Sequence dataset
A sequence dataset S is defined as a set of tuples, (sid, s), in which sid is a sequence ID, and s is a sequence.
The support of a sequence X in a sequence dataset S is the number of tuples in S that contain X: supp(X) = |{(sid, s) ∈ S : X is a subsequence of s}|.
Here is a property intrinsic to sequential patterns, analogous to the A-Priori property used by the A-Priori algorithm: for a sequence X and its subsequence Y, supp(X) ≤ supp(Y).
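These definitions can be made concrete with a few lines of base R. This is a minimal sketch under a simplifying assumption: each sequence is a plain vector of single items (no itemset elements), so "X is a subsequence of s" just means the items of X occur in s in order, not necessarily contiguously.

```r
# TRUE if x occurs in s in order (not necessarily contiguously)
is_subseq <- function(x, s) {
  pos <- 1
  for (item in s) {
    if (pos <= length(x) && item == x[pos]) pos <- pos + 1
  }
  pos > length(x)
}

# A toy sequence dataset S: a set of (sid, s) tuples
S <- list(
  list(sid = 1, s = c("a", "b", "c", "d")),
  list(sid = 2, s = c("a", "c")),
  list(sid = 3, s = c("b", "d"))
)

# Support: number of tuples in S whose sequence contains x
supp <- function(x, S) sum(vapply(S, function(t) is_subseq(x, t$s), logical(1)))

supp(c("a", "c"), S)       # 2 (sequences 1 and 2)
supp(c("a", "c", "d"), S)  # 1 (sequence 1 only)
```

Note that `<a, c>` is a subsequence of `<a, c, d>` and has support 2 ≥ 1, illustrating the property above: extending a sequence can never increase its support.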
The GSP algorithm
The generalized sequential patterns (GSP) algorithm is an A-Priori-like algorithm applied to sequence patterns. It is a level-wise algorithm with a breadth-first approach. Here is its feature list:
- GSP is an extension of the A-Priori algorithm; it uses the A-Priori property (downward closure), that is, given the minimum support count, if a sequence is not accepted, all its supersequences are discarded as well
- It requires multiple passes over the initial transaction dataset
- It uses the horizontal data format
- In each pass, the candidate set is generated by a self-join of the patterns found in the previous pass
- In the k-th pass, a sequence pattern is accepted only if all its (k-1)-subpatterns were accepted in the (k-1)-th pass
Here is the R source code of the main algorithm (the helper functions, such as CreateInitalPrefixTree and ComputeSupportCount, are assumed to be defined elsewhere):
GSP <- function(d, I, MIN_SUP) {
  f <- NULL
  c <- list()
  c[[1]] <- CreateInitalPrefixTree(NULL)
  len4I <- GetLength(I)
  for (idx in 1:len4I) {
    SetSupportCount(I[idx], 0)
    AddChild2Node(c[[1]], I[idx], NULL)
  }
  k <- 1
  while (!IsEmpty(c[[k]])) {
    ComputeSupportCount(c[[k]], d)
    while (TRUE) {
      r <- GetLeaf(c[[k]])
      if (is.null(r)) {
        break
      }
      if (GetSupport(r) >= MIN_SUP) {
        AddFrequentItemset(f, r, GetSupport(r))
      } else {
        RemoveLeaf(c[[k]], r)
      }
    }
    c[[k + 1]] <- ExtendPrefixTree(c[[k]])
    k <- k + 1
  }
  f
}
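The self-join in the candidate-generation pass can be illustrated on its own. The sketch below is not from the book's code; it assumes single-item sequence elements and uses the standard GSP join condition: sequence a joins sequence b when dropping a's first item yields the same sequence as dropping b's last item.

```r
# Level-wise candidate generation by self-join of frequent (k-1)-sequences
gsp_join <- function(freq) {
  candidates <- list()
  for (a in freq) {
    for (b in freq) {
      # a joins b if a without its first item equals b without its last item
      if (identical(a[-1], b[-length(b)])) {
        candidates[[length(candidates) + 1]] <- c(a, b[length(b)])
      }
    }
  }
  candidates
}

# Frequent 2-sequences -> candidate 3-sequences
f2 <- list(c("a", "b"), c("b", "c"), c("a", "c"))
gsp_join(f2)  # a single candidate: <a, b, c>
```

Only `<a, b>` joined with `<b, c>` satisfies the condition, so `<a, b, c>` is the sole 3-sequence candidate; the k-th pass then counts its support and, per the downward-closure property, accepts it only if all of its 2-subsequences were frequent.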
The SPADE algorithm
Sequential Pattern Discovery using Equivalence classes (SPADE) is a vertical sequence-mining algorithm applied to sequence patterns; it takes a depth-first approach. Here are the features of the SPADE algorithm:
- SPADE is an extension of the A-Priori algorithm
- It uses the A-Priori property
- Multiple passes over the initial transaction dataset are required
- The vertical data format is used
- It uses a simple join operation
- All sequences are found in three dataset passes
The R implementation
Here is the R source code of the main algorithm:
SPADE <- function(p, f, k, MIN_SUP) {
  len4p <- GetLength(p)
  for (idx in 1:len4p) {
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for (jdx in 1:len4p) {
      xab <- CreateTidSets(p[[idx]], p[[jdx]], k)
      if (GetSupport(xab) >= MIN_SUP) {
        AddFrequentTidSets(pa, xab)
      }
    }
    if (!IsEmptyTidSets(pa)) {
      SPADE(pa, f, k + 1, MIN_SUP)
    }
  }
}
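The vertical data format that SPADE relies on can be shown in isolation. The following base R sketch is illustrative (the items, sequence IDs, and event IDs are invented): each item carries an ID-list of (sid, eid) pairs recording where it occurs, and joining two ID-lists yields the support of a 2-sequence without rescanning the dataset.

```r
# Vertical format: item -> ID-list of (sequence ID, event/position ID) pairs
idlist <- list(
  a = data.frame(sid = c(1, 1, 2), eid = c(1, 3, 1)),
  b = data.frame(sid = c(1, 2, 3), eid = c(2, 2, 1))
)

# Temporal join: sequence IDs in which item x occurs strictly before item y
temporal_join <- function(x, y) {
  hits <- merge(idlist[[x]], idlist[[y]], by = "sid", suffixes = c(".x", ".y"))
  unique(hits$sid[hits$eid.x < hits$eid.y])
}

sids <- temporal_join("a", "b")
length(sids)  # 2: the support count of the 2-sequence <a, b>
```

Because the join works purely on the in-memory ID-lists, longer patterns are grown by joining the ID-lists of their prefixes, which is what allows SPADE to finish in a small, fixed number of passes over the raw data.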
Along with the growth in dataset sizes, there is a steady demand for high-performance association/pattern mining algorithms.
With the introduction of Hadoop and other MapReduce-like platforms, there is a chance to meet this demand; I will discuss this further in the upcoming chapters. Depending on the size of the dataset, some algorithms must be revised and adjusted. For example, a recursive algorithm will eventually run out of call-stack space and might present a challenge when being converted to MapReduce.
To enhance your knowledge of this chapter, here are some practice questions that will help you understand the concepts better:
- Write an R program to find how many unique item names are contained in the given sample market basket transaction file. Map each item name to a unique integer ID. Find all the closed frequent itemsets, and then find all the maximal frequent itemsets and their support counts, setting the support count threshold to various values of your choice.
- Write an R program to implement the A-PrioriTid algorithm.
In this chapter, we looked at the following topics:
- Market basket analysis
- As the first step of association rule mining, finding frequent itemsets is the key factor. Along with the algorithm designs, closed itemsets and maximal frequent itemsets were defined too.
- As the target of association rule mining, association rules are mined using the support count and confidence measures. Correlation rules are mined using correlation formulae in addition to the support count.
- The monotonicity of frequent itemsets: if an itemset is frequent, then all its subsets are frequent too.
- The A-Priori algorithm, the first efficient algorithm to mine frequent patterns; many variants originated from it.
- Frequent patterns in sequences.
The next chapter will cover the basic classification algorithms, a major application of data mining, including ID3, C4.5, and CART.