R: Mining spatial, text, web, and social media data

By: Nathan H. Danneman, Richard Heimann, Pradeepta Mishra, Bater Makhabel

Overview of this book

Data mining is the first step to understanding data and making sense of heaps of data. Properly mined data forms the basis of all the data analysis and computing performed on it. This learning path will take you from the very basics of data mining to advanced data mining techniques, and will end with a specialized branch of data mining: social media mining. You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, associations, and correlations while working with R programs. You will discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on R Hadoop projects.

Once you are comfortable with data mining in R, you will move on to implementing your knowledge with the help of end-to-end data mining projects. You will learn how to apply different mining concepts to various statistical and data applications in a wide range of fields. At this stage, you will be able to complete complex data mining cases and handle any issues you might encounter during projects. After this, you will gain hands-on experience of generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially generated data, along with a theoretical background for accurately interpreting your findings. You will be shown R code and examples of data that can be used as a springboard as you undertake your own analyses of business, social, or political data.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Learning Data Mining with R by Bater Makhabel
  • R Data Mining Blueprints by Pradeepta Mishra
  • Social Media Mining with R by Nathan Danneman and Richard Heimann

Chapter 2. Mining Frequent Patterns, Associations, and Correlations

In this chapter, we will learn how to mine frequent patterns, association rules, and correlation rules when working with R programs. Then, we will evaluate all these methods with benchmark data to determine the interestingness of the frequent patterns and rules. We will cover the following topics in this chapter:

  • Introduction to associations and patterns
  • Market basket analysis
  • Hybrid association rules mining
  • Mining sequence datasets
  • High-performance algorithms

The algorithms that find frequent items from various data types can be applied to numeric or categorical data. Most of these algorithms share one common basic form, A-Priori. Another basic algorithm is FP-Growth, which solves the same problem with a pattern-growth strategy instead of candidate generation. Most pattern-related mining algorithms derive from these two basic algorithms.

With the frequent patterns found as one input, many algorithms are designed to find association and correlation rules; each is a variation on the basic algorithms.

As datasets from various domains grow in size and variety, new algorithms have been designed, such as the multistage algorithm, the multihash algorithm, and the limited-pass algorithm.

An overview of associations and patterns

One popular task in data mining is to find relations within a source dataset; it is based on searching for frequent patterns in various data sources, such as market baskets, graphs, and streams.

All the algorithms illustrated in this chapter are written from scratch in the R language to explain association analysis, and their output is verified with the standard R packages for these algorithms, such as arules.

Patterns and pattern discovery

Frequent pattern mining has applications across a broad range of fields and is often used to solve various problems, such as investigating the shopping behavior of a mall's customers from its transaction data.

Frequent patterns are those that occur often in the source dataset. The pattern types for frequent pattern mining can be itemsets, subsequences, or substructures. Accordingly, the frequent patterns found are known as:

  • Frequent itemset
  • Frequent subsequence
  • Frequent substructures

These three frequent patterns will be discussed in detail in the upcoming sections.

These newly found frequent patterns serve as an important starting point when searching for recurring, interesting rules or relationships in the given dataset.

Various pattern types have been proposed to improve the efficiency of mining a dataset. Some of them are as follows; they will be defined in detail later:

  • Closed patterns
  • Maximal patterns
  • Approximate patterns
  • Condensed patterns
  • Discriminative frequent patterns

The frequent itemset

The frequent itemset originated from market basket analysis. In a store such as Amazon, there are many orders or transactions; in each transaction, a customer's shopping cart contains some items. The combined transactions of all customers can be used by the store owner to find out which items are purchased together by customers. As a simple definition, an itemset denotes a collection of zero or more items.

We call a transaction a basket, and a set of items can belong to any basket. Let s be the support threshold, which is compared with the number of baskets in which a certain set of items appears; if that set of items appears in no fewer than s baskets, we call the itemset a frequent itemset.

An itemset is called a k-itemset if it contains k items, where k is a non-zero integer. The support count of an itemset $X$, written $\operatorname{support\_count}(X)$, is the number of transactions in the given dataset that contain $X$.

For a predefined minimum support threshold $s$, the itemset $X$ is a frequent itemset if $\operatorname{support\_count}(X) \ge s$. The minimum support threshold $s$ is a customizable parameter, which can be adjusted by domain experts or from experience.
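As a concrete illustration, here is a minimal base-R sketch that computes the support count of an itemset over a list of transactions; the helper name and the toy transactions are our own, not from the book:

# count the baskets (character vectors) that contain every item of X
support_count <- function(X, transactions) {
  sum(vapply(transactions, function(t) all(X %in% t), logical(1)))
}

transactions <- list(c("a", "b", "c"), c("a", "c"), c("b", "d"), c("a", "c", "d"))
support_count(c("a", "c"), transactions)       # 3
support_count(c("a", "c"), transactions) >= 3  # frequent at threshold s = 3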

The frequent itemset is also used in many domains. Some of them are shown in the following table:

 

Items                      Baskets                            Comments

Words                      Documents                          Related concepts

Documents                  Sentences                          Plagiarism

Biomarkers and diseases    The set of data about a patient    Biomarkers

If an itemset is frequent, then all of its subsets must be frequent. This is known as the A-Priori principle, which is the foundation of the A-Priori algorithm. A direct application of the A-Priori principle is to prune the huge number of candidate itemsets.

One important factor that affects the number of frequent itemsets is the minimum support count: the lower the minimum support count, the larger the number of frequent itemsets.

For the purpose of optimizing the frequent itemset-generation algorithm, some more concepts are proposed:

  • An itemset X is closed in dataset S if there exists no proper superset of X with the same support count in S; X is then called a closed itemset. If, in addition, X is frequent, then X is a closed frequent itemset.
  • An itemset X is a maximal frequent itemset if X is frequent and there exists no frequent superset of X; in other words, X does not have frequent supersets.
  • An itemset X is a constrained frequent itemset if it is frequent and satisfies the user-specified constraints.
  • An itemset X is an approximate frequent itemset if only approximate support counts are derived for it.
  • An itemset X is a top-k frequent itemset in dataset S if X is among the k most frequent itemsets, given a user-defined value k.
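Closed and maximal itemsets can be inspected directly with the arules package; the following is a hedged sketch using the Groceries dataset that ships with arules (the dataset and support threshold are our own choices, not the book's):

library(arules)
data("Groceries")
# mine all frequent itemsets at 1 percent support
fi <- apriori(Groceries, parameter = list(supp = 0.01, target = "frequent itemsets"))
# flag which mined itemsets are closed and which are maximal
table(closed = is.closed(fi), maximal = is.maximal(fi))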

The following example is a transaction dataset with ten transactions, T001 through T010; the list of items in each transaction, and the frequent itemsets that result, appear only as figures in the original text. Let's assume that the minimum support count is 3.

The frequent subsequence

A frequent sequence is an ordered list of elements, where each element contains at least one event. An example of this is the page-visit sequence on a website, that is, the order in which a certain user visits web pages. Here are two examples of the frequent subsequence:

  • Customer: The successive shopping records of a customer in a shopping mart serve as the sequence; each item bought serves as an event item, and all the items bought by the customer in one shopping trip are treated as an element or transaction
  • Web usage data: A user's visit history on the WWW is treated as a sequence; each UI/page serves as an event or item, and an element or transaction can be defined as the pages visited by the user with one click of the mouse

The length of a sequence is defined by the number of items contained in the sequence; a sequence of length k is called a k-sequence. The size of a sequence is defined by the number of itemsets in the sequence. We call a sequence $s = \langle s_1, s_2, \dots, s_m \rangle$ a subsequence of the sequence $t = \langle t_1, t_2, \dots, t_n \rangle$, and $t$ the supersequence of $s$, when there exist integers $1 \le j_1 < j_2 < \dots < j_m \le n$ such that $s_1 \subseteq t_{j_1}, s_2 \subseteq t_{j_2}, \dots, s_m \subseteq t_{j_m}$.
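Here is a small sketch of this containment test in base R, with a sequence represented as a list of character vectors (the representation and function name are our own):

is_subsequence <- function(s, t) {
  # s, t: lists of elements, each element a character vector of events
  j <- 1
  for (e in s) {
    # greedily find the next element of t that contains all events of e
    while (j <= length(t) && !all(e %in% t[[j]])) j <- j + 1
    if (j > length(t)) return(FALSE)
    j <- j + 1
  }
  TRUE
}

s <- list(c("a"), c("b", "c"))
t <- list(c("a", "d"), c("e"), c("b", "c", "f"))
is_subsequence(s, t)  # TRUE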

The frequent substructures

In some domains, the tasks under research can be modeled with graph theory. As a result, there are requirements for mining common subgraphs (subtrees or sublattices); some examples are as follows:

  • Web mining: Web pages are treated as the vertices of a graph, links between pages serve as the edges, and a user's page-visiting records construct the graph.
  • Network computing: Any device with computing ability on the network serves as a vertex, and the interconnections between these devices serve as the edges. The whole network made up of these devices and interconnections is treated as a graph.
  • Semantic web: XML elements serve as the vertices, and the parent/child relations between them are the edges; each XML file is treated as a graph.

A graph $G$ is represented by $G = (V, E)$, where $V$ represents the set of vertices and $E$ the set of edges. A graph $G' = (V', E')$ is called a subgraph of the graph $G = (V, E)$ once $V' \subseteq V$ and $E' \subseteq E$. The original figure shows an example: the original graph, with its vertices and edges, on the left-hand side, and a subgraph, with some edges omitted (or vertices omitted, in other circumstances), on the right-hand side.
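A minimal base-R sketch of this subgraph test, using our own representation (a graph as a list with a vertex vector V and a two-column edge matrix E):

is_subgraph <- function(g_sub, g) {
  edge_keys <- function(E) paste(E[, 1], E[, 2], sep = "-")
  all(g_sub$V %in% g$V) && all(edge_keys(g_sub$E) %in% edge_keys(g$E))
}

g     <- list(V = c("a", "b", "c"), E = rbind(c("a", "b"), c("b", "c"), c("a", "c")))
g_sub <- list(V = c("a", "b", "c"), E = rbind(c("a", "b"), c("b", "c")))
is_subgraph(g_sub, g)  # TRUE: same vertices, some edges omitted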

Relationship or rules discovery

Mining association rules is based on the frequent patterns found. Different emphases on the interestingness of relations derive two types of relations for further research: association rules and correlation rules.

Association rules

In a later section, a method for association analysis is illustrated; this is a useful method to discover interesting relationships within a huge dataset. The relations can be represented in the form of association rules or frequent itemsets.

Association rule mining finds the resulting rule set on a given dataset (a transaction dataset or another sequence-pattern-type dataset), a predefined minimum support count $s$, and a predefined minimum confidence $c$, such that any found rule $X \Rightarrow Y$ satisfies $\operatorname{support}(X \Rightarrow Y) \ge s$ and $\operatorname{confidence}(X \Rightarrow Y) \ge c$.

$X \Rightarrow Y$ is an association rule, where $X, Y \subset I$ and $X \cap Y = \emptyset$; that is, X and Y are disjoint itemsets. The interestingness of this rule is measured by its support and confidence. Support means the frequency with which this rule appears in the dataset, and confidence means the probability of the appearance of Y when X is present.

For association rules, the key measures of rule interestingness are rule support and confidence. Their relationship is given as follows:

$$\operatorname{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{\operatorname{support\_count}(X \cup Y)}{\operatorname{support\_count}(X)}$$

$\operatorname{support\_count}(X)$ is the number of itemsets (transactions) in the dataset that contain $X$. As a convention, the confidence and support values are represented as percentages between 0 and 100.

The association rule $X \Rightarrow Y$ is strong once $\operatorname{support}(X \Rightarrow Y) \ge s$ and $\operatorname{confidence}(X \Rightarrow Y) \ge c$, where $s$ is the predefined minimum support threshold and $c$ is the predefined minimum confidence threshold.
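Continuing the earlier base-R sketch, the support and confidence of a rule, and the strength test above, can be computed directly (the helper names are again our own):

# assumes support_count() from the earlier sketch
confidence <- function(X, Y, transactions) {
  support_count(union(X, Y), transactions) / support_count(X, transactions)
}
is_strong <- function(X, Y, transactions, s, c) {
  support_count(union(X, Y), transactions) >= s &&
    confidence(X, Y, transactions) >= c
}

transactions <- list(c("a", "b", "c"), c("a", "c"), c("b", "d"), c("a", "c", "d"))
confidence(c("a"), c("c"), transactions)  # 1: every basket with "a" also has "c"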

The meaning of the found association rules should be explained with caution, especially when there is not enough evidence to judge whether a rule implies causality; a rule only shows the co-occurrence of the prefix and postfix of the rule. The following are the different kinds of rules you can come across:

  • A rule is a Boolean association rule if it concerns associations between the presence of items
  • A rule is a single-dimensional association rule if at most one dimension is referred to in the rule
  • A rule is a multidimensional association rule if at least two dimensions are referred to in the rule
  • A rule is a correlation-association rule if the relations are measured by statistical correlation; once the correlation test is passed, it leads to a correlation rule
  • A rule is a quantitative-association rule if at least one item or attribute contained in it is quantitative

Correlation rules

In some situations, the support and confidence pair is not sufficient to filter out uninteresting association rules. In such cases, we use support count, confidence, and correlations to filter association rules.

There are many methods to calculate the correlation of an association rule, such as $\chi^2$ analysis, all-confidence analysis, and cosine. For a k-itemset $X = \{i_1, i_2, \dots, i_k\}$, the all-confidence value of $X$ is defined as:

$$\operatorname{all\_confidence}(X) = \frac{\operatorname{support}(X)}{\max\{\operatorname{support}(i_j) \mid i_j \in X\}}$$
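In practice, such measures are available in the arules package through interestMeasure(); here is a hedged usage sketch (the dataset and threshold are our own choices):

library(arules)
data("Groceries")
fi <- eclat(Groceries, parameter = list(supp = 0.05))
# attach the all-confidence measure to the mined itemsets
quality(fi)$allConfidence <- interestMeasure(fi, measure = "allConfidence",
                                             transactions = Groceries)
inspect(sort(fi, by = "allConfidence")[1:5])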

Market basket analysis

Market basket analysis is the methodology used to mine the items bought by customers, or just those kept in their shopping carts. The concept is applicable to a variety of applications, especially for store operations. The source dataset is a massive data record. The aim of market basket analysis is to find the association rules between the items within the source dataset.

The market basket model

The market basket model is a model that illustrates the relation between a basket and its associated items. Many tasks from different areas of research have this relation in common. To summarize them all, the market basket model is suggested as the most typical example to be researched.

The basket is also known as the transaction set; it contains itemsets, that is, sets of items belonging to the same transaction.

The A-Priori algorithm is a level-wise itemset mining algorithm. The Eclat algorithm is an itemset mining algorithm based on tidset intersection, in contrast to A-Priori. FP-growth is a frequent pattern tree based algorithm. A tidset denotes a collection of zero or more IDs of transaction records.

A-Priori algorithms

As a common strategy to design algorithms, the problem is divided into two subproblems:

  • The frequent itemset generation
  • Rule generation

The strategy dramatically decreases the search space for association mining algorithms.

Input data characteristics and data structure

As the input of the A-Priori algorithm, the original input itemset is binarized, that is, 1 represents the presence of a certain item in the itemset; otherwise, it is 0. As a default assumption, the average size of the itemset is small. The popular preprocessing method is to map each unique available item in the input dataset to a unique integer ID.

The itemsets are usually stored within databases or files, and the algorithm makes several passes over them. To control the efficiency of the algorithm, we need to control the number of passes. During each pass, a representation format is required for each itemset of interest, so that it can be counted and stored for further use by the algorithm.

There is a monotonicity feature in the itemsets under research; this implies that every subset of a frequent itemset is frequent. This characteristic is used to prune the search space for the frequent itemset in the process of the A-Priori algorithm. It also helps compact the information related to the frequent itemset. This feature gives us an intrinsic view that focuses on smaller-sized frequent itemsets. For example, there are three frequent 2-itemsets contained by one certain frequent 3-itemset.

Tip

When we talk about a k-itemset, we mean an itemset containing k items.

A basket is in the so-called horizontal format when it is stored as a basket or transaction ID together with a number of items; this is used as the basic input format for the A-Priori algorithm. In contrast, there is another format known as the vertical format, which uses an item ID together with a series of transaction IDs. The algorithm that works on the vertical data format is left as an exercise for you.
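As a small illustration, here is a base-R sketch that converts the horizontal format into the vertical format; the toy data is our own:

horizontal <- list(
  T1 = c("beer", "chips", "wine"),
  T2 = c("beer", "chips"),
  T3 = c("pizza", "wine"),
  T4 = c("chips", "pizza")
)
# for each item, collect the IDs of the transactions that contain it
vertical <- split(rep(names(horizontal), lengths(horizontal)),
                  unlist(horizontal))
vertical$chips  # "T1" "T2" "T4"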

The A-Priori algorithm

Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.

Note

One important assumption is that the items within any itemset are in a lexicographic order.

  • Join action: Given that $L_k$ is the set of frequent k-itemsets, a set of candidate (k+1)-itemsets, called $C_{k+1}$, is generated by joining $L_k$ with itself: two frequent k-itemsets are joined if their first k-1 items are identical under the lexicographic order.
  • Prune action: The size of $C_{k+1}$, the candidate itemset, is usually much bigger than that of $L_{k+1}$; to save computation cost, the monotonicity characteristic of frequent itemsets is used here to prune the size of $C_{k+1}$: any candidate with an infrequent k-subset, that is, a k-subset not in $L_k$, is removed.

Here is the pseudocode to find all the frequent itemsets:

L_1 = {frequent 1-itemsets found in the first scan of D}
for (k = 2; L_{k-1} is not empty; k = k + 1) {
    C_k = candidates generated from L_{k-1} by the join and prune actions
    for each transaction t in D {
        increment the count of every candidate in C_k that is contained in t
    }
    L_k = {c in C_k | support_count(c) >= MIN_SUP}
}
return L = the union of all L_k

The R implementation

The R code of the A-Priori frequent itemset generation algorithm goes here. D is a transaction dataset. Suppose MIN_SUP is the minimum support count threshold. The output of the algorithm is L, the set of all frequent itemsets in D.

The output of the A-Priori function can be verified with the R add-on package arules, a pattern-mining and association-rules-mining package that includes the A-Priori and Eclat algorithms. Here is the R code:

Apriori <- function (data, I, MIN_SUP, parameter = NULL){
  f <- CreateItemsets()                          # f[[k]] holds the candidate k-itemsets
  c <- FindFrequentItemset(data, I, 1, MIN_SUP)  # c[[k]] holds the frequent k-itemsets
  k <- 2
  len4data <- GetDatasetSize(data)
  while( !IsEmpty(c[[k-1]]) ){
        f[[k]] <- AprioriGen(c[[k-1]])           # join and prune the (k-1)-itemsets
         for( idx in 1:len4data ){
             ft <- GetSubSet(f[[k]], data[[idx]])  # candidates contained in this transaction
             len4ft <- GetDatasetSize(ft)
             for( jdx in 1:len4ft ){
                IncreaseSupportCount(f[[k]], ft[jdx])
             }
         }
         c[[k]] <- FindFrequentItemset(f[[k]], I, k, MIN_SUP)  # keep candidates meeting MIN_SUP
         k <- k+1
  }
  c
}

To verify the R code, the arules package is applied to check the output.

Tip

arules (Hahsler et al., 2011) provides support for mining frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules. The A-Priori and Eclat algorithms are both available. Also, cSPADE can be found in arulesSequences, the add-on package for arules.
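For instance, a minimal arules session looks as follows; the Groceries dataset ships with the package, and the support threshold is our own choice:

library(arules)
data("Groceries")
fi <- apriori(Groceries,
              parameter = list(supp = 0.01, target = "frequent itemsets"))
inspect(sort(fi, by = "support")[1:5])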

Given a transaction dataset D (its concrete itemsets appear only as figures in the original text), we first sort each itemset in D into an ordered list, using a predefined order or simply the natural order of the characters, which is used here.

Let's assume that the minimum support count is 5. The input dataset consists of ten transactions, T001 through T010; the list of items in each transaction appears only as figures in the original text.

In the first scan, or pass, of the dataset D, we get the count of each candidate 1-itemset in $C_1$; the seven candidates (their items are shown in the original figures) have the support counts 6, 8, 2, 5, 2, 3, and 3.

After comparing each support count with the minimum support count, we get the frequent 1-itemset set $L_1$, which contains three itemsets with the support counts 6, 8, and 5.

We then generate the candidate 2-itemset set $C_2$ by joining $L_1$ with itself; the three resulting candidates have the support counts 4, 3, and 4.

After comparing these support counts with the minimum support count of 5, no candidate remains frequent, so $L_2$ is empty. The algorithm then terminates.

A-Priori algorithm variants

The various variants of the A-Priori algorithm are designed mainly for efficiency and scalability. Some of these improvements are discussed in the upcoming sections.

The Eclat algorithm

The A-Priori algorithm scans the dataset as many times as the maximum length of a frequent pattern. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, for example, using <item id, tid set> instead of <tid, item id set>; that is, the input data, such as a sample market basket file, is in the vertical format when discovering frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm, to get frequent (k+1)-itemsets from k-itemsets.

The candidate itemsets are generated by set intersection. The vertical format structure is called a tidset, as defined earlier: if all the transaction IDs related to the item I are stored in a vertical-format transaction itemset, then that itemset is the tidset of the specific item.

The support count is computed by the intersection between tidsets. Given two itemsets X and Y with tidsets $t(X)$ and $t(Y)$, the support count of $X \cup Y$ is the cardinality of $t(X) \cap t(Y)$. The algorithm recursively intersects tidsets within each equivalence class, as the following R implementation shows.

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat  <- function (p, f, MIN_SUP){
  # p: list of <itemset, tidset> pairs; f: the frequent itemsets found so far
  len4tidsets <- length(p)
  for(idx in 1:len4tidsets){
     AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
     pa <- GetFrequentTidSets(NULL, MIN_SUP)     # new equivalence class with prefix p[[idx]]
     for(jdx in idx:len4tidsets){
         if(ItemCompare(p[[jdx]], p[[idx]]) > 0){
             xab <- MergeTidSets(p[[idx]], p[[jdx]])  # intersect the two tidsets
             if(GetSupport(xab) >= MIN_SUP){
                AddFrequentItemset(pa, xab, GetSupport(xab))
             }
         }
     }
     if(!IsEmptyTidSets(pa)){
       Eclat(pa, f, MIN_SUP)                     # recurse on the new class
     }
  }
}

Here is the running result of one example, I = {beer, chips, pizza, wine}. The transaction dataset, in the horizontal and vertical formats respectively, is shown in the following tables:

tid   X
1     {beer, chips, wine}
2     {beer, chips}
3     {pizza, wine}
4     {chips, pizza}

x       tidset
beer    {1, 2}
chips   {1, 2, 4}
pizza   {3, 4}
wine    {1, 3}

The binary format of this information is in the following table:

tid   beer   chips   pizza   wine
1     1      1       0       1
2     1      1       0       0
3     0      0       1       1
4     0      1       1       0

Before calling the Eclat algorithm, we will set MIN_SUP = 2 and initialize p with the frequent 1-itemsets and their tidsets: <beer, {1,2}>, <chips, {1,2,4}>, <pizza, {3,4}>, and <wine, {1,3}>.

The running process is illustrated in the original figure. After two iterations, we will get all the frequent itemsets with their tidsets: {<beer, {1,2}>, <chips, {1,2,4}>, <pizza, {3,4}>, <wine, {1,3}>, <{beer, chips}, {1,2}>}.

The output of the Eclat function can be verified with the R add-on package, arules.
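A hedged arules sketch that reproduces this example with the eclat() function; a relative support of 0.5 corresponds to the absolute count 2 over the 4 transactions:

library(arules)
trans <- as(list(
  T1 = c("beer", "chips", "wine"),
  T2 = c("beer", "chips"),
  T3 = c("pizza", "wine"),
  T4 = c("chips", "pizza")
), "transactions")
freq <- eclat(trans, parameter = list(supp = 0.5))
inspect(freq)  # the five frequent itemsets listed above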

The FP-growth algorithm

The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of candidate itemsets is not needed here; a pattern-growth strategy is used instead. The FP-tree is its core data structure.

Input data characteristics and data structure

The data structure used is a hybrid of the vertical and horizontal formats; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree (FP-tree). Here is an example of the generation of this structure for I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the upcoming images. Each node in the FP-tree represents an item, and the path from the root to a node represents an itemset; each node also stores the support count of the itemset given by its path.

tid   X
1     {A, B, C, D, E}
2     {A, B, C, E}
3     {A, D, E}
4     {B, E, D}
5     {B, E, C}
6     {E, C, D}
7     {E, D}

The sorted item order is listed in the following table:

item            E   D   C   B   A
support_count   7   5   4   4   3

Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:

tid   X
1     {E, D, C, B, A}
2     {E, C, B, A}
3     {E, D, A}
4     {E, D, B}
5     {E, C, B}
6     {E, D, C}
7     {E, D}

The FP-tree building process is illustrated in the following images: each itemset is added to the FP-tree in turn, and the support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.
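A minimal base-R sketch of the insertion step described above, using nested lists as the node structure (the representation and function names are our own, and the header table is omitted for brevity):

new_node <- function(item, count = 0L) {
  list(item = item, count = count, children = list())
}

# insert one support-ordered itemset into the tree, incrementing counts
insert_itemset <- function(tree, items) {
  if (length(items) == 0) return(tree)
  head_item <- items[[1]]
  if (is.null(tree$children[[head_item]])) {
    tree$children[[head_item]] <- new_node(head_item)
  }
  tree$children[[head_item]]$count <- tree$children[[head_item]]$count + 1L
  tree$children[[head_item]] <- insert_itemset(tree$children[[head_item]], items[-1])
  tree
}

# the reordered transactions of the running example
transactions <- list(
  c("E", "D", "C", "B", "A"), c("E", "C", "B", "A"), c("E", "D", "A"),
  c("E", "D", "B"), c("E", "C", "B"), c("E", "D", "C"), c("E", "D")
)
root <- new_node("root")
for (t in transactions) root <- insert_itemset(root, t)
root$children$E$count              # 7
root$children$E$children$D$count   # 5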

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

(Figure: the itemsets inserted into the FP-tree, step by step.)

The results after performing steps 4 and 7 are listed here; the process of the algorithm is simple and straightforward:

(Figure: the FP-tree after steps 4 and 7.)

A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.

(Figure: the frequent pattern tree together with its header table.)

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent patterns or itemsets. The items are removed from the frequent pattern tree in reverse order, that is, starting from the leaves; the order here is therefore A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode, defined recursively; the input values are the (projected) FP-tree r, the current prefix itemset p, the result set f of frequent itemsets found so far, and the minimum support count MIN_SUP, matching the parameters of the R implementation that follows.

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth  <- function (r, p, f, MIN_SUP){
    RemoveInfrequentItems(r)
    if(IsPath(r)){
       # the tree is a single path: every sub-path combined with the
       # prefix p is a frequent itemset
       y <- GetSubset(r)
       len4y <- GetLength(y)
       for(idx in 1:len4y){
          x <- MergeSet(p, y[idx])
          SetSupportCount(x, GetMinCnt(x))   # support = minimum count on the path
          Add2Set(f, x, support_count(x))
       }
  }else{
      len4r <- GetLength(r)
      for(idx in 1:len4r){
         x <- MergeSet(p, r[idx])
         SetSupportCount(x, GetSupportCount(r[idx]))
         # build the projected FP-tree conditioned on the item r[idx]
         rx <- CreateProjectedFPTree()
         path4idx <- GetAllSubPath(PathFromRoot(r, idx))
         len4path <- GetLength(path4idx)
         for( jdx in 1:len4path ){
           CountCntOnPath(r, idx, path4idx, jdx)
           InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
         }
         if( !IsEmpty(rx) ){
           FPGrowth(rx, x, f, MIN_SUP)        # recurse on the projected tree
         }
      }
  }
}

The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemsets (MFI), to which the maximality properties are applied; that is, extra steps are added to check for maximal frequent itemsets instead of only frequent itemsets. The algorithm is based partially on the tidset intersection of the Eclat algorithm. The diffset, or differential set as it is also known, is used for fast frequency testing; it is the difference between the tidsets of two corresponding items.

The candidate MFI set is maintained according to its definition: assuming M is the current set of MFIs, if there is an X in M that is a superset of a newly found frequent itemset Y, then Y is discarded; however, if X is a subset of Y, then X should be removed from M. A minimal sketch of this update follows.
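Here is a base-R sketch of that update rule; itemsets are character vectors, and the function name is our own:

update_mfi <- function(M, Y) {
  # discard Y if some maximal itemset already contains it
  if (any(vapply(M, function(X) all(Y %in% X), logical(1)))) return(M)
  # otherwise drop the members of M that Y contains, then add Y
  M <- M[!vapply(M, function(X) all(X %in% Y), logical(1))]
  c(M, list(Y))
}

M <- list(c("a", "b", "c"))
update_mfi(M, c("a", "b"))            # unchanged: {a, b} is covered
update_mfi(M, c("a", "b", "c", "d"))  # {a, b, c} is replaced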

Before calling the GenMax algorithm, m is initialized to the empty set, and p is initialized with the frequent 1-itemsets and their tidsets computed from the input transaction dataset D; the pseudocode of the main routine corresponds to the R implementation below.

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax  <- function (p, m, MIN_SUP){
  # prune the whole branch if the union of the class is already covered
  y <- GetItemsetUnion(p)
  if( SuperSetExists(m, y) ){
    return()
  }
  len4p <- GetLength(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
       xij <- MergeTidSets(p[[idx]], p[[jdx]])   # intersect the tidsets
         if(GetSupport(xij) >= MIN_SUP){
            AddFrequentItemset(q, xij, GetSupport(xij))
          }
     }
     if( !IsEmpty(q) ){
       GenMax(q, m, MIN_SUP)
     }else if( !SuperSetExists(m, p[[idx]]) ){
       Add2MFI(m, p[[idx]])                      # p[[idx]] is maximal
     }
   }
}

The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets give the longest frequent patterns among those with the same support, which allows us to prune the frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
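Closed frequent itemsets can also be mined directly with arules for verification; a hedged sketch (the dataset and threshold are our own choices):

library(arules)
data("Groceries")
cfi <- apriori(Groceries,
               parameter = list(supp = 0.01,
                                target = "closed frequent itemsets"))
inspect(sort(cfi, by = "support")[1:5])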

Before calling the Charm algorithm, c is initialized to the empty set, and p is initialized with the frequent 1-itemsets and their tidsets computed from the input transaction dataset D; the pseudocode of the main routine corresponds to the R implementation below.

The R implementation

Here is the R source code of the main algorithm:

Charm  <- function (p, c, MIN_SUP){
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
        xij <- MergeTidSets(p[[idx]], p[[jdx]])  # intersect the tidsets
        if(GetSupport(xij) >= MIN_SUP){
          if( IsSameTidSets(p, idx, jdx) ){
            # identical tidsets: the two itemsets always occur together
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
            RemoveTidSet(p, jdx)
          }else{
             if( IsSuperSet(p[[idx]], p[[jdx]]) ){
               ReplaceTidSetBy(p, idx, xij)
               ReplaceTidSetBy(q, idx, xij)
             }else{
               Add2CFI(q, xij)
             }
          }
        }
     }
     if( !IsEmpty(q) ){
       Charm(q, c, MIN_SUP)
     }
     if( !IsSuperSetExists(c, p[[idx]]) ){
        Add2CFI(c, p[[idx]])                     # p[[idx]] is closed
     }
  }
}

The algorithm to generate association rules

During the A-Priori frequent itemset generation process, the support count of each frequent itemset is calculated and recorded for the subsequent association rule mining.

To generate the association rules $X \Rightarrow Y$ with $X \cup Y = l$, where l is a frequent itemset, two steps are needed (a minimal sketch follows this list):

  • First, get all the nonempty proper subsets of l.
  • Then, for each such subset X of l, the rule $X \Rightarrow (l - X)$ is a strong association rule only if $\operatorname{support\_count}(l) / \operatorname{support\_count}(X) \ge c$, the minimum confidence threshold. The support count of any rule generated from a frequent itemset is automatically not less than the minimum support count.
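Here is a base-R sketch of these two steps; support_count() is the helper from the earlier sketch, and gen_rules() is our own name:

gen_rules <- function(l, transactions, min_conf) {
  stopifnot(length(l) >= 2)   # a rule needs a nonempty lhs and rhs
  rules <- list()
  for (k in 1:(length(l) - 1)) {
    for (X in combn(l, k, simplify = FALSE)) {   # nonempty proper subsets of l
      conf <- support_count(l, transactions) / support_count(X, transactions)
      if (conf >= min_conf) {
        rules[[length(rules) + 1]] <-
          list(lhs = X, rhs = setdiff(l, X), confidence = conf)
      }
    }
  }
  rules
}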


The R implementation

The R code of the algorithm to generate the A-Priori association rules is as follows:

AprioriGenerateRules  <- function (D, F, MIN_SUP, MIN_CONF){
  # create an empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in 1:len4f){
     # add the rule F[[idx]] => {}
     AddRule2RuleSets(r, F[[idx]], NULL)
     c <- list()
     c[[1]] <- CreateItemSets(F[[idx]])
     h <- list()
     k <- 1
     while( !IsEmptyItemSets(c[[k]]) ){
       # get the heads of the confident association rules in c[[k]]
       h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
       c[[k+1]] <- CreateItemSets()

       # get the candidate heads
       len4hk <- length(h[[k]])
       for(jdx in 1:(len4hk-1)){
          if( Match4Itemsets(h[[k]][jdx], h[[k]][jdx+1]) ){
             tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx+1][k])
             if( IsSubset2Itemsets(h[[k]], tempItemset) ){
               Append2ItemSets(c[[k+1]], tempItemset)
             }
          }
       }
       k <- k+1
     }
     # append all new association rules to the rule set
     AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}

To verify the R code, the arules and Rattle packages are applied to check the output.

Tip

arules (Hahsler et al., 2011) and the Rattle package provide support for association rule analysis. arulesViz is used to visualize the mined association rules.

The market basket model

The market basket model is a model that illustrates the relation between a basket and its associated items. Many tasks from different areas of research have this relation in common. To summarize them all, the market basket model is suggested as the most typical example to be researched.

The basket is also known as the transaction set; this contains the itemsets that are sets of items belonging to same itemset.

The A-Priori algorithm is a level wise, itemset mining algorithm. The Eclat algorithm is a tidset intersection itemset mining algorithm based on tidset intersection in contrast to A-Priori. FP-growth is a frequent pattern tree algorithm. The tidset denotes a collection of zeros or IDs of transaction records.

A-Priori algorithms

As a common strategy to design algorithms, the problem is divided into two subproblems:

  • The frequent itemset generation
  • Rule generation

The strategy dramatically decreases the search space for association mining algorithms.

Input data characteristics and data structure

As the input of the A-Priori algorithm, the original input itemset is binarized, that is, 1 represents the presence of a certain item in the itemset; otherwise, it is 0. As a default assumption, the average size of the itemset is small. The popular preprocessing method is to map each unique available item in the input dataset to a unique integer ID.

The itemsets are usually stored within databases or files and will go through several passes. To control the efficiency of the algorithm, we need to control the count of passes. During the process when itemsets pass through other itemsets, the representation format for each itemset you are interested in is required to count and store for further usage of the algorithm.

There is a monotonicity feature in the itemsets under research; this implies that every subset of a frequent itemset is frequent. This characteristic is used to prune the search space for the frequent itemset in the process of the A-Priori algorithm. It also helps compact the information related to the frequent itemset. This feature gives us an intrinsic view that focuses on smaller-sized frequent itemsets. For example, there are three frequent 2-itemsets contained by one certain frequent 3-itemset.

Tip

When we talk about k-itemsets means an itemset containing k items.

The basket is in a format called the horizontal format and contains a basket or transaction ID and a number of items; it is used as the basic input format for the A-Priori algorithm. In contrast, there is another format known as the vertical format; this uses an item ID and a series of the transaction IDs. The algorithm that works on vertical data format is left as an exercise for you.

The A-Priori algorithm

Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.

Note

One important assumption is that the items within any itemset are in a lexicographic order.

  • Join action: Given that The A-Priori algorithm is the set of frequent k-itemsets, a set of candidates to find The A-Priori algorithm is generated. Let's call it The A-Priori algorithm.
    The A-Priori algorithm
  • Prune action: The A-Priori algorithm, the size of The A-Priori algorithm, the candidate itemset, is usually much bigger than The A-Priori algorithm, to save computation cost; monotonicity characteristic of frequent itemset is used here to prune the size of The A-Priori algorithm.
    The A-Priori algorithm

Here is the pseudocode to find all the frequent itemsets:

The A-Priori algorithm
The A-Priori algorithm
The A-Priori algorithm

The R implementation

R code of the A-Priori frequent itemset generation algorithm goes here. D is a transaction dataset. Suppose MIN_SUP is the minimum support count threshold. The output of the algorithm is L, which is a frequent itemsets in D.

The output of the A-Priori function can be verified with the R add-on package, arules, which is a pattern-mining and association-rules-mining package that includes A-Priori and éclat algorithms. Here is the R code:

Apriori <- function (data, I, MIN_SUP, parameter = NULL){
  f <- CreateItemsets()
  c <- FindFrequentItemset(data,I,1, MIN_SUP)
  k <- 2
  len4data <- GetDatasetSize(data)
  while( !IsEmpty(c[[k-1]]) ){
        f[[k]] <- AprioriGen(c[k-1])
         for( idx in 1: len4data ){
             ft <- GetSubSet(f[[k]],data[[idx]])
             len4ft <- GetDatasetSize(ft)
             for( jdx in 1:len4ft ){
                IncreaseSupportCount(f[[k]],ft[jdx])
             }
         }
         c[[k]] <- FindFrequentItemset(f[[k]],I,k,MIN_SUP)
         k <- k+1
  }
  c
}

To verify the R code, the arules package is applied while verifying the output.

Tip

Arules (Hahsler et al., 2011) provides the support to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules too. A-Priori and Eclat algorithms are both available. Also cSPADE can be found in arulesSequence, the add-on for arules.

Given:

The R implementation

At first, we will sort D into an ordered list in a predefined order algorithm or simply the natural order of characters, which is used here. Then:

The R implementation

Let's assume that the minimum support count is 5; the following table is an input dataset:

tid (transaction id)

List of items in the itemset or transaction

T001

The R implementation

T002

The R implementation

T003

The R implementation

T004

The R implementation

T005

The R implementation

T006

The R implementation

T007

The R implementation

T008

The R implementation

T009

The R implementation

T010

The R implementation

In the first scan or pass of the dataset D, get the count of each candidate itemset The R implementation. The candidate itemset and its related count:

Itemset

Support count

The R implementation

6

The R implementation

8

The R implementation

2

The R implementation

5

The R implementation

2

The R implementation

3

The R implementation

3

We will get the The R implementation after comparing the support count with minimum support count.

Itemset

Support count

The R implementation

6

The R implementation

8

The R implementation

5

We will generate The R implementation by The R implementation, The R implementation.

Itemset

Support count

The R implementation

4

The R implementation

3

The R implementation

4

After comparing the support count with the minimum support count, we will get The R implementation. The algorithm then terminates.

A-Priori algorithm variants

The various variants of A-Priori algorithms are designed mainly for the purpose of efficiency and scalability. Some of the improvements of the A-Priori algorithms are discussed in the upcoming sections.

The Eclat algorithm

The A-Priori algorithm loops as many times as the maximum length of the pattern somewhere. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, for example, using <item id, tid set> instead of <tid, item id set> that is, with the input data in the vertical format in the sample market basket file, or to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1) itemsets from k itemsets.

The candidate itemset is generated by set intersection. The vertical format structure is called a tidset as defined earlier. If all the transaction IDs related to the item I are stored in a vertical format transaction itemset, then the itemset is the tidset of the specific item.

The support count is computed by the intersection between tidsets. Given two tidsets, X and Y, The Eclat algorithm is the cardinality of The Eclat algorithm. The pseudocode is The Eclat algorithm, The Eclat algorithm.

The Eclat algorithm

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat  <- function (p,f,MIN_SUP){
  len4tidsets <- length(p)
  for(idx in 1:len4tidsets){
     AddFrequentItemset(f,p[[idx]],GetSupport(p[[idx]]))
     Pa <- GetFrequentTidSets(NULL,MIN_SUP)
       for(jdx in idx:len4tidsets){
         if(ItemCompare(p[[jdx]],p[[idx]]) > 0){
             xab <- MergeTidSets(p[[idx]],p[[jdx]])
             if(GetSupport(xab)>=MIN_SUP){
                AddFrequentItemset(pa,xab,
                GetSupport(xab))
               }
           }
     }
     if(!IsEmptyTidSets(pa)){
       Eclat(pa,f,MIN_SUP)
    }
  }
}

Here is the running result of one example, I = {beer, chips, pizza, wine}. The transaction dataset with horizontal and vertical formats, respectively, are shown in the following table:

tid

X

1

{beer, chips, wine}

2

{beer, chips}

3

{pizza, wine}

4

{chips, pizza}

x

tidset

beer

{1,2}

chips

{1,2,4}

pizza

{3,4}

wine

{1,3}

The binary format of this information is in the following table.

tid

beer

chips

pizza

wine

1

1

1

0

1

2

1

1

0

0

3

0

0

1

1

4

0

1

1

0

Before calling the Eclat algorithm, we will set MIN_SUP=2, The R implementation,

The R implementation

The running process is illustrated in the following figure. After two iterations, we will get frequent tidsets, {beer, 12 >, < chips, 124>, <pizza, 34>, <wine, 13>, < {beer, chips}, 12>}:

The R implementation

The output of the Eclat function can be verified with the R add-on package, arules.

The FP-growth algorithm

The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.

Input data characteristics and data structure

The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.

tid

X

1

{A, B, C, D, E}

2

{A, B, C, E}

3

{A, D, E}

4

{B, E, D}

5

{B, E, C}

6

{E, C, D}

7

{E, D}

The sorted item order is listed in the following table:

item

E

D

C

B

A

support_count

7

5

4

4

3

Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:

tid

X

1

{E, D, C, B, A}

2

{E, C, B, A}

3

{E, D, A}

4

{E, D, B}

5

{E, C, B}

6

{E, D, C}

7

{E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

Input data characteristics and data structure

The result after performing steps 4 and 7 are listed here, and the process of the algorithm is very simple and straight forward:

Input data characteristics and data structure

A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.

Input data characteristics and data structure

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with recursion definition; the input values are The FP-growth algorithm

The FP-growth algorithm

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth  <- function (r,p,f,MIN_SUP){
    RemoveInfrequentItems(r)
    if(IsPath(r)){
       y <- GetSubset(r)
       len4y <- GetLength(y)
       for(idx in 1:len4y){
          x <- MergeSet(p,y[idx])
          SetSupportCount(x, GetMinCnt(x))
          Add2Set(f,x,support_count(x))
       }
  }else{
      len4r <- GetLength(r)
      for(idx in 1:len4r){
         x <- MergeSet(p,r[idx])
         SetSupportCount(x, GetSupportCount(r[idx]))
         rx <- CreateProjectedFPTree()
         path4idx <- GetAllSubPath(PathFromRoot(r,idx))
         len4path <- GetLength(path4idx)
         for( jdx in 1:len4path ){
           CountCntOnPath(r, idx, path4idx, jdx)
           InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
         }
         if( !IsEmpty(rx) ){
           FPGrowth(rx,x,f,MIN_SUP)
         }
      }
  }
}

The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemset (MFI) to which the maximality properties are applied, that is, more steps are added to check the maximal frequent itemsets instead of only frequent itemsets. This is based partially on the tidset intersection from the Eclat algorithm. The diffset, or the differential set as it is also known, is used for fast frequency testing. It is the difference between two tidsets of the corresponding items.

The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.

Here is the pseudocode before calling the GenMax algorithm, The GenMax algorithm with maximal frequent itemsets, where D is the input transaction dataset.

The GenMax algorithm with maximal frequent itemsets

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax  <- function (p,m,MIN_SUP){
  y <- GetItemsetUnion(p)
  if( SuperSetExists(m,y) ){
    return
  }
  len4p <- GetLenght(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
       xij <- MergeTidSets(p[[idx]],p[[jdx]])
         if(GetSupport(xij)>=MIN_SUP){
            AddFrequentItemset(q,xij,GetSupport(xij))
          }
     }
     if( !IsEmpty(q) ){
       GenMax(q,m,MIN_SUP)
     }else if( !SuperSetExists(m,p[[idx]]) ){
     Add2MFI(m,p[[idx]])
     }
   }
}

The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.

Here is the pseudocode before calling the Charm algorithm, The Charm algorithm with closed frequent itemsets, where D is the input transaction dataset.

The Charm algorithm with closed frequent itemsets

The Charm algorithm with closed frequent itemsets
The Charm algorithm with closed frequent itemsets

The R implementation

Here is the R source code of the main algorithm:

Charm  <- function (p,c,MIN_SUP){
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
        xij <- MergeTidSets(p[[idx]],p[[jdx]])
        if(GetSupport(xij)>=MIN_SUP){
          if( IsSameTidSets(p,idx,jdx) ){
            ReplaceTidSetBy(p,idx,xij)
            ReplaceTidSetBy(q,idx,xij)
            RemoveTidSet(p,jdx)
          }else{
             if( IsSuperSet(p[[idx]],p[[jdx]]) ){
               ReplaceTidSetBy(p,idx,xij)
               ReplaceTidSetBy(q,idx,xij)
             }else{
               Add2CFI(q,xij)
             }
          }
        }
     }
     if( !IsEmpty(q) ){
       Charm(q,c,MIN_SUP)
     }
     if( !IsSuperSetExists(c,p[[idx]]) ){
        Add2CFI(m,p[[idx]])
     }
  }
}

The algorithm to generate association rules

During the process of generating an algorithm for A-Priori frequent itemsets, the support count of each frequent itemset is calculated and recorded for further association rules mining processing, that is, for association rules mining.

To generate an association rule The algorithm to generate association rules, l is a frequent itemset. Two steps are needed:

  • First to get all the nonempty subsets of l.
  • Then, for subset X of l, The algorithm to generate association rules, the rule The algorithm to generate association rules is a strong association rule only if The algorithm to generate association rules The support count of any rule of a frequent itemset is not less than the minimum support count.

Here is the pseudocode:

The algorithm to generate association rules

The R implementation

R code of the algorithm to generate A-Priori association is as follows:

Here is the R source code of the main algorithm:AprioriGenerateRules  <- function (D,F,MIN_SUP,MIN_CONF){
  #create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in 1:len4f){
     #add rule F[[idx]] => {}
     AddRule2RuleSets(r,F[[idx]],NULL)
     c <- list()
     c[[1]] <- CreateItemSets(F[[idx]])
     h <- list()
     k <-1
     while( !IsEmptyItemSets(c[[k]]) ){
       #get heads of confident association rule in c[[k]]
       h[[k]] <- getPrefixOfConfidentRules(c[[k]],
       F[[idx]],D,MIN_CONF)
       c[[k+1]] <- CreateItemSets()

       #get candidate heads
       len4hk <- length(h[[k]])
       for(jdx in 1:(len4hk-1)){
          if( Match4Itemsets(h[[k]][jdx],
             h[[k]][jdx+1]) ){
             tempItemset <- CreateItemset
             (h[[k]][jdx],h[[k]][jdx+1][k])
             if( IsSubset2Itemsets(h[[k]],    
               tempItemset) ){
               Append2ItemSets(c[[k+1]], 
               tempItemset)
         }
       }
     }
   }
   #append all new association rules to rule set
   AddRule2RuleSets(r,F[[idx]],h)
   }
  r
}

To verify the R code, Arules and Rattle packages are applied while verifying the output.

Tip

Arules (Hahsler et al., 2011) and Rattle packages provide support for association rule analysis. AruleViz is used to visualize the output's association rules.

A-Priori algorithms

As a common strategy to design algorithms, the problem is divided into two subproblems:

  • The frequent itemset generation
  • Rule generation

The strategy dramatically decreases the search space for association mining algorithms.

Input data characteristics and data structure

As the input of the A-Priori algorithm, the original input itemset is binarized, that is, 1 represents the presence of a certain item in the itemset; otherwise, it is 0. As a default assumption, the average size of the itemset is small. The popular preprocessing method is to map each unique available item in the input dataset to a unique integer ID.

The itemsets are usually stored within databases or files and will go through several passes. To control the efficiency of the algorithm, we need to control the count of passes. During the process when itemsets pass through other itemsets, the representation format for each itemset you are interested in is required to count and store for further usage of the algorithm.

There is a monotonicity feature in the itemsets under research; this implies that every subset of a frequent itemset is frequent. This characteristic is used to prune the search space for the frequent itemset in the process of the A-Priori algorithm. It also helps compact the information related to the frequent itemset. This feature gives us an intrinsic view that focuses on smaller-sized frequent itemsets. For example, there are three frequent 2-itemsets contained by one certain frequent 3-itemset.

Tip

When we talk about k-itemsets means an itemset containing k items.

The basket is in a format called the horizontal format and contains a basket or transaction ID and a number of items; it is used as the basic input format for the A-Priori algorithm. In contrast, there is another format known as the vertical format; this uses an item ID and a series of the transaction IDs. The algorithm that works on vertical data format is left as an exercise for you.

The A-Priori algorithm

Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.

Note

One important assumption is that the items within any itemset are in a lexicographic order.

  • Join action: Given that The A-Priori algorithm is the set of frequent k-itemsets, a set of candidates to find The A-Priori algorithm is generated. Let's call it The A-Priori algorithm.
    The A-Priori algorithm
  • Prune action: The A-Priori algorithm, the size of The A-Priori algorithm, the candidate itemset, is usually much bigger than The A-Priori algorithm, to save computation cost; monotonicity characteristic of frequent itemset is used here to prune the size of The A-Priori algorithm.
    The A-Priori algorithm

Here is the pseudocode to find all the frequent itemsets:

The A-Priori algorithm
The A-Priori algorithm
The A-Priori algorithm

The R implementation

R code of the A-Priori frequent itemset generation algorithm goes here. D is a transaction dataset. Suppose MIN_SUP is the minimum support count threshold. The output of the algorithm is L, which is a frequent itemsets in D.

The output of the A-Priori function can be verified with the R add-on package, arules, which is a pattern-mining and association-rules-mining package that includes A-Priori and éclat algorithms. Here is the R code:

Apriori <- function (data, I, MIN_SUP, parameter = NULL){
  f <- CreateItemsets()
  c <- FindFrequentItemset(data,I,1, MIN_SUP)
  k <- 2
  len4data <- GetDatasetSize(data)
  while( !IsEmpty(c[[k-1]]) ){
        f[[k]] <- AprioriGen(c[k-1])
         for( idx in 1: len4data ){
             ft <- GetSubSet(f[[k]],data[[idx]])
             len4ft <- GetDatasetSize(ft)
             for( jdx in 1:len4ft ){
                IncreaseSupportCount(f[[k]],ft[jdx])
             }
         }
         c[[k]] <- FindFrequentItemset(f[[k]],I,k,MIN_SUP)
         k <- k+1
  }
  c
}

To verify the R code, the arules package is applied while verifying the output.

Tip

Arules (Hahsler et al., 2011) provides the support to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules too. A-Priori and Eclat algorithms are both available. Also cSPADE can be found in arulesSequence, the add-on for arules.

Given the item set I and the transaction dataset D (both defined in a figure in the original text), we will first sort D into an ordered list, using either a predefined order or simply the natural order of the characters, which is used here:

[Figure: the definitions of I and D, and the sorted transaction dataset]

Let's assume that the minimum support count is 5. The input dataset consists of ten transactions, T001 through T010; the item list of each transaction appeared as an image in the original text.

[Table: transactions T001 through T010 and their itemsets]

In the first scan (pass) of the dataset D, we get the count of each candidate 1-itemset in C_1. The candidate itemsets themselves appeared as images in the original text; their support counts are 6, 8, 2, 5, 2, 3, and 3.

[Table: the candidate 1-itemsets C_1 and their support counts]

After comparing each support count with the minimum support count, we get L_1: the three frequent 1-itemsets, with support counts 6, 8, and 5.

[Table: the frequent 1-itemsets L_1 and their support counts]

We then generate the candidate 2-itemsets C_2 by joining L_1 with itself. The three candidates have support counts 4, 3, and 4.

[Table: the candidate 2-itemsets C_2 and their support counts]

After comparing these support counts with the minimum support count of 5, no candidate survives, so L_2 is empty and the algorithm terminates.

A-Priori algorithm variants

The various A-Priori variants are designed mainly for efficiency and scalability. Some of these improvements are discussed in the upcoming sections.

The Eclat algorithm

The A-Priori algorithm scans the dataset as many times as the maximum length of the frequent patterns. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. Eclat explores the vertical data format, that is, <item id, tidset> instead of <tid, item id set>, to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1)-itemsets from frequent k-itemsets.

Candidate itemsets are generated by set intersection. The vertical-format structure is called a tidset, as defined earlier: the tidset of an item is the set of all transaction IDs of the transactions that contain that item.

The support count is computed by the intersection between tidsets: given two itemsets X and Y with tidsets t(X) and t(Y), the support count of X ∪ Y is the cardinality of t(X) ∩ t(Y). The pseudocode is shown in the following figure:

[Figure: Eclat pseudocode]
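Before looking at the implementation, here is a minimal base-R sketch of this intersection idea, using tidsets from the beer/chips example that appears later in this section:

t_beer  <- c(1, 2)                  # tidset of {beer}
t_chips <- c(1, 2, 4)               # tidset of {chips}
t_both  <- intersect(t_beer, t_chips)
length(t_both)                      # support count of {beer, chips}: 2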

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat <- function(p, f, MIN_SUP) {
  len4tidsets <- length(p)
  for (idx in 1:len4tidsets) {
    # every itemset in p is already frequent; record it with its support
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for (jdx in idx:len4tidsets) {
      if (ItemCompare(p[[jdx]], p[[idx]]) > 0) {
        # extend p[[idx]] by p[[jdx]]; the tidset intersection gives the support
        xab <- MergeTidSets(p[[idx]], p[[jdx]])
        if (GetSupport(xab) >= MIN_SUP) {
          AddFrequentItemset(pa, xab, GetSupport(xab))
        }
      }
    }
    # recurse on the equivalence class prefixed by p[[idx]]
    if (!IsEmptyTidSets(pa)) {
      Eclat(pa, f, MIN_SUP)
    }
  }
}

Here is the running result of one example, with I = {beer, chips, pizza, wine}. The transaction dataset in horizontal and vertical formats, respectively, is shown in the following tables:

tid  X
1    {beer, chips, wine}
2    {beer, chips}
3    {pizza, wine}
4    {chips, pizza}

x      tidset
beer   {1,2}
chips  {1,2,4}
pizza  {3,4}
wine   {1,3}

The binary format of this information is in the following table:

tid  beer  chips  pizza  wine
1    1     1      0      1
2    1     1      0      0
3    0     0      1      1
4    0     1      1      0

Before calling the Eclat algorithm, we will set MIN_SUP=2 and initialize p with the tidsets of the four 1-itemsets shown above.

The running process is illustrated in the following figure. After two iterations, we will get the frequent itemsets with their tidsets: {<beer, {1,2}>, <chips, {1,2,4}>, <pizza, {3,4}>, <wine, {1,3}>, <{beer, chips}, {1,2}>}:

[Figure: the running process of the Eclat algorithm]

The output of the Eclat function can be verified with the R add-on package, arules.
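For instance, a minimal verification sketch looks as follows; the support count of 2 over 4 transactions becomes a relative support of 0.5:

library(arules)
trans <- as(list(c("beer", "chips", "wine"),
                 c("beer", "chips"),
                 c("pizza", "wine"),
                 c("chips", "pizza")), "transactions")
freq <- eclat(trans, parameter = list(supp = 0.5))
inspect(freq)   # expect the five frequent itemsets listed above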

The FP-growth algorithm

The FP-growth algorithm is an efficient method for mining frequent itemsets in a large dataset. The main difference between FP-growth and A-Priori is that candidate itemset generation is not needed here; a pattern-growth strategy is used instead, with the FP-tree as its core data structure.

Input data characteristics and data structure

The data structure used is a hybrid of the vertical and horizontal formats; all the transaction itemsets are stored within a tree structure called a frequent pattern tree (FP-tree). Here is an example of how the structure is generated, with I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the upcoming images. Each node in the FP-tree represents an item, and the path from the root to that node represents an itemset. Each node stores its item together with the support information of the itemset given by its path.

tid  X
1    {A, B, C, D, E}
2    {A, B, C, E}
3    {A, D, E}
4    {B, E, D}
5    {B, E, C}
6    {E, C, D}
7    {E, D}

The sorted item order is listed in the following table:

item           E  D  C  B  A
support_count  7  5  4  4  3

Reorder the transaction dataset with this new decreasing order to get the new sorted transaction dataset, as shown in this table:

tid  X
1    {E, D, C, B, A}
2    {E, C, B, A}
3    {E, D, A}
4    {E, D, B}
5    {E, C, B}
6    {E, D, C}
7    {E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be ordered decreasingly by their support count. Next, get the sorted item list and remove the infrequent items. Then, reorder each itemset in the original transaction dataset by this order, as the following sketch illustrates.
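Here is a minimal base-R sketch of this preprocessing on the dataset above; note that the tie between B and C (both with a support count of 4) may be broken differently from the book's table:

transactions <- list(c("A","B","C","D","E"), c("A","B","C","E"), c("A","D","E"),
                     c("B","E","D"), c("B","E","C"), c("E","C","D"), c("E","D"))
MIN_SUP <- 3
counts <- sort(table(unlist(transactions)), decreasing = TRUE)  # E, D, then C/B, A
frequent <- names(counts)[counts >= MIN_SUP]    # all five items survive here
reordered <- lapply(transactions, function(tx) frequent[frequent %in% tx])
reordered[[1]]   # transaction 1 in decreasing support order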

Given MIN_SUP=3, the itemsets are processed according to this logic, as illustrated in the following figure:

[Figure: inserting the sorted transactions into the FP-tree]

The results after inserting transactions 4 and 7 are shown here; the process is simple and straightforward:

[Figure: the FP-tree after inserting transactions 4 and 7]

A header table is usually maintained together with the frequent pattern tree; each record of the header table stores a link to the nodes of one specific item.

[Figure: the complete FP-tree with its header table]

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent patterns or itemsets. The items are removed from the frequent pattern tree in reverse support order, that is, starting from the leaves; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with its recursive definition; the input values are the FP-tree r, the current prefix itemset p, the result set f, and MIN_SUP:

[Figure: FP-growth pseudocode]

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth <- function(r, p, f, MIN_SUP) {
  RemoveInfrequentItems(r)
  if (IsPath(r)) {
    # base case: the tree is a single path, so every subset of it is frequent
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for (idx in 1:len4y) {
      x <- MergeSet(p, y[idx])
      # the support of the combined itemset is the minimum count on the path
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, GetSupportCount(x))
    }
  } else {
    len4r <- GetLength(r)
    for (idx in 1:len4r) {
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      # build the projected (conditional) FP-tree for the item r[idx]
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for (jdx in 1:len4path) {
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      # recurse with the extended prefix x on the projected tree
      if (!IsEmpty(rx)) {
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}

The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemsets (MFI), to which a maximality check is applied; that is, extra steps are added to keep only the maximal frequent itemsets instead of all frequent itemsets. It is based partially on the tidset intersection from the Eclat algorithm. The diffset, or differential set, is used for fast frequency testing; it is the difference between the tidsets of two corresponding itemsets.
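Here is a minimal sketch of the diffset idea with illustrative tidsets; for a prefix P extended by an item X, d(PX) = t(P) − t(X), and sup(PX) = sup(P) − |d(PX)|:

t_P  <- c(1, 2, 4)                     # tidset of the prefix P, sup(P) = 3
t_X  <- c(1, 2)                        # tidset of the extension item X
d_PX <- setdiff(t_P, t_X)              # diffset: the transactions lost, {4}
sup_PX <- length(t_P) - length(d_PX)   # sup(PX) = 2, without storing t(PX)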

The candidate MFI is determined by its definition: assuming M is the set of MFIs found so far, if there is an X in M that is a superset of a newly found frequent itemset Y, then Y is discarded; if, instead, X is a subset of Y, then X should be removed from M.
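A minimal base-R sketch of this check, with itemsets as character vectors and M as a plain list, could look as follows; update_mfi is our own helper name:

update_mfi <- function(M, y) {
  # discard y if some itemset x already in M is a superset of y
  if (any(vapply(M, function(x) all(y %in% x), logical(1)))) return(M)
  # otherwise drop every x in M that is a subset of y, then add y
  M <- M[!vapply(M, function(x) all(x %in% y), logical(1))]
  c(M, list(y))
}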

Here is the pseudocode of the initialization that is performed before calling the GenMax algorithm, where D is the input transaction dataset:

[Figure: GenMax pseudocode]

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax <- function(p, m, MIN_SUP) {
  # if the union of all itemsets in p is already covered, prune this branch
  y <- GetItemsetUnion(p)
  if (SuperSetExists(m, y)) {
    return()
  }
  len4p <- GetLength(p)
  for (idx in 1:len4p) {
    q <- GenerateFrequentTidSet()
    for (jdx in (idx + 1):len4p) {
      # extend p[[idx]] by p[[jdx]] through tidset intersection
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        AddFrequentItemset(q, xij, GetSupport(xij))
      }
    }
    if (!IsEmpty(q)) {
      GenMax(q, m, MIN_SUP)
    } else if (!SuperSetExists(m, p[[idx]])) {
      # p[[idx]] cannot be extended; keep it if no superset is already in m
      Add2MFI(m, p[[idx]])
    }
  }
}

The Charm algorithm with closed frequent itemsets

Closure checks are performed while mining closed frequent itemsets. A closed frequent itemset is one of the longest frequent patterns among those sharing the same support, which allows us to prune the frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
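As a quick illustration, closed frequent itemsets can also be obtained with arules, reusing the transactions from the Eclat example; is.closed flags the closed ones among the frequent itemsets:

library(arules)
trans <- as(list(c("beer", "chips", "wine"), c("beer", "chips"),
                 c("pizza", "wine"), c("chips", "pizza")), "transactions")
freq <- eclat(trans, parameter = list(supp = 0.5))
inspect(freq[is.closed(freq)])   # {beer} drops out: t({beer}) = t({beer, chips})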

Here is the pseudocode of the initialization that is performed before calling the Charm algorithm, where D is the input transaction dataset:

[Figure: Charm pseudocode]

The R implementation

Here is the R source code of the main algorithm:

Charm <- function(p, c, MIN_SUP) {
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for (idx in 1:len4p) {
    q <- GenerateFrequentTidSet()
    for (jdx in (idx + 1):len4p) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        if (IsSameTidSets(p, idx, jdx)) {
          # identical tidsets: the two itemsets always occur together, so merge
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        } else {
          if (IsSuperSet(p[[idx]], p[[jdx]])) {
            # t(p[[idx]]) contains t(p[[jdx]]): extend p[[idx]] in place
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          } else {
            Add2CFI(q, xij)
          }
        }
      }
    }
    if (!IsEmpty(q)) {
      Charm(q, c, MIN_SUP)
    }
    # keep p[[idx]] as closed only if no superset is already recorded in c
    if (!IsSuperSetExists(c, p[[idx]])) {
      Add2CFI(c, p[[idx]])
    }
  }
}

The algorithm to generate association rules

During the A-Priori frequent itemset generation process, the support count of each frequent itemset is calculated and recorded, so it is available for the subsequent association rules mining.

To generate an association rule from a frequent itemset l, two steps are needed:

  • First, get all the nonempty proper subsets of l.
  • Then, for each such subset X of l, the rule X => (l − X) is a strong association rule only if confidence(X => l − X) = support_count(l) / support_count(X) >= MIN_CONF. Note that the support count of any rule derived from a frequent itemset automatically meets the minimum support count, so only the confidence needs to be checked; a sketch follows this list.
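Here is a minimal base-R sketch of these two steps for a single frequent itemset l; support_count is a hypothetical lookup from an itemset to its recorded support count:

generate_rules <- function(l, support_count, MIN_CONF) {
  rules <- list()
  if (length(l) < 2) return(rules)   # a rule needs a nonempty head and body
  for (k in 1:(length(l) - 1)) {
    for (X in combn(l, k, simplify = FALSE)) {      # nonempty proper subsets
      conf <- support_count(l) / support_count(X)   # confidence of X => l - X
      if (conf >= MIN_CONF)
        rules[[length(rules) + 1]] <-
          list(lhs = X, rhs = setdiff(l, X), confidence = conf)
    }
  }
  rules
}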

Here is the pseudocode:

[Figure: pseudocode of the A-Priori rule generation algorithm]

The R implementation

The R code of the algorithm to generate A-Priori association rules is as follows:

AprioriGenerateRules <- function(D, F, MIN_SUP, MIN_CONF) {
  # create an empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for (idx in 1:len4f) {
    # add the rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while (!IsEmptyItemSets(c[[k]])) {
      # get the heads (consequents) of the confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k + 1]] <- CreateItemSets()

      # build the candidate heads for the next round from adjacent confident heads
      len4hk <- length(h[[k]])
      for (jdx in 1:(len4hk - 1)) {
        if (Match4Itemsets(h[[k]][jdx], h[[k]][jdx + 1])) {
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx + 1][k])
          if (IsSubset2Itemsets(h[[k]], tempItemset)) {
            Append2ItemSets(c[[k + 1]], tempItemset)
          }
        }
      }
      # advance to the next head size
      k <- k + 1
    }
    # append all new association rules to the rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}

To verify this R code, the arules and Rattle packages can be applied to cross-check the output.

Tip

arules (Hahsler et al., 2011) and Rattle provide support for association rule analysis. arulesViz can be used to visualize the resulting association rules.
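For instance, a minimal end-to-end sketch with illustrative thresholds looks as follows, again reusing the beer/chips transactions:

library(arules)
library(arulesViz)
trans <- as(list(c("beer", "chips", "wine"), c("beer", "chips"),
                 c("pizza", "wine"), c("chips", "pizza")), "transactions")
rules <- apriori(trans, parameter = list(supp = 0.5, conf = 0.8))
inspect(rules)                  # {beer} => {chips} with confidence 1.0
plot(rules, method = "graph")   # other methods include "scatterplot" and "matrix"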

            ReplaceTidSetBy(q,idx,xij)
            RemoveTidSet(p,jdx)
          }else{
             if( IsSuperSet(p[[idx]],p[[jdx]]) ){
               ReplaceTidSetBy(p,idx,xij)
               ReplaceTidSetBy(q,idx,xij)
             }else{
               Add2CFI(q,xij)
             }
          }
        }
     }
     if( !IsEmpty(q) ){
       Charm(q,c,MIN_SUP)
     }
     if( !IsSuperSetExists(c,p[[idx]]) ){
        Add2CFI(m,p[[idx]])
     }
  }
}
The algorithm to generate association rules

During the process of generating an algorithm for A-Priori frequent itemsets, the support count of each frequent itemset is calculated and recorded for further association rules mining processing, that is, for association rules mining.

To generate an association rule The algorithm to generate association rules, l is a frequent itemset. Two steps are needed:

  • First to get all the nonempty subsets of l.
  • Then, for subset X of l, The algorithm to generate association rules, the rule The algorithm to generate association rules is a strong association rule only if The algorithm to generate association rules The support count of any rule of a frequent itemset is not less than the minimum support count.

Here is the pseudocode:

The algorithm to generate association rules

The R implementation

R code of the algorithm to generate A-Priori association is as follows:

Here is the R source code of the main algorithm:AprioriGenerateRules  <- function (D,F,MIN_SUP,MIN_CONF){
  #create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in 1:len4f){
     #add rule F[[idx]] => {}
     AddRule2RuleSets(r,F[[idx]],NULL)
     c <- list()
     c[[1]] <- CreateItemSets(F[[idx]])
     h <- list()
     k <-1
     while( !IsEmptyItemSets(c[[k]]) ){
       #get heads of confident association rule in c[[k]]
       h[[k]] <- getPrefixOfConfidentRules(c[[k]],
       F[[idx]],D,MIN_CONF)
       c[[k+1]] <- CreateItemSets()

       #get candidate heads
       len4hk <- length(h[[k]])
       for(jdx in 1:(len4hk-1)){
          if( Match4Itemsets(h[[k]][jdx],
             h[[k]][jdx+1]) ){
             tempItemset <- CreateItemset
             (h[[k]][jdx],h[[k]][jdx+1][k])
             if( IsSubset2Itemsets(h[[k]],    
               tempItemset) ){
               Append2ItemSets(c[[k+1]], 
               tempItemset)
         }
       }
     }
   }
   #append all new association rules to rule set
   AddRule2RuleSets(r,F[[idx]],h)
   }
  r
}

To verify the R code, Arules and Rattle packages are applied while verifying the output.

Tip

Arules (Hahsler et al., 2011) and Rattle packages provide support for association rule analysis. AruleViz is used to visualize the output's association rules.

A-Priori algorithm variants

The variants of the A-Priori algorithm are designed mainly for efficiency and scalability. Some of these improvements are discussed in the upcoming sections.

The Eclat algorithm

The A-Priori algorithm scans the transaction dataset as many times as the length of the longest frequent pattern. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, that is, <item id, tidset> instead of <tid, item id set>, to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1)-itemsets from k-itemsets.

The candidate itemsets are generated by set intersection. The vertical-format structure is called a tidset, as defined earlier: if all the transaction IDs of the transactions that contain an item I are stored together, this set is the tidset of that item.

The support count is computed by intersecting tidsets. Given two itemsets X and Y with tidsets t(X) and t(Y), the support count of X ∪ Y is the cardinality of t(X) ∩ t(Y). The pseudocode of the algorithm is shown in the following figure:

[Figure: pseudocode of the Eclat algorithm]
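
As a minimal illustration of this support computation, here is a sketch with our own helper name, assuming tidsets are stored as integer vectors:

# the support count of the union of two itemsets is the cardinality
# of the intersection of their tidsets
tidset_support <- function(tid_x, tid_y) {
  length(intersect(tid_x, tid_y))
}

tidset_support(c(1, 2), c(1, 2, 4))  # tidsets {1,2} and {1,2,4} intersect in 2 transactions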

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat <- function (p, f, MIN_SUP){
  len4tidsets <- length(p)
  for(idx in 1:len4tidsets){
    # record the current frequent itemset together with its support
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    # start a new (empty) equivalence class
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for(jdx in idx:len4tidsets){
      if(ItemCompare(p[[jdx]], p[[idx]]) > 0){
        # candidate generated by tidset intersection
        xab <- MergeTidSets(p[[idx]], p[[jdx]])
        if(GetSupport(xab) >= MIN_SUP){
          AddFrequentItemset(pa, xab, GetSupport(xab))
        }
      }
    }
    # recurse on the new equivalence class if it is not empty
    if(!IsEmptyTidSets(pa)){
      Eclat(pa, f, MIN_SUP)
    }
  }
}

Here is the running result of one example with I = {beer, chips, pizza, wine}. The transaction dataset in horizontal and vertical formats, respectively, is shown in the following tables:

tid   X
1     {beer, chips, wine}
2     {beer, chips}
3     {pizza, wine}
4     {chips, pizza}

x       tidset
beer    {1,2}
chips   {1,2,4}
pizza   {3,4}
wine    {1,3}

The binary format of this information is in the following table.

tid   beer   chips   pizza   wine
1     1      1       0       1
2     1      1       0       0
3     0      0       1       1
4     0      1       1       0

Before calling the Eclat algorithm, we set MIN_SUP=2; f is initially empty, and p holds the frequent 1-itemsets together with their tidsets.

The running process is illustrated in the following figure. After two iterations, we get the frequent itemsets with their tidsets, {<beer, {1,2}>, <chips, {1,2,4}>, <pizza, {3,4}>, <wine, {1,3}>, <{beer, chips}, {1,2}>}:

[Figure: the running process of the Eclat algorithm]

The output of the Eclat function can be verified with the R add-on package, arules.
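
To make the verification concrete, here is a hedged sketch using the arules package: we rebuild the four example transactions from the binary table above and run the package's eclat() implementation. MIN_SUP=2 out of 4 transactions corresponds to a relative support of 0.5; the variable names are ours:

library(arules)

# rebuild the example transactions from the binary table above
m <- matrix(c(1, 1, 0, 1,
              1, 1, 0, 0,
              0, 0, 1, 1,
              0, 1, 1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("tid", 1:4),
                            c("beer", "chips", "pizza", "wine")))
trans <- as(m == 1, "transactions")

# MIN_SUP = 2 of 4 transactions, that is, a relative support of 0.5
freq <- eclat(trans, parameter = list(support = 0.5))
inspect(freq)  # should list the five frequent itemsets shown above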

The FP-growth algorithm

The FP-growth algorithm is an efficient method for mining frequent itemsets in large datasets. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of candidate itemsets is not needed here; the pattern-growth strategy is used instead, and the frequent pattern tree (FP-tree) is the core data structure.

Input data characteristics and data structure

The data structure used is a hybrid of the vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is an example of the generation of this structure with I = {A, B, C, D, E, F}; the transaction dataset D is given in the following table, and the FP-tree building process is shown in the upcoming images. Each node in the FP-tree represents an item, and the path from the root to that node represents an itemset; the support information of this itemset is stored in the node together with the item.

tid   X
1     {A, B, C, D, E}
2     {A, B, C, E}
3     {A, D, E}
4     {B, E, D}
5     {B, E, C}
6     {E, C, D}
7     {E, D}

The sorted item order is listed in the following table:

item

E

D

C

B

A

support_count

7

5

4

4

3

Reorder the transaction dataset by this new decreasing order to get the sorted transaction dataset, as shown in this table:

tid   X
1     {E, D, C, B, A}
2     {E, C, B, A}
3     {E, D, A}
4     {E, D, B}
5     {E, C, B}
6     {E, D, C}
7     {E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

[Figure: the itemsets processed with MIN_SUP=3]
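
As a rough sketch of this preparation in plain R (the variable names are ours; the data is the transaction table above):

transactions <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)
MIN_SUP <- 3

# count the support of each item and sort in decreasing order
counts <- sort(table(unlist(transactions)), decreasing = TRUE)

# keep only the frequent items; B and C tie with a count of 4,
# and the text breaks the tie as E, D, C, B, A
frequent <- names(counts)[counts >= MIN_SUP]

# reorder every transaction by decreasing support, dropping the
# infrequent items; intersect() keeps the order of `frequent`
sorted_tx <- lapply(transactions, function(tx) intersect(frequent, tx))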

The results after performing steps 4 and 7 are listed here; the process of the algorithm is simple and straightforward:

[Figure: the FP-tree after steps 4 and 7]

A header table is usually bound together with the frequent pattern tree; each record of the header table stores a link to the node of the corresponding item in the tree.

[Figure: the FP-tree with its header table]

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent patterns or itemsets. The items are removed from the frequent pattern tree in reverse order of the sorted list, that is, starting from the leaves; the order is therefore A, B, C, D, and E. Using this order, we then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with a recursive definition; the input values are the FP-tree r, the current prefix itemset p, the result set of frequent itemsets f, and MIN_SUP, matching the R function that follows:

[Figure: pseudocode of the FP-growth algorithm]

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth <- function (r, p, f, MIN_SUP){
  RemoveInfrequentItems(r)
  if(IsPath(r)){
    # the tree is a single path: every subset of it is frequent
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for(idx in 1:len4y){
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, GetSupportCount(x))
    }
  }else{
    len4r <- GetLength(r)
    for(idx in 1:len4r){
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      # build the projected FP-tree for the item r[idx]
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for(jdx in 1:len4path){
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      # recurse on the projected tree if it is not empty
      if(!IsEmpty(rx)){
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}

The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemsets (MFIs), to which the maximality properties are applied; that is, extra steps are added to check for maximal frequent itemsets instead of only frequent itemsets. It is based partially on the tidset intersection from the Eclat algorithm. The diffset, or differential set, is used for fast frequency testing; it is the difference between the tidsets of two corresponding items.

A candidate MFI is checked against this definition: assuming M is the current set of MFIs, if there is an X in M that is a superset of the newly found frequent itemset Y, then Y is discarded; conversely, if X is a subset of Y, then X is removed from M. A sketch of this maintenance rule follows.
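
Here is a minimal sketch of the rule, assuming itemsets are stored as character vectors in the list m (the function name is ours):

update_mfi <- function(m, y) {
  # discard y if some known MFI is a superset of y
  if (any(vapply(m, function(x) all(y %in% x), logical(1)))) {
    return(m)
  }
  # remove every known MFI that is a subset of y, then add y
  m <- m[!vapply(m, function(x) all(x %in% y), logical(1))]
  c(m, list(y))
}

update_mfi(list(c("A", "B")), c("A", "B", "C"))  # {A,B} is replaced by {A,B,C}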

Here is the pseudocode of the preparation performed before calling the GenMax algorithm, where D is the input transaction dataset:

[Figure: pseudocode of the GenMax algorithm]

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax <- function (p, m, MIN_SUP){
  y <- GetItemsetUnion(p)
  # prune the whole branch if a known MFI already covers it
  if( SuperSetExists(m, y) ){
    return()
  }
  len4p <- GetLength(p)
  for(idx in 1:len4p){
    q <- GenerateFrequentTidSet()
    for(jdx in (idx+1):len4p){
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if(GetSupport(xij) >= MIN_SUP){
        AddFrequentItemset(q, xij, GetSupport(xij))
      }
    }
    if( !IsEmpty(q) ){
      GenMax(q, m, MIN_SUP)
    }else if( !SuperSetExists(m, p[[idx]]) ){
      # p[[idx]] cannot be extended any further; record it as maximal
      Add2MFI(m, p[[idx]])
    }
  }
}
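
The maximal frequent itemsets can also be cross-checked with the arules package, which accepts a target parameter; here is a hedged example reusing the trans object built in the earlier Eclat example:

library(arules)

mfi <- eclat(trans, parameter = list(support = 0.5,
                                     target = "maximally frequent itemsets"))
inspect(mfi)
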
The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets give the longest frequent patterns among those that have the same support, which allows us to prune the redundant frequent patterns. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.

Here is the pseudocode of the preparation performed before calling the Charm algorithm, where D is the input transaction dataset:

[Figure: pseudocode of the Charm algorithm]

The R implementation

Here is the R source code of the main algorithm:

Charm <- function (p, c, MIN_SUP){
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in 1:len4p){
    q <- GenerateFrequentTidSet()
    for(jdx in (idx+1):len4p){
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if(GetSupport(xij) >= MIN_SUP){
        if( IsSameTidSets(p, idx, jdx) ){
          # identical tidsets: the two itemsets always occur together
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        }else{
          if( IsSuperSet(p[[idx]], p[[jdx]]) ){
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          }else{
            Add2CFI(q, xij)
          }
        }
      }
    }
    if( !IsEmpty(q) ){
      Charm(q, c, MIN_SUP)
    }
    if( !IsSuperSetExists(c, p[[idx]]) ){
      Add2CFI(c, p[[idx]])
    }
  }
}
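
Similarly, the closed frequent itemsets can be cross-checked with arules, again reusing the trans object from the Eclat example:

library(arules)

cfi <- eclat(trans, parameter = list(support = 0.5,
                                     target = "closed frequent itemsets"))
inspect(cfi)
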
The algorithm to generate association rules

During the generation of the A-Priori frequent itemsets, the support count of each frequent itemset is calculated and recorded for the subsequent association rule mining.

To generate an association rule X => (l - X), where l is a frequent itemset, two steps are needed:

  • First, get all the nonempty subsets of l.
  • Then, for each nonempty subset X of l, the rule X => (l - X) is a strong association rule only if confidence(X => (l - X)) = support_count(l) / support_count(X) >= MIN_CONF. The support count of any such rule is automatically not less than the minimum support count, because l is frequent. A small sketch of the confidence test follows this list.
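
Here is a minimal sketch of the confidence test (the names are ours, with support counts taken from the earlier market basket example):

# confidence(X => l - X) = support_count(l) / support_count(X)
is_strong_rule <- function(sup_l, sup_x, min_conf) {
  (sup_l / sup_x) >= min_conf
}

# rule {chips} => {beer}: support({beer, chips}) = 2, support({chips}) = 3
is_strong_rule(sup_l = 2, sup_x = 3, min_conf = 0.6)  # TRUE, confidence = 2/3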

Here is the pseudocode:

[Figure: pseudocode of the rule generation algorithm]

The R implementation

Here is the R source code of the algorithm that generates the A-Priori association rules:

AprioriGenerateRules <- function (D, F, MIN_SUP, MIN_CONF){
  # create an empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in 1:len4f){
    # add the rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while( !IsEmptyItemSets(c[[k]]) ){
      # get the heads of the confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k+1]] <- CreateItemSets()

      # generate the candidate heads for the next level
      len4hk <- length(h[[k]])
      for(jdx in 1:(len4hk-1)){
        if( Match4Itemsets(h[[k]][jdx], h[[k]][jdx+1]) ){
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx+1][k])
          if( IsSubset2Itemsets(h[[k]], tempItemset) ){
            Append2ItemSets(c[[k+1]], tempItemset)
          }
        }
      }
      # move on to the next level of rule heads
      k <- k + 1
    }
    # append all the new association rules to the rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}

To verify the R code, the arules and rattle packages can be used to cross-check the output.

Tip

The arules package (Hahsler et al., 2011) and the rattle package provide support for association rule analysis, and arulesViz is used to visualize the resulting association rules.
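
For instance, here is a hedged sketch of such a verification, reusing the trans object built earlier:

library(arules)
library(arulesViz)

# mine strong rules directly with the apriori() implementation in arules
rules <- apriori(trans, parameter = list(support = 0.5, confidence = 0.6))
inspect(rules)

# visualize the rules with arulesViz
plot(rules)                    # support/confidence scatter plot
plot(rules, method = "graph")  # graph-based view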

The Eclat algorithm

The A-Priori algorithm loops as many times as the maximum length of the pattern somewhere. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, for example, using <item id, tid set> instead of <tid, item id set> that is, with the input data in the vertical format in the sample market basket file, or to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1) itemsets from k itemsets.

The candidate itemset is generated by set intersection. The vertical format structure is called a tidset as defined earlier. If all the transaction IDs related to the item I are stored in a vertical format transaction itemset, then the itemset is the tidset of the specific item.

The support count is computed by the intersection between tidsets. Given two tidsets, X and Y, The Eclat algorithm is the cardinality of The Eclat algorithm. The pseudocode is The Eclat algorithm, The Eclat algorithm.

The Eclat algorithm

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat  <- function (p,f,MIN_SUP){
  len4tidsets <- length(p)
  for(idx in 1:len4tidsets){
     AddFrequentItemset(f,p[[idx]],GetSupport(p[[idx]]))
     Pa <- GetFrequentTidSets(NULL,MIN_SUP)
       for(jdx in idx:len4tidsets){
         if(ItemCompare(p[[jdx]],p[[idx]]) > 0){
             xab <- MergeTidSets(p[[idx]],p[[jdx]])
             if(GetSupport(xab)>=MIN_SUP){
                AddFrequentItemset(pa,xab,
                GetSupport(xab))
               }
           }
     }
     if(!IsEmptyTidSets(pa)){
       Eclat(pa,f,MIN_SUP)
    }
  }
}

Here is the running result of one example, I = {beer, chips, pizza, wine}. The transaction dataset with horizontal and vertical formats, respectively, are shown in the following table:

tid

X

1

{beer, chips, wine}

2

{beer, chips}

3

{pizza, wine}

4

{chips, pizza}

x

tidset

beer

{1,2}

chips

{1,2,4}

pizza

{3,4}

wine

{1,3}

The binary format of this information is in the following table.

tid

beer

chips

pizza

wine

1

1

1

0

1

2

1

1

0

0

3

0

0

1

1

4

0

1

1

0

Before calling the Eclat algorithm, we will set MIN_SUP=2, The R implementation,

The R implementation

The running process is illustrated in the following figure. After two iterations, we will get frequent tidsets, {beer, 12 >, < chips, 124>, <pizza, 34>, <wine, 13>, < {beer, chips}, 12>}:

The R implementation

The output of the Eclat function can be verified with the R add-on package, arules.

The FP-growth algorithm

The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.

Input data characteristics and data structure

The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.

tid

X

1

{A, B, C, D, E}

2

{A, B, C, E}

3

{A, D, E}

4

{B, E, D}

5

{B, E, C}

6

{E, C, D}

7

{E, D}

The sorted item order is listed in the following table:

item

E

D

C

B

A

support_count

7

5

4

4

3

Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:

tid

X

1

{E, D, C, B, A}

2

{E, C, B, A}

3

{E, D, A}

4

{E, D, B}

5

{E, C, B}

6

{E, D, C}

7

{E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

Input data characteristics and data structure

The result after performing steps 4 and 7 are listed here, and the process of the algorithm is very simple and straight forward:

Input data characteristics and data structure

A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.

Input data characteristics and data structure

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with recursion definition; the input values are The FP-growth algorithm

The FP-growth algorithm

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth  <- function (r,p,f,MIN_SUP){
    RemoveInfrequentItems(r)
    if(IsPath(r)){
       y <- GetSubset(r)
       len4y <- GetLength(y)
       for(idx in 1:len4y){
          x <- MergeSet(p,y[idx])
          SetSupportCount(x, GetMinCnt(x))
          Add2Set(f,x,support_count(x))
       }
  }else{
      len4r <- GetLength(r)
      for(idx in 1:len4r){
         x <- MergeSet(p,r[idx])
         SetSupportCount(x, GetSupportCount(r[idx]))
         rx <- CreateProjectedFPTree()
         path4idx <- GetAllSubPath(PathFromRoot(r,idx))
         len4path <- GetLength(path4idx)
         for( jdx in 1:len4path ){
           CountCntOnPath(r, idx, path4idx, jdx)
           InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
         }
         if( !IsEmpty(rx) ){
           FPGrowth(rx,x,f,MIN_SUP)
         }
      }
  }
}

The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemset (MFI) to which the maximality properties are applied, that is, more steps are added to check the maximal frequent itemsets instead of only frequent itemsets. This is based partially on the tidset intersection from the Eclat algorithm. The diffset, or the differential set as it is also known, is used for fast frequency testing. It is the difference between two tidsets of the corresponding items.

The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.

Here is the pseudocode before calling the GenMax algorithm, The GenMax algorithm with maximal frequent itemsets, where D is the input transaction dataset.

The GenMax algorithm with maximal frequent itemsets

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax  <- function (p,m,MIN_SUP){
  y <- GetItemsetUnion(p)
  if( SuperSetExists(m,y) ){
    return
  }
  len4p <- GetLenght(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
       xij <- MergeTidSets(p[[idx]],p[[jdx]])
         if(GetSupport(xij)>=MIN_SUP){
            AddFrequentItemset(q,xij,GetSupport(xij))
          }
     }
     if( !IsEmpty(q) ){
       GenMax(q,m,MIN_SUP)
     }else if( !SuperSetExists(m,p[[idx]]) ){
     Add2MFI(m,p[[idx]])
     }
   }
}

The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.

Here is the pseudocode before calling the Charm algorithm, The Charm algorithm with closed frequent itemsets, where D is the input transaction dataset.

The Charm algorithm with closed frequent itemsets

The Charm algorithm with closed frequent itemsets
The Charm algorithm with closed frequent itemsets

The R implementation

Here is the R source code of the main algorithm:

Charm  <- function (p,c,MIN_SUP){
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
        xij <- MergeTidSets(p[[idx]],p[[jdx]])
        if(GetSupport(xij)>=MIN_SUP){
          if( IsSameTidSets(p,idx,jdx) ){
            ReplaceTidSetBy(p,idx,xij)
            ReplaceTidSetBy(q,idx,xij)
            RemoveTidSet(p,jdx)
          }else{
             if( IsSuperSet(p[[idx]],p[[jdx]]) ){
               ReplaceTidSetBy(p,idx,xij)
               ReplaceTidSetBy(q,idx,xij)
             }else{
               Add2CFI(q,xij)
             }
          }
        }
     }
     if( !IsEmpty(q) ){
       Charm(q,c,MIN_SUP)
     }
     if( !IsSuperSetExists(c,p[[idx]]) ){
        Add2CFI(m,p[[idx]])
     }
  }
}

The algorithm to generate association rules

During the process of generating an algorithm for A-Priori frequent itemsets, the support count of each frequent itemset is calculated and recorded for further association rules mining processing, that is, for association rules mining.

To generate an association rule The algorithm to generate association rules, l is a frequent itemset. Two steps are needed:

  • First to get all the nonempty subsets of l.
  • Then, for subset X of l, The algorithm to generate association rules, the rule The algorithm to generate association rules is a strong association rule only if The algorithm to generate association rules The support count of any rule of a frequent itemset is not less than the minimum support count.

Here is the pseudocode:

The algorithm to generate association rules

The R implementation

R code of the algorithm to generate A-Priori association is as follows:

Here is the R source code of the main algorithm:AprioriGenerateRules  <- function (D,F,MIN_SUP,MIN_CONF){
  #create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in 1:len4f){
     #add rule F[[idx]] => {}
     AddRule2RuleSets(r,F[[idx]],NULL)
     c <- list()
     c[[1]] <- CreateItemSets(F[[idx]])
     h <- list()
     k <-1
     while( !IsEmptyItemSets(c[[k]]) ){
       #get heads of confident association rule in c[[k]]
       h[[k]] <- getPrefixOfConfidentRules(c[[k]],
       F[[idx]],D,MIN_CONF)
       c[[k+1]] <- CreateItemSets()

       #get candidate heads
       len4hk <- length(h[[k]])
       for(jdx in 1:(len4hk-1)){
          if( Match4Itemsets(h[[k]][jdx],
             h[[k]][jdx+1]) ){
             tempItemset <- CreateItemset
             (h[[k]][jdx],h[[k]][jdx+1][k])
             if( IsSubset2Itemsets(h[[k]],    
               tempItemset) ){
               Append2ItemSets(c[[k+1]], 
               tempItemset)
         }
       }
     }
   }
   #append all new association rules to rule set
   AddRule2RuleSets(r,F[[idx]],h)
   }
  r
}

To verify the R code, Arules and Rattle packages are applied while verifying the output.

Tip

Arules (Hahsler et al., 2011) and Rattle packages provide support for association rule analysis. AruleViz is used to visualize the output's association rules.

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat  <- function (p,f,MIN_SUP){
  len4tidsets <- length(p)
  for(idx in 1:len4tidsets){
     AddFrequentItemset(f,p[[idx]],GetSupport(p[[idx]]))
     Pa <- GetFrequentTidSets(NULL,MIN_SUP)
       for(jdx in idx:len4tidsets){
         if(ItemCompare(p[[jdx]],p[[idx]]) > 0){
             xab <- MergeTidSets(p[[idx]],p[[jdx]])
             if(GetSupport(xab)>=MIN_SUP){
                AddFrequentItemset(pa,xab,
                GetSupport(xab))
               }
           }
     }
     if(!IsEmptyTidSets(pa)){
       Eclat(pa,f,MIN_SUP)
    }
  }
}

Here is the running result of one example, I = {beer, chips, pizza, wine}. The transaction dataset with horizontal and vertical formats, respectively, are shown in the following table:

tid

X

1

{beer, chips, wine}

2

{beer, chips}

3

{pizza, wine}

4

{chips, pizza}

x

tidset

beer

{1,2}

chips

{1,2,4}

pizza

{3,4}

wine

{1,3}

The binary format of this information is in the following table.

tid

beer

chips

pizza

wine

1

1

1

0

1

2

1

1

0

0

3

0

0

1

1

4

0

1

1

0

Before calling the Eclat algorithm, we will set MIN_SUP=2, The R implementation,

The R implementation

The running process is illustrated in the following figure. After two iterations, we will get frequent tidsets, {beer, 12 >, < chips, 124>, <pizza, 34>, <wine, 13>, < {beer, chips}, 12>}:

The R implementation

The output of the Eclat function can be verified with the R add-on package, arules.

The FP-growth algorithm

The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.

Input data characteristics and data structure

The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.

tid

X

1

{A, B, C, D, E}

2

{A, B, C, E}

3

{A, D, E}

4

{B, E, D}

5

{B, E, C}

6

{E, C, D}

7

{E, D}

The sorted item order is listed in the following table:

item

E

D

C

B

A

support_count

7

5

4

4

3

Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:

tid

X

1

{E, D, C, B, A}

2

{E, C, B, A}

3

{E, D, A}

4

{E, D, B}

5

{E, C, B}

6

{E, D, C}

7

{E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

Input data characteristics and data structure

The result after performing steps 4 and 7 are listed here, and the process of the algorithm is very simple and straight forward:

Input data characteristics and data structure

A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.

Input data characteristics and data structure

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with recursion definition; the input values are The FP-growth algorithm

The FP-growth algorithm

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth  <- function (r,p,f,MIN_SUP){
    RemoveInfrequentItems(r)
    if(IsPath(r)){
       y <- GetSubset(r)
       len4y <- GetLength(y)
       for(idx in 1:len4y){
          x <- MergeSet(p,y[idx])
          SetSupportCount(x, GetMinCnt(x))
          Add2Set(f,x,support_count(x))
       }
  }else{
      len4r <- GetLength(r)
      for(idx in 1:len4r){
         x <- MergeSet(p,r[idx])
         SetSupportCount(x, GetSupportCount(r[idx]))
         rx <- CreateProjectedFPTree()
         path4idx <- GetAllSubPath(PathFromRoot(r,idx))
         len4path <- GetLength(path4idx)
         for( jdx in 1:len4path ){
           CountCntOnPath(r, idx, path4idx, jdx)
           InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
         }
         if( !IsEmpty(rx) ){
           FPGrowth(rx,x,f,MIN_SUP)
         }
      }
  }
}
The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemset (MFI) to which the maximality properties are applied, that is, more steps are added to check the maximal frequent itemsets instead of only frequent itemsets. This is based partially on the tidset intersection from the Eclat algorithm. The diffset, or the differential set as it is also known, is used for fast frequency testing. It is the difference between two tidsets of the corresponding items.

The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.

Here is the pseudocode before calling the GenMax algorithm, The GenMax algorithm with maximal frequent itemsets, where D is the input transaction dataset.

The GenMax algorithm with maximal frequent itemsets

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax  <- function (p,m,MIN_SUP){
  y <- GetItemsetUnion(p)
  if( SuperSetExists(m,y) ){
    return
  }
  len4p <- GetLenght(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
       xij <- MergeTidSets(p[[idx]],p[[jdx]])
         if(GetSupport(xij)>=MIN_SUP){
            AddFrequentItemset(q,xij,GetSupport(xij))
          }
     }
     if( !IsEmpty(q) ){
       GenMax(q,m,MIN_SUP)
     }else if( !SuperSetExists(m,p[[idx]]) ){
     Add2MFI(m,p[[idx]])
     }
   }
}
The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.

Here is the pseudocode before calling the Charm algorithm, The Charm algorithm with closed frequent itemsets, where D is the input transaction dataset.

The Charm algorithm with closed frequent itemsets

The Charm algorithm with closed frequent itemsets
The Charm algorithm with closed frequent itemsets

The R implementation

Here is the R source code of the main algorithm:

Charm  <- function (p,c,MIN_SUP){
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
        xij <- MergeTidSets(p[[idx]],p[[jdx]])
        if(GetSupport(xij)>=MIN_SUP){
          if( IsSameTidSets(p,idx,jdx) ){
            ReplaceTidSetBy(p,idx,xij)
            ReplaceTidSetBy(q,idx,xij)
            RemoveTidSet(p,jdx)
          }else{
             if( IsSuperSet(p[[idx]],p[[jdx]]) ){
               ReplaceTidSetBy(p,idx,xij)
               ReplaceTidSetBy(q,idx,xij)
             }else{
               Add2CFI(q,xij)
             }
          }
        }
     }
     if( !IsEmpty(q) ){
       Charm(q,c,MIN_SUP)
     }
     if( !IsSuperSetExists(c,p[[idx]]) ){
        Add2CFI(m,p[[idx]])
     }
  }
}
The algorithm to generate association rules

During the process of generating an algorithm for A-Priori frequent itemsets, the support count of each frequent itemset is calculated and recorded for further association rules mining processing, that is, for association rules mining.

To generate an association rule The algorithm to generate association rules, l is a frequent itemset. Two steps are needed:

  • First to get all the nonempty subsets of l.
  • Then, for subset X of l, The algorithm to generate association rules, the rule The algorithm to generate association rules is a strong association rule only if The algorithm to generate association rules The support count of any rule of a frequent itemset is not less than the minimum support count.

Here is the pseudocode:

The algorithm to generate association rules

The R implementation

R code of the algorithm to generate A-Priori association is as follows:

Here is the R source code of the main algorithm:AprioriGenerateRules  <- function (D,F,MIN_SUP,MIN_CONF){
  #create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in 1:len4f){
     #add rule F[[idx]] => {}
     AddRule2RuleSets(r,F[[idx]],NULL)
     c <- list()
     c[[1]] <- CreateItemSets(F[[idx]])
     h <- list()
     k <-1
     while( !IsEmptyItemSets(c[[k]]) ){
       #get heads of confident association rule in c[[k]]
       h[[k]] <- getPrefixOfConfidentRules(c[[k]],
       F[[idx]],D,MIN_CONF)
       c[[k+1]] <- CreateItemSets()

       #get candidate heads
       len4hk <- length(h[[k]])
       for(jdx in 1:(len4hk-1)){
          if( Match4Itemsets(h[[k]][jdx],
             h[[k]][jdx+1]) ){
             tempItemset <- CreateItemset
             (h[[k]][jdx],h[[k]][jdx+1][k])
             if( IsSubset2Itemsets(h[[k]],    
               tempItemset) ){
               Append2ItemSets(c[[k+1]], 
               tempItemset)
         }
       }
     }
   }
   #append all new association rules to rule set
   AddRule2RuleSets(r,F[[idx]],h)
   }
  r
}

To verify the R code, Arules and Rattle packages are applied while verifying the output.

Tip

Arules (Hahsler et al., 2011) and Rattle packages provide support for association rule analysis. AruleViz is used to visualize the output's association rules.

The FP-growth algorithm

The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.

Input data characteristics and data structure

The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.

tid

X

1

{A, B, C, D, E}

2

{A, B, C, E}

3

{A, D, E}

4

{B, E, D}

5

{B, E, C}

6

{E, C, D}

7

{E, D}

The sorted item order is listed in the following table:

item

E

D

C

B

A

support_count

7

5

4

4

3

Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:

tid

X

1

{E, D, C, B, A}

2

{E, C, B, A}

3

{E, D, A}

4

{E, D, B}

5

{E, C, B}

6

{E, D, C}

7

{E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

Input data characteristics and data structure

The result after performing steps 4 and 7 are listed here, and the process of the algorithm is very simple and straight forward:

Input data characteristics and data structure

A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.

Input data characteristics and data structure

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with recursion definition; the input values are The FP-growth algorithm

The FP-growth algorithm

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth  <- function (r,p,f,MIN_SUP){
    RemoveInfrequentItems(r)
    if(IsPath(r)){
       y <- GetSubset(r)
       len4y <- GetLength(y)
       for(idx in 1:len4y){
          x <- MergeSet(p,y[idx])
          SetSupportCount(x, GetMinCnt(x))
          Add2Set(f,x,support_count(x))
       }
  }else{
      len4r <- GetLength(r)
      for(idx in 1:len4r){
         x <- MergeSet(p,r[idx])
         SetSupportCount(x, GetSupportCount(r[idx]))
         rx <- CreateProjectedFPTree()
         path4idx <- GetAllSubPath(PathFromRoot(r,idx))
         len4path <- GetLength(path4idx)
         for( jdx in 1:len4path ){
           CountCntOnPath(r, idx, path4idx, jdx)
           InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
         }
         if( !IsEmpty(rx) ){
           FPGrowth(rx,x,f,MIN_SUP)
         }
      }
  }
}

The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemset (MFI) to which the maximality properties are applied, that is, more steps are added to check the maximal frequent itemsets instead of only frequent itemsets. This is based partially on the tidset intersection from the Eclat algorithm. The diffset, or the differential set as it is also known, is used for fast frequency testing. It is the difference between two tidsets of the corresponding items.

The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.

Here is the pseudocode before calling the GenMax algorithm, The GenMax algorithm with maximal frequent itemsets, where D is the input transaction dataset.

The GenMax algorithm with maximal frequent itemsets

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax  <- function (p,m,MIN_SUP){
  y <- GetItemsetUnion(p)
  if( SuperSetExists(m,y) ){
    return
  }
  len4p <- GetLenght(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
       xij <- MergeTidSets(p[[idx]],p[[jdx]])
         if(GetSupport(xij)>=MIN_SUP){
            AddFrequentItemset(q,xij,GetSupport(xij))
          }
     }
     if( !IsEmpty(q) ){
       GenMax(q,m,MIN_SUP)
     }else if( !SuperSetExists(m,p[[idx]]) ){
     Add2MFI(m,p[[idx]])
     }
   }
}

The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.

Here is the pseudocode before calling the Charm algorithm, The Charm algorithm with closed frequent itemsets, where D is the input transaction dataset.

The Charm algorithm with closed frequent itemsets

The Charm algorithm with closed frequent itemsets
The Charm algorithm with closed frequent itemsets

The R implementation

Here is the R source code of the main algorithm:

Charm  <- function (p,c,MIN_SUP){
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
        xij <- MergeTidSets(p[[idx]],p[[jdx]])
        if(GetSupport(xij)>=MIN_SUP){
          if( IsSameTidSets(p,idx,jdx) ){
            ReplaceTidSetBy(p,idx,xij)
            ReplaceTidSetBy(q,idx,xij)
            RemoveTidSet(p,jdx)
          }else{
             if( IsSuperSet(p[[idx]],p[[jdx]]) ){
               ReplaceTidSetBy(p,idx,xij)
               ReplaceTidSetBy(q,idx,xij)
             }else{
               Add2CFI(q,xij)
             }
          }
        }
     }
     if( !IsEmpty(q) ){
       Charm(q,c,MIN_SUP)
     }
     if( !IsSuperSetExists(c,p[[idx]]) ){
        Add2CFI(m,p[[idx]])
     }
  }
}

The algorithm to generate association rules

During the process of generating an algorithm for A-Priori frequent itemsets, the support count of each frequent itemset is calculated and recorded for further association rules mining processing, that is, for association rules mining.

To generate an association rule The algorithm to generate association rules, l is a frequent itemset. Two steps are needed:

  • First to get all the nonempty subsets of l.
  • Then, for subset X of l, The algorithm to generate association rules, the rule The algorithm to generate association rules is a strong association rule only if The algorithm to generate association rules The support count of any rule of a frequent itemset is not less than the minimum support count.

Here is the pseudocode:

The algorithm to generate association rules

The R implementation

R code of the algorithm to generate A-Priori association is as follows:

Here is the R source code of the main algorithm:AprioriGenerateRules  <- function (D,F,MIN_SUP,MIN_CONF){
  #create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in 1:len4f){
     #add rule F[[idx]] => {}
     AddRule2RuleSets(r,F[[idx]],NULL)
     c <- list()
     c[[1]] <- CreateItemSets(F[[idx]])
     h <- list()
     k <-1
     while( !IsEmptyItemSets(c[[k]]) ){
       #get heads of confident association rule in c[[k]]
       h[[k]] <- getPrefixOfConfidentRules(c[[k]],
       F[[idx]],D,MIN_CONF)
       c[[k+1]] <- CreateItemSets()

       #get candidate heads
       len4hk <- length(h[[k]])
       for(jdx in 1:(len4hk-1)){
          if( Match4Itemsets(h[[k]][jdx],
             h[[k]][jdx+1]) ){
             tempItemset <- CreateItemset
             (h[[k]][jdx],h[[k]][jdx+1][k])
             if( IsSubset2Itemsets(h[[k]],    
               tempItemset) ){
               Append2ItemSets(c[[k+1]], 
               tempItemset)
         }
       }
     }
   }
   #append all new association rules to rule set
   AddRule2RuleSets(r,F[[idx]],h)
   }
  r
}

To verify the R code, Arules and Rattle packages are applied while verifying the output.

Tip

Arules (Hahsler et al., 2011) and Rattle packages provide support for association rule analysis. AruleViz is used to visualize the output's association rules.

Input data characteristics and data structure

The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.

tid

X

1

{A, B, C, D, E}

2

{A, B, C, E}

3

{A, D, E}

4

{B, E, D}

5

{B, E, C}

6

{E, C, D}

7

{E, D}

The sorted item order is listed in the following table:

item

E

D

C

B

A

support_count

7

5

4

4

3

Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:

tid

X

1

{E, D, C, B, A}

2

{E, C, B, A}

3

{E, D, A}

4

{E, D, B}

5

{E, C, B}

6

{E, D, C}

7

{E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

Input data characteristics and data structure

The result after performing steps 4 and 7 are listed here, and the process of the algorithm is very simple and straight forward:

Input data characteristics and data structure

A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.

Input data characteristics and data structure

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with recursion definition; the input values are The FP-growth algorithm

The FP-growth algorithm

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth  <- function (r,p,f,MIN_SUP){
    RemoveInfrequentItems(r)
    if(IsPath(r)){
       y <- GetSubset(r)
       len4y <- GetLength(y)
       for(idx in 1:len4y){
          x <- MergeSet(p,y[idx])
          SetSupportCount(x, GetMinCnt(x))
          Add2Set(f,x,support_count(x))
       }
  }else{
      len4r <- GetLength(r)
      for(idx in 1:len4r){
         x <- MergeSet(p,r[idx])
         SetSupportCount(x, GetSupportCount(r[idx]))
         rx <- CreateProjectedFPTree()
         path4idx <- GetAllSubPath(PathFromRoot(r,idx))
         len4path <- GetLength(path4idx)
         for( jdx in 1:len4path ){
           CountCntOnPath(r, idx, path4idx, jdx)
           InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
         }
         if( !IsEmpty(rx) ){
           FPGrowth(rx,x,f,MIN_SUP)
         }
      }
  }
}
The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemset (MFI) to which the maximality properties are applied, that is, more steps are added to check the maximal frequent itemsets instead of only frequent itemsets. This is based partially on the tidset intersection from the Eclat algorithm. The diffset, or the differential set as it is also known, is used for fast frequency testing. It is the difference between two tidsets of the corresponding items.

The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.

Here is the pseudocode before calling the GenMax algorithm, The GenMax algorithm with maximal frequent itemsets, where D is the input transaction dataset.

The GenMax algorithm with maximal frequent itemsets

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax  <- function (p,m,MIN_SUP){
  y <- GetItemsetUnion(p)
  if( SuperSetExists(m,y) ){
    return
  }
  len4p <- GetLenght(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
       xij <- MergeTidSets(p[[idx]],p[[jdx]])
         if(GetSupport(xij)>=MIN_SUP){
            AddFrequentItemset(q,xij,GetSupport(xij))
          }
     }
     if( !IsEmpty(q) ){
       GenMax(q,m,MIN_SUP)
     }else if( !SuperSetExists(m,p[[idx]]) ){
     Add2MFI(m,p[[idx]])
     }
   }
}
The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.

Here is the pseudocode of the initialization performed before calling the Charm algorithm, where D is the input transaction dataset.

[Figure: pseudocode of the Charm initialization]

The R implementation

Here is the R source code of the main algorithm:

Charm <- function (p, c, MIN_SUP) {
  # p: the current list of itemsets with their tidsets;
  # c: the set of closed frequent itemsets (CFIs) found so far
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for (idx in 1:len4p) {
    q <- GenerateFrequentTidSet()
    for (jdx in (idx + 1):len4p) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        if (IsSameTidSets(p, idx, jdx)) {
          # identical tidsets: the two itemsets always occur together,
          # so merge them and drop the redundant entry
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        } else {
          if (IsSuperSet(p[[idx]], p[[jdx]])) {
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          } else {
            Add2CFI(q, xij)
          }
        }
      }
    }
    if (!IsEmpty(q)) {
      Charm(q, c, MIN_SUP)
    }
    if (!IsSuperSetExists(c, p[[idx]])) {
      # no known CFI subsumes p[[idx]], so record it as closed
      Add2CFI(c, p[[idx]])
    }
  }
}
The algorithm to generate association rules

While the A-Priori algorithm generates frequent itemsets, the support count of each frequent itemset is calculated and recorded for the subsequent association rules mining.

To generate an association rule X => Y from a frequent itemset l, two steps are needed:

  • First, get all the nonempty subsets of l.
  • Then, for each nonempty subset X of l, the rule X => (l - X) is a strong association rule only if its confidence, support_count(l) / support_count(X), is at least the minimum confidence threshold. The support count of any rule derived from a frequent itemset automatically satisfies the minimum support count.

Here is the pseudocode:

[Figure: pseudocode of the rule generation algorithm]

The R implementation

The R code of the algorithm to generate A-Priori association rules is as follows:

AprioriGenerateRules <- function (D, F, MIN_SUP, MIN_CONF) {
  # D: transaction dataset; F: list of frequent itemsets;
  # MIN_SUP / MIN_CONF: minimum support and confidence thresholds
  # create an empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for (idx in 1:len4f) {
    # add the trivial rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()                       # candidate rule heads, level by level
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()                       # confirmed rule heads per level
    k <- 1
    while (!IsEmptyItemSets(c[[k]])) {
      # keep the heads of the confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k + 1]] <- CreateItemSets()

      # generate candidate heads for the next level by joining
      # neighboring heads of the current level
      len4hk <- length(h[[k]])
      for (jdx in 1:(len4hk - 1)) {
        if (Match4Itemsets(h[[k]][jdx], h[[k]][jdx + 1])) {
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx + 1][k])
          if (IsSubset2Itemsets(h[[k]], tempItemset)) {
            Append2ItemSets(c[[k + 1]], tempItemset)
          }
        }
      }
      k <- k + 1                      # advance to the next level
    }
    # append all new association rules for this itemset to the rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}

To verify the R code, the arules and Rattle packages can be used to cross-check the output.

Tip

The arules (Hahsler et al., 2011) and Rattle packages provide support for association rule analysis. The arulesViz package is used to visualize the mined association rules.
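
A minimal sketch of such a cross-check with the arules and arulesViz packages (the dataset and thresholds below are arbitrary choices):

library(arules)
library(arulesViz)
data("Groceries")
# mine strong rules with the reference apriori implementation
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 3))  # strongest rules by lift
plot(rules, method = "scatterplot")         # arulesViz visualization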

Hybrid association rules mining

There are two interesting applications of association rules mining: one is multilevel and multidimensional association rules mining, while the other is constraint-based mining.

Mining multilevel and multidimensional association rules

For a given transactional dataset, if a conceptual hierarchy exists over some dimensions of the dataset, we can apply multilevel association rules mining to it. Any association rules mining algorithm that works on the transaction dataset can be used for this task. The following table shows an example from the Amazon store:

  TID | Item purchased
  1   | Dell Venue 7 16 GB Tablet, HP Pavilion 17-e140us 17.3-Inch Laptop...
  2   | Samsung Galaxy Tab 3 Lite, Razer Edge Pro 256GB Tablet…
  3   | Acer C720P-2666 Chromebook, Logitech Wireless Combo MK270 with Keyboard and Mouse…
  4   | Toshiba CB35-A3120 13.3-Inch Chromebook, Samsung Galaxy Tab 3 (7-Inch, White)…

Have a look at the following flowchart that explains multilevel pattern mining:

[Figure: flowchart of multilevel pattern mining]

Based on the conceptual hierarchy, lower-level concepts can be projected onto higher-level concepts, and a new dataset built from the higher-level concepts can replace the original one.

The support counts are calculated at each conceptual level. Many A-Priori-like algorithms are designed with slightly different treatments of the support count; here is a possible list of the treatments available:

  • A uniform minimum support threshold is used across all the levels
  • Reduced minimum support threshold is used for lower levels
  • Group-based minimum support threshold

Note

The A-Priori property does not always hold across levels; for instance, when reduced minimum support thresholds are used for lower levels, an itemset can satisfy its own level's threshold even though its higher-level ancestor fails the stricter higher-level threshold.

Multilevel association rules are mined from multiple levels of the conceptual hierarchy.
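
A hedged sketch of level-wise mining with arules, assuming the transactions carry a conceptual hierarchy in their item information; the Groceries dataset shipped with arules has such a hierarchy, and the thresholds below are arbitrary:

library(arules)
data("Groceries")              # items annotated with level1/level2 concepts
# project the items onto the higher-level "level2" concepts
trans.l2 <- aggregate(Groceries, by = "level2")
# mine each level with its own minimum support threshold
rules.l1 <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
rules.l2 <- apriori(trans.l2,  parameter = list(supp = 0.05, conf = 0.5))
inspect(head(sort(rules.l2, by = "lift"), 3))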

Constraint-based frequent pattern mining

Constraint-based frequent pattern mining is a heuristic approach that uses user-specified constraints to prune the search space.

Common constraints include, but are not limited to, the following (a sketch of rule constraints in practice follows this list):

  • Knowledge-type constraint (specifies what we are going to mine)
  • Data constraint (limits to the original dataset)
  • Dimension-level constraints
  • Interestingness constraints
  • Rule constraints
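
As a sketch of rule constraints in practice, the appearance argument of apriori() in arules restricts which items may appear on either side of a rule; the dataset, item, and thresholds below are arbitrary choices:

library(arules)
data("Groceries")
# only allow "whole milk" on the right-hand side of mined rules
rules <- apriori(Groceries,
                 parameter  = list(supp = 0.001, conf = 0.5),
                 appearance = list(rhs = "whole milk", default = "lhs"))
inspect(head(sort(rules, by = "confidence"), 3))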

Mining sequence datasets

Sequential pattern mining is the major task in sequence dataset mining. A-Priori-like algorithms mine sequential patterns using a breadth-first strategy, whereas pattern-growth methods use a depth-first strategy instead. These algorithms are sometimes integrated with constraints for various reasons.

Common purchase patterns of a store's customers can be mined as sequential patterns. Sequential patterns also play an important role in other areas, especially advertising and marketing campaigns, and individual customer behavior can be predicted from them in domains such as web log mining, web page recommendation systems, bioinformatics, medical treatment sequence tracking and analysis, and disaster prevention and safety management.

The rules mined from sequential patterns in this chapter are of many types. Some of them are listed as follows:

  • A sequential rule is of the form X => Y, where X and Y are sequences and X is a subsequence of Y
  • A label sequential rule (LSR) is of the form X => Y, where Y is a sequence, and X is a sequence generated from Y by replacing some of its items with wildcards
  • A class sequential rule (CSR) is of the form X => y, where X is a sequence and y is a class label attached to the sequences in the dataset

Sequence dataset

A sequence dataset S is defined as a set of tuples, (sid, s), in which sid is a sequence ID, and s is a sequence.

The support of a sequence X in a sequence dataset S is the number of tuples in S that contain X, that is, support(X) = |{(sid, s) : (sid, s) is in S and X is a subsequence of s}|.

Here is a property intrinsic to sequential patterns, used by the related algorithms just as the A-Priori property is used by the A-Priori algorithm: for a sequence X and any subsequence Y of X, support(X) <= support(Y).
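
A minimal base-R sketch of this support count, assuming each sequence is a plain vector of items (itemsets within sequence elements are ignored for brevity); is.subseq is a hypothetical helper that checks order-preserving containment:

# TRUE if x occurs as an (order-preserving, not necessarily contiguous)
# subsequence of s
is.subseq <- function(x, s) {
  pos <- 1L
  for (item in s) {
    if (pos <= length(x) && identical(item, x[pos])) pos <- pos + 1L
    if (pos > length(x)) return(TRUE)
  }
  length(x) == 0L
}
# support(X): the number of sequences in S that contain X
seq.support <- function(x, S) {
  sum(vapply(S, function(s) is.subseq(x, s), logical(1)))
}
S <- list(c("a", "b", "c"), c("a", "c"), c("b", "c"))
seq.support(c("a", "c"), S)  # 2
seq.support(c("c"), S)       # 3: at least as large, as the property requires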

The GSP algorithm

The generalized sequential patterns (GSP) algorithm is an A-Priori-like algorithm applied to sequential patterns. It is a level-wise algorithm with a breadth-first approach. Here is the feature list:

  • GSP is an extension of the A-Priori algorithm

    It uses the A-Priori property (downward closure), that is, given the minimum support count, if a sequence is not accepted, all its supersequences are discarded.

  • It requires multiple passes over the initial transaction dataset
  • It uses the horizontal data format
  • In each pass, the candidate set is generated by a self-join of the patterns found in the previous pass
  • In the k-th pass, a sequential pattern is accepted only if all its (k-1)-subsequences were accepted in the (k-1)-th pass

[Figure: overview of the GSP algorithm]

Here is the pseudocode:

[Figure: pseudocode of the GSP algorithm]

The R implementation

Here is the R source code of the main algorithm:

GSP <- function (d, I, MIN_SUP) {
  # d: the sequence dataset; I: the set of items; MIN_SUP: minimum support
  f <- NULL
  c <- list()
  c[[1]] <- CreateInitialPrefixTree(NULL)
  len4I <- GetLength(I)
  for (idx in 1:len4I) {
    SetSupportCount(I[idx], 0)
    AddChild2Node(c[[1]], I[idx], NULL)
  }
  k <- 1
  while (!IsEmpty(c[[k]])) {
    # one pass over the dataset counts the support of all level-k candidates
    ComputeSupportCount(c[[k]], d)
    while (TRUE) {
      r <- GetLeaf(c[[k]])
      if (is.null(r)) {
        break
      }
      if (GetSupport(r) >= MIN_SUP) {
        AddFrequentItemset(f, r, GetSupport(r))
      } else {
        # prune infrequent leaves before extending the prefix tree
        RemoveLeaf(c[[k]], r)
      }
    }
    # self-join the surviving patterns to build the level-(k+1) candidates
    c[[k + 1]] <- ExtendPrefixTree(c[[k]])
    k <- k + 1
  }
  f
}

The SPADE algorithm

Sequential Pattern Discovery using Equivalence classes (SPADE) is a vertical-format sequence mining algorithm applied to sequential patterns; it has a depth-first approach. Here are the features of the SPADE algorithm:

  • SPADE is an extension of the A-Priori algorithm
  • It uses the A-Priori property
  • Multiple passes over the initial transaction dataset are required
  • The vertical data format is used
  • It uses a simple join operation
  • All sequences are found in three passes over the dataset

[Figure: overview of the SPADE algorithm]

Here is the pseudocode of the initialization performed before calling the SPADE algorithm:

[Figure: pseudocode of the SPADE initialization]

The R implementation

Here is the R source code of the main algorithm:

SPADE <- function (p, f, k, MIN_SUP) {
  # p: the current equivalence class of k-sequences with their id-lists;
  # f: the set of frequent sequences found so far
  len4p <- GetLength(p)
  for (idx in 1:len4p) {
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for (jdx in 1:len4p) {
      # joining the two id-lists yields a candidate (k+1)-sequence
      xab <- CreateTidSets(p[[idx]], p[[jdx]], k)
      if (GetSupport(xab) >= MIN_SUP) {
        AddFrequentTidSets(pa, xab)
      }
    }
    if (!IsEmptyTidSets(pa)) {
      # recurse on the newly formed equivalence class
      SPADE(pa, f, k + 1, MIN_SUP)
    }
  }
}

Rule generation from sequential patterns

Sequential rules, label sequential rules, and class sequential rules can be generated from the sequential patterns that you obtain from the sequential pattern discovery algorithms described earlier.
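
In practice, the arulesSequences package, an arules extension that implements the cSPADE algorithm, can mine sequential patterns and induce rules from them; the bundled zaki toy dataset and the thresholds below are arbitrary choices:

library(arulesSequences)
data("zaki")                           # a small example sequence dataset
# mine frequent sequential patterns with cSPADE
seqs <- cspade(zaki, parameter = list(support = 0.4))
# induce sequential rules from the mined patterns
rules <- ruleInduction(seqs, confidence = 0.5)
inspect(rules)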

High-performance algorithms

As dataset sizes grow, there is a steady demand for high-performance association and pattern mining algorithms.

With the introduction of Hadoop and other MapReduce-like platforms, there is a chance to meet these requirements; I will discuss this further in the upcoming chapters. Depending on the size of the dataset, some algorithms need to be revised and adjusted; for example, a recursive algorithm will eventually run out of call-stack space on very deep inputs and can be challenging to convert to MapReduce. A sketch of the usual workaround follows.
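
A hedged sketch of the usual workaround for a generic depth-first enumeration: the recursion is replaced by an explicit stack, so the traversal depth is no longer limited by the call stack. ProcessNode and Children are hypothetical helpers standing in for the pattern-specific work:

dfs.iterative <- function(root) {
  stack <- list(root)
  while (length(stack) > 0) {
    # pop the most recently pushed node
    node <- stack[[length(stack)]]
    stack[[length(stack)]] <- NULL
    ProcessNode(node)                 # hypothetical: emit/record the pattern
    for (child in Children(node)) {   # hypothetical: candidate extensions
      stack[[length(stack) + 1]] <- child
    }
  }
}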

Time for action

To enhance your knowledge of this chapter, here are some practice questions to help you understand the concepts better:

  • Write an R program to find how many unique item names are contained in the given sample market basket transaction file. Map each item name to a unique integer ID. Find all the closed frequent itemsets, then all the maximal frequent itemsets together with their support counts, trying various support count thresholds yourself.
  • Write an R program to implement the A-PrioriTid algorithm.

Summary

In this chapter, we looked at the following topics:

  • Market basket analysis
  • As the first step of association rule mining, finding frequent itemsets is the key task. Along with the algorithm designs, closed itemsets and maximal frequent itemsets were defined too.
  • As the target of association rule mining, association rules are mined with the measures of support count and confidence. Correlation rules are mined with correlation formulae, in addition to the support count.
  • The monotonicity of frequent itemsets: if an itemset is frequent, then all its subsets are frequent.
  • The A-Priori algorithm, the first efficient algorithm for mining frequent patterns, from which many variants originated.
  • Frequent patterns in sequences.

The next chapter will cover the basic classification algorithms, a major application of data mining, including ID3, C4.5, and CART.