R: Mining spatial, text, web, and social media data

By: Nathan H. Danneman, Richard Heimann, Pradeepta Mishra, Bater Makhabel

Overview of this book

Data mining is the first step to understanding data and making sense of heaps of data. Properly mined data forms the basis of all data analysis and computing performed on it. This learning path will take you from the very basics of data mining to advanced data mining techniques, and will end with a specialized branch of data mining: social media mining. You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, association, and correlation while working with R programs. You will discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on RHadoop projects. Once you are comfortable with data mining with R, you will move on to implementing your knowledge with the help of end-to-end data mining projects. You will learn how to apply different mining concepts to various statistical and data applications in a wide range of fields. At this stage, you will be able to complete complex data mining cases and handle any issues you might encounter during projects. After this, you will gain hands-on experience of generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially generated data while providing a theoretical background to accurately interpret your findings. You will be shown R code and examples of data that can be used as a springboard as you get the chance to undertake your own analyses of business, social, or political data.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Learning Data Mining with R by Bater Makhabel
  • R Data Mining Blueprints by Pradeepta Mishra
  • Social Media Mining with R by Nathan Danneman and Richard Heimann

Chapter 3. Classification

In this chapter, you will learn about popular classification algorithms written in the R language, along with empirical benchmarks of classifier performance and accuracy. In addition to introducing the various classification algorithms, you will also learn ways to improve a classifier.

Classification has massive applications in modern life. With the exponential growth of information, there is a need for high-performance classification algorithms that judge which of a set of predefined categories an event or object belongs to. Such algorithms have unlimited opportunity for implementation in a wide variety of industries such as bioinformatics, cybercrime, and banking. Successful classification algorithms use predefined categories from training datasets to predict the unknown category of a new event given a common set of features.

Along with the continual growth of computer science, classification algorithms need to be implemented on many diverse platforms, including distributed infrastructures, cloud environments, real-time devices, and parallel computing systems.

In this chapter, we will cover the following topics:

  • Classification
  • Generic decision tree induction
  • High-value credit card customers classification using ID3
  • Web spam detection using C4.5
  • Web key resource page judgment using CART
  • Trojan traffic identification method and Bayes classification
  • Spam e-mail identification and Naïve Bayes classification
  • Rule-based classification and the player types in computer games

Classification

Given a set of predefined class labels, the task of classification is to assign a label to each data object of the input dataset using the classifier's training model. Typically, the input values can be discrete or continuous, but the output is a discrete binary or nominal value, and so forth. Classification algorithms are often described as learning a model or function of the form

y = f(x)

in which x is a tuple of attributes with discrete or continuous values, and y is an attribute with discrete values such as categorical labels.

This function can also be treated as a classification model. It can be used to distinguish objects belonging to different classes, or to predict the class y of a new tuple x in the pair (x, y) above. From another point of view, classification algorithms aim to find a model from the input data and apply that model to future classification predictions when given a common set of attributes.

Generally speaking, the input to a classification system is a set of selected attributes (features). There are special algorithms used to select only the useful attributes from this set to ensure the efficiency of the classification system.

Almost every classification task needs this preprocessing procedure, but the exact means vary from case to case. Here are three mainstream methods applied (a minimal R sketch of them follows the list):

  • Data cleaning
  • Relevance analysis
  • Data transformation and reduction
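
As a rough illustration only, here is a minimal base-R sketch of these three steps on a made-up data frame; the column names, the correlation filter, and its 0.6 threshold are all assumptions chosen for the example, not prescriptions from the text:

set.seed(1)
# Hypothetical customer data frame with missing values and a noisy column.
df <- data.frame(
  age    = c(25, 31, NA, 45, 52, 29),
  income = c(30000, 54000, 41000, 99000, 120000, NA),
  noise  = rnorm(6),                       # an attribute unrelated to the class
  class  = factor(c("no", "no", "yes", "yes", "yes", "no"))
)

# Data cleaning: drop rows with missing values (imputation is another option).
df.clean <- na.omit(df)

# Relevance analysis: keep numeric attributes whose correlation with the class
# (coded 0/1) exceeds an arbitrary threshold; this is a crude filter, not a rule.
y <- as.numeric(df.clean$class) - 1
attrs <- df.clean[, -ncol(df.clean)]
relevance <- sapply(attrs, function(col) abs(cor(col, y)))
keep <- names(relevance)[relevance > 0.6]

# Data transformation and reduction: scale the retained attributes.
df.ready <- data.frame(scale(df.clean[, keep, drop = FALSE]),
                       class = df.clean$class)
str(df.ready)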

A standard classification process often includes two steps; the classification model that achieves acceptable accuracy is then accepted as the classifier for classifying datasets in production. The following two steps are illustrated with an example in the diagram, and a short rpart example follows it:

  • Training (supervised learning): The classification model is built upon the training dataset, that is, the (instance, class label) pairs
  • Classification validation: The accuracy of the model is checked with the test dataset to decide whether to accept the model
[Figure: the two-step classification process (training and validation)]
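
To make the two steps concrete, here is a minimal sketch using the rpart package on the built-in iris data; the 70/30 split and the idea of comparing accuracy against an application-specific threshold are illustrative assumptions:

library(rpart)

set.seed(42)
idx   <- sample(nrow(iris), size = floor(0.7 * nrow(iris)))
train <- iris[idx, ]    # (instance, class label) pairs used for training
test  <- iris[-idx, ]   # held-out data used for validation

# Step 1, training: build the classification model on the training set.
model <- rpart(Species ~ ., data = train, method = "class")

# Step 2, validation: check accuracy on the test set before accepting the model.
pred     <- predict(model, newdata = test, type = "class")
accuracy <- mean(pred == test$Species)
accuracy   # accept the model only if this is high enough for the application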

In the following sections, we will introduce some classification algorithms with different designs.

Generic decision tree induction

There are various definitions of the term decision tree. Most commonly, a decision tree represents the process of judging the class of a given data instance or record by following tests from the root node down to some leaf node. As a major classification model, decision tree induction builds a decision tree as a classification model from the input dataset and class label pairs. A decision tree can be applied to nominal, categorical, numeric, and symbolic attribute types, and to mixtures of them, but is not limited to these. The following figure illustrates Hunt's decision tree algorithm; its step 7 applies a selected attribute test condition to partition the records into smaller datasets.

[Figure: Hunt's decision tree induction algorithm]

The decision tree is popular for its simplicity and low computational cost compared to other algorithms. Here are some characteristics of decision tree induction:

  • The greedy strategy is usually applied to the decision tree.
  • It infers the decision tree once, over the entire training dataset.
  • The algorithm requires no parameters to obtain the classification model from the input dataset.
  • Like many other tasks, finding an optimal decision tree is an NP-complete problem.
  • The algorithms build decision trees quickly; tree construction is efficient, even on large datasets.
  • It provides an expressive representation for learning discrete-valued functions.
  • It is robust to noise.
  • Using a top-down, recursive partition, the divide-and-conquer strategy is applied to most of the decision tree algorithms.
  • The size of the sample dataset usually decreases dramatically when traversed down the tree.
  • A subtree can be replicated many times in the decision tree.
  • The test condition usually contains only one attribute.
  • The performance of decision tree algorithms is affected by the impurity measure.

It is time to consider a decision tree when the instances in the source dataset can be described by attribute-value pairs, the target function has discrete values, and the training dataset possibly contains some noise.

An example of a decision tree built with the input dataset in a table (classical play golf dataset) is shown in the following diagram. The decision tree is composed of three entities or concepts: the root node, the internal node, and the leaf node. The leaf is given a class label. Nodes other than the leaf conduct tests on the attribute set to determine which input data belongs to which branch (child) of the node.

[Figure: decision tree built from the play golf dataset]

Given a built decision tree, a test record can be classified easily. Starting from the root node, apply that node's test condition to the test record and follow the branch corresponding to the test result, repeating until a leaf is reached; the leaf's label decides which class the test record belongs to.

Two issues now arise. The first is how to divide the training set at a certain node as the decision tree grows, according to a test condition chosen over the available attributes; this is a question of attribute selection measures, which are illustrated in the following section. The second, equally important, issue is model overfitting.

There are two strategies for terminating the growth of the decision tree. The naïve strategy stops expanding a node when all the training records assigned to it belong to the same class, or when they all have identical attribute values; the node is then assigned the class label of the majority of its training records. The second strategy terminates the algorithm earlier, which is meant to avoid model overfitting and will be introduced in the tree pruning section.

Attribute selection measures

The node can have more than two children or branches depending on the attribute test condition and the selected attribute. To split the node, attribute selection measures with various implementations are applied; the measure used may also differ between binary and multiway branches. Some common attribute selection measures are the following (a base-R sketch computing several of them follows the list):

  • Entropy: This concept is used in information theory to describe the impurity of an arbitrary collection of data. Given a target attribute with c classes, and p_i as the proportion (probability) of S belonging to class i, entropy is defined as follows; the related Gain is defined in the next point. Entropy expresses how disordered a dataset is: the higher the entropy, the more uncertainty in the source dataset.
    Entropy(S) = -Σ_{i=1..c} p_i log2(p_i)
    The size and coverage of the training set assigned to a certain node affect how reliable these measures are; the gain behaves better in such situations.
  • Gain:
    Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
  • Gain Ratio: This is applied in the C4.5 classification algorithm using the following formula:
    GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)
  • Information Gain: The ID3 algorithm uses this statistical property to decide which attribute is selected to be tested at any node in the tree, and measures the association between inputs and outputs of the decision tree.

    With the concept of information gain, the definition of a decision tree can be thought of in this way:

    • A decision tree is a tree-structured plan that uses a set of attribute tests to predict output
    • To decide which attribute should be tested first, simply find the one with the highest information gain
    • Then recurse on each resulting branch
  • Gini Index: This is used in the CART classification algorithm. The Gini index for a specific split point is calculated using the following equation and is used to gauge the purity of the split point.
    Gini(S) = 1 - Σ_{i=1..c} p_i²
  • Split Info:
    SplitInfo(S, A) = -Σ_{v ∈ Values(A)} (|S_v| / |S|) log2(|S_v| / |S|)
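
The following base-R sketch computes entropy, information gain, and the Gini index directly from the definitions above; the tiny play-golf-style vectors are invented purely for illustration:

# Entropy of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Gini index of a vector of class labels.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Information gain achieved by splitting `labels` on the discrete attribute `attr`.
info.gain <- function(labels, attr) {
  weights <- table(attr) / length(attr)
  cond    <- sapply(split(labels, attr), entropy)
  entropy(labels) - sum(weights * cond)
}

# Toy example in the spirit of the play golf data.
play    <- c("yes", "yes", "no", "no", "yes", "no", "yes", "yes")
outlook <- c("sunny", "overcast", "sunny", "rain", "overcast", "rain", "overcast", "sunny")

entropy(play)              # impurity of the whole set
gini(play)
info.gain(play, outlook)   # gain achieved by splitting on outlook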

Tree pruning

The initial decision tree is often built with many branches that reflect outliers or noise, which are common causes of model overfitting. Tree pruning is therefore applied to the grown tree to obtain higher classification accuracy and lower error rates (a small rpart-based sketch of both strategies follows the list). The two types of pruning used in practice are as follows:

  • Post-pruning: This approach is to perform tree pruning after the tree grows to the maximum form. The cost-complexity pruning algorithm used in CART and the pessimistic pruning in C4.5 are both examples of post-pruning.
  • Pre-pruning: This is also known as the early-stopping strategy; it avoids generating an over-mature tree by stopping the growth of the tree earlier using additional restrictions, such as a threshold.
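
As a quick, hedged illustration of both strategies with an off-the-shelf CART implementation (rpart), assuming its default 10-fold cross-validation and illustrative cp values:

library(rpart)

# Pre-pruning: early-stopping controls limit the tree while it is grown.
pre <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20, maxdepth = 3, cp = 0.05))

# Post-pruning: grow a large tree first, then cut it back with
# cost-complexity pruning (the strategy used by CART).
full <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(cp = 0.0001, minsplit = 2))
printcp(full)                       # inspect the cost-complexity table
best.cp <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]
pruned  <- prune(full, cp = best.cp)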

Repetition and replication are the two major factors that make decision trees unreasonably large and inefficient.

General algorithm for the decision tree generation

Here is the pseudocode of the general decision tree induction algorithm:

[Pseudocode: generic decision tree induction]

Another variety of the algorithm is described here, with input parameters as follows:

  • D denotes the training data
  • The leaf size threshold is denoted by η (yita)
  • The leaf purity threshold is denoted by π (pi)

The output of the algorithm is a decision tree, as shown in the following screenshot:

[Pseudocode: decision tree induction with leaf size and purity thresholds]

Line 1 denotes the partition size. Line 4 denotes the stopping condition. Line 9 through line 17 try to get the two branches with the new split. Finally, line 19 applies the algorithm recursively on the two new subbranches to build the subtree. This algorithm is implemented with R in the following section.

The R implementation

The main function of the R code for the generic decision tree induction is listed as follows. Here data is the input dataset, c is the set of class labels, x is the set of attributes, and yita and pi have the same definitions as in the previous pseudocodes:

DecisionTree <- function(data, c, x, yita, pi) {
  result.tree <- NULL
  # Stop and create a leaf when the node is small or pure enough.
  if (StoppingCondition(data, c, yita, pi)) {
    result.tree <- CreateLeafNode(data, c, yita, pi)
    return(result.tree)
  }

  # Choose the best attribute test and split the data into two subsets.
  best.split <- GetBestSplit(data, c, x)
  newdata <- SplitData(data, best.split)

  # Recursively build the two subtrees and attach them to an internal node.
  tree.yes <- DecisionTree(newdata$yes, c, x, yita, pi)
  tree.no <- DecisionTree(newdata$no, c, x, yita, pi)
  result.tree <- CreateInternalNode(data, best.split, tree.yes, tree.no)

  result.tree
}

One sample dataset, the weather dataset, is chosen to verify the generic decision tree induction algorithm. It comes from the R package rattle and contains 366 examples with 23 attributes plus one target (the class label); in R, weather is a data frame of 366 observations of 24 variables. The details of the dataset can be retrieved with the following R code:

> library(rattle)
> str(weather)
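
The hand-rolled DecisionTree() above depends on helper functions not shown here, so as a hedged alternative you can try the same induction idea on the weather data with the off-the-shelf rpart package; dropping Date, Location, and RISK_MM assumes the usual rattle column names and that RISK_MM would leak the outcome:

library(rattle)   # provides the weather data frame
library(rpart)

data(weather, package = "rattle")

# Drop identifier columns and the leaky outcome amount before modelling.
predictors <- setdiff(names(weather),
                      c("Date", "Location", "RISK_MM", "RainTomorrow"))
form <- as.formula(paste("RainTomorrow ~", paste(predictors, collapse = " + ")))

fit <- rpart(form, data = weather, method = "class")
printcp(fit)   # inspect the grown tree's complexity table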

High-value credit card customers classification using ID3

The Iterative Dichotomiser 3 (ID3) algorithm is one of the most popular decision tree induction designs. It is not tolerant of missing values or noisy data, and attribute values must come from a finite, fixed set.

ID3 uses entropy to calculate the homogeneity of a sample and to choose the split. The information gain G for each attribute A is computed using the following equations. The attribute with the highest information gain is assigned to the root of the final tree, and the subtrees are then built recursively on each value of that attribute.

Entropy(S) = -Σ_{i=1..c} p_i log2(p_i)
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

Note

With the play golf dataset as the input dataset, you can calculate the information gain for each attribute using the preceding formulas (the R sketch after this note reproduces these numbers):

  • Entropy (root) = 0.940
  • Gain (S, Wind) = 0.048, Gain (S, Humidity) = 0.151
  • Gain (S, Temperature) = 0.029, Gain (S, Outlook) = 0.246
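
These numbers can be reproduced in a few lines of R. The 14-row play golf table below is the classical one (9 yes, 5 no); the helper functions are small re-implementations of the formulas above, not code from the book:

entropy <- function(y) { p <- table(y) / length(y); -sum(p * log2(p)) }
gain    <- function(y, a) {
  w <- table(a) / length(a)
  entropy(y) - sum(w * sapply(split(y, a), entropy))
}

play <- c("No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No")
outlook  <- c("Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
              "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain")
humidity <- c("High","High","High","High","Normal","Normal","Normal",
              "High","Normal","Normal","Normal","High","Normal","High")
wind     <- c("Weak","Strong","Weak","Weak","Weak","Strong","Strong",
              "Weak","Weak","Weak","Strong","Strong","Weak","Strong")
temp     <- c("Hot","Hot","Hot","Mild","Cool","Cool","Cool",
              "Mild","Cool","Mild","Mild","Mild","Hot","Mild")

round(entropy(play), 3)        # 0.940
round(gain(play, wind), 3)     # 0.048
round(gain(play, humidity), 3) # 0.151
round(gain(play, temp), 3)     # 0.029
round(gain(play, outlook), 3)  # 0.246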

ID3 (like C4.5 and CART) builds the decision tree recursively in a top-down, divide-and-conquer manner, searching the space of possible decision trees with a greedy strategy. Under the greedy search strategy, at each step the decision that most improves the optimization target is made: for each node, find the test condition that best segments the training data assigned to it.

The characteristics of the decision induction tree in the case of ID3 include the following:

  • Each non-leaf node of the tree corresponds to an input attribute, and each arc to a possible value of that attribute
  • Entropy is used to determine how informative a particular input attribute is about the output class on a given dataset
  • The recursive algorithm

    Note

    A quick description of the recursive algorithm is as follows:

    • Break the original problem into two or more smaller-sized problems with the same type
    • Call the recursive algorithm on each smaller type of problem
    • Group together the results of step 2 to solve the original problem

The ID3 algorithm

The input parameters for ID3 algorithm are as follows:

  • I denotes the set of input attributes, which may be tested by the resulting decision tree
  • T denotes the set of training data objects, or training examples

The output parameter of the algorithm is as follows:

  • O denotes the set of output attributes, that is, the attributes whose values will be predicted by the tree

Here is the pseudocode of the general algorithm:

[Pseudocode: the ID3 algorithm]

The R implementation

The main function of the R code for the ID3 algorithm is listed as follows. Here data is the input training dataset, ix is the set of input attributes, and ox is the output attribute:

ID3 <- function(data, ix, ox) {
  result.tree <- NULL

  # No training data left: mark this branch as a failure leaf.
  if (IsEmpty(data)) {
    node.value <- "Failure"
    result.tree <- CreateNode(node.value)
    return(result.tree)
  }
  # All records share the same output value: create a leaf with that value.
  if (IsEqualAttributeValue(data, ox)) {
    node.value <- GetMajorityAttributeValue(data, ox)
    result.tree <- CreateNode(node.value)
    return(result.tree)
  }
  # No input attributes left to test: create a majority-value leaf.
  if (IsEmpty(ix)) {
    node.value <- GetMajorityAttributeValue(data, ox)
    result.tree <- CreateNode(node.value)
    return(result.tree)
  }

  # Choose the attribute with the highest information gain as the split.
  gain <- GetInformationGain(data, ix)
  best.split <- GetBestSplit(data, gain, ix)

  values <- GetAttributeValues(best.split)
  values.count <- GetAttributeValuesCount(best.split)
  data.subsets <- SplitData(data, best.split)

  # Create an internal node for the chosen attribute and recurse on each value.
  node.value <- best.split
  result.tree <- CreateNode(node.value)
  idx <- 0
  while (idx < values.count) {
    idx <- idx + 1
    newdata <- GetAt(data.subsets, idx)
    value <- GetAt(values, idx)
    new.ix <- RemoveAttribute(ix, best.split)
    new.child <- ID3(newdata, new.ix, ox)
    AddChildNode(result.tree, new.child, value)
  }

  result.tree
}

Web attack detection

Along with the development of information technology, many systems have emerged that identify malicious usage of software systems, web systems, and so on. One of them is the Intrusion Detection System (IDS), which detects malicious behavior and conducts content inspection that a firewall cannot. Its approaches include signature detection, anomaly detection, and so on.

Decision tree classifiers such as ID3, C4.5, and CART play an important role as analyzers, alongside the other important components of an IDS such as the sensor, manager, operator, and administrator. The detection functions to be supported here include the activity monitor, file integrity checker, host firewall, log parser, and packet pattern matching.

Many issues arise for IDS. One of them is new variants of known attack patterns, which often have a low detection rate with the existing IDS. This drives the design of new types of IDS integrated with artificial intelligence, especially decision tree technologies.

Among real-world examples, besides the IDS products already built, there are also competitions for applying data mining techniques to web attack detection. One of them is the KDD Cup. The topic for KDD Cup 1999 was computer network intrusion detection: building a classifier to predict unauthorized behavior.

The dataset for it came from the DARPA Intrusion Detection Evaluation Program. More than five million data instances are contained in the training dataset and more than two million in the test dataset. There are about 24 attack types in the training dataset, and 14 more appear only in the test dataset. Each data instance in the dataset contains 41 attributes: 9 for the TCP connection, 13 for content features contained in the TCP connection, 9 for traffic features that use a two-second time window, and the rest for host-related traffic features. All the attacks can be categorized into the following four groups:

  • DOS: This refers to denial of service
  • R2L: This refers to unauthorized access to the local machine from the remote machine
  • U2R: This refers to unauthorized access to local super-user privileges by a local unprivileged user
  • Probing: This refers to surveillance and probing

With specific transformations, the ID3 algorithm can be applied to web attack detection datasets of various sizes. As the size of the dataset increases, the performance of ID3 can be kept efficient by parallelization.

For simplicity, one example takes only the following four types of attacks to label a dataset for a simple IDS:

  • SQL Injection
  • Cross Site Scripting
  • Code Injection
  • Directory Traversal

All four attack types share a common pattern: web queries with malicious content. By normalizing the web queries (the URL and the collection of reserved tags) and labeling the specific patterns with the appropriate attack type, a training dataset can be built. After training ID3 on this dataset and applying it to the existing IDS, a better detection rate can be achieved.
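
A hedged sketch of this normalization-and-labeling step is shown below; the query strings and regular expressions are invented for illustration, and real IDS signatures are far more elaborate:

# Toy web queries; the strings and patterns below are made up for illustration.
queries <- c(
  "/search?q=R+data+mining",
  "/item?id=1 UNION SELECT password FROM users",
  "/comment?text=<script>alert(1)</script>",
  "/download?file=../../etc/passwd",
  "/run?cmd=;cat /etc/passwd"
)

# Very rough signature patterns for the four attack types.
label.query <- function(q) {
  if (grepl("union\\s+select|' or 1=1", q, ignore.case = TRUE)) return("SQL Injection")
  if (grepl("<script|javascript:", q, ignore.case = TRUE))      return("Cross Site Scripting")
  if (grepl(";|\\|\\s*cat|cmd=", q, ignore.case = TRUE))        return("Code Injection")
  if (grepl("\\.\\./", q))                                      return("Directory Traversal")
  "Normal"
}

labels <- vapply(queries, label.query, character(1), USE.NAMES = FALSE)
data.frame(query = queries, label = labels)   # a tiny labeled training table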

High-value credit card customers classification

Following the growth of credit card usage, a requirement has arisen in the banking industry: finding the high-value credit card customers among all customers in order to create a more customer-oriented strategy and increase profit. There are similar requirements, such as finding interesting rules in the dataset.

To achieve this target, we need to include the right customer attributes (of whatever type) in the training data objects. Possible choices are transaction records, usage behaviors, customer age, annual income, education background, financial assets, and so on.

There is no need to include all customer-related attributes; only the attributes most relevant to this target should be adopted. Domain experts can be helpful here.

With the appropriate attributes selected, the ID3 algorithm can then be applied to extract the discriminative features that help judge which customers are most likely to be profitable.

Web spam detection using C4.5

C4.5 is an extension of ID3. The major extensions include handling data with missing attribute values and handling attributes whose values come from a continuous (and therefore infinite) range.

It is one of the decision tree algorithms, and is also a supervised learning classification algorithm. A model is learned that maps the input attribute values to mutually exclusive class labels; the learned model is then used to classify new, unseen instances. The attribute selection measure adopted in C4.5 is the gain ratio, which avoids the bias of information gain toward attributes with many values:

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
SplitInfo(S, A) = -Σ_{v ∈ Values(A)} (|S_v| / |S|) log2(|S_v| / |S|)

Based on the generic C4.5 algorithm, a suite of variants has been derived: C4.5, C4.5-no-pruning, C4.5-rules, and so forth. All of them are referred to as C4.5, which is therefore really a suite of algorithms.

Compared to other algorithms, C4.5 has many important characteristics (a short example using an off-the-shelf C4.5 implementation follows the list):

  • Tree pruning, a post-pruning strategy, is followed to remove some of the tree structure using selected accuracy criteria
  • An improved use of continuous attributes
  • Missing values handling
  • Inducing rule sets
  • It contains a multiway test that depends on the attribute value and is not just limited to the binary test
  • Information theoretic tests are applied via the gain and gain ratio
  • Greedy learning, that is, as the tree grows, the test with the best criterion value is chosen at each step
  • The data fits in main memory (there are many extended algorithms that can use secondary storage, such as BOAT, RainForest, SLIQ, SPRINT, and so forth)
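
For a quick way to try C4.5 without implementing it by hand, the J48 learner in the RWeka package wraps Weka's open-source C4.5 implementation. This sketch, which assumes RWeka and its Java dependency are installed, builds a pruned and an unpruned tree on the built-in iris data:

library(RWeka)   # J48 is Weka's open-source C4.5 implementation

# Pruned tree (the default) and an unpruned variant for comparison.
pruned   <- J48(Species ~ ., data = iris)
unpruned <- J48(Species ~ ., data = iris, control = Weka_control(U = TRUE))

print(pruned)
summary(unpruned)   # resubstitution statistics; use cross-validation for real evaluation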

The C4.5 algorithm

Here is the pseudocode for the basic C4.5 algorithm:

[Pseudocode: the C4.5 algorithm]

The R implementation

The R code for C4.5 is listed as follows:

C45 <- function(data, x) {
  result.tree <- NULL

  # No training data left: mark this branch as a failure leaf.
  if (IsEmpty(data)) {
    node.value <- "Failure"
    result.tree <- CreateNode(node.value)
    return(result.tree)
  }
  # No attributes left to test: create a majority-class leaf.
  if (IsEmpty(x)) {
    node.value <- GetMajorityClassValue(data, x)
    result.tree <- CreateNode(node.value)
    return(result.tree)
  }
  # Only one attribute remains: create a leaf with its class value.
  if (1 == GetCount(x)) {
    node.value <- GetClassValue(x)
    result.tree <- CreateNode(node.value)
    return(result.tree)
  }

  # Choose the attribute with the highest gain ratio as the split.
  gain.ratio <- GetGainRatio(data, x)
  best.split <- GetBestSplit(data, x, gain.ratio)

  data.subsets <- SplitData(data, best.split)
  values <- GetAttributeValues(data.subsets, best.split)
  values.count <- GetCount(values)

  # Create an internal node for the split and recurse on each attribute value.
  node.value <- best.split
  result.tree <- CreateNode(node.value)
  idx <- 0
  while (idx < values.count) {
    idx <- idx + 1
    newdata <- GetAt(data.subsets, idx)
    value <- GetAt(values, idx)
    new.x <- RemoveAttribute(x, best.split)
    new.child <- C45(newdata, new.x)
    AddChildNode(result.tree, new.child, value)
  }

  result.tree
}

A parallel version with MapReduce

With the increase in dataset volume, the C4.5 algorithm can be parallelized using the MapReduce programming model, the Hadoop technology suite, and, for R code, RHadoop in particular.

The MapReduce programming model is illustrated in the following diagram:

[Figure: the MapReduce programming model]
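
The idea can be sketched in plain R: the map step independently emits (key, value) pairs from each input block, and the reduce step aggregates all values sharing a key. This is only a conceptual, single-machine sketch that counts class labels; a real RHadoop job would express the same two functions through the rmr2 package:

# Input "records": class labels scattered across several data blocks.
blocks <- list(c("spam", "ham", "spam"),
               c("ham", "ham"),
               c("spam", "ham", "spam", "spam"))

# Map: each block independently emits (key = label, value = 1) pairs.
map.fn <- function(block) lapply(block, function(label) list(key = label, value = 1))
mapped <- unlist(lapply(blocks, map.fn), recursive = FALSE)

# Shuffle: group the emitted values by key.
keys    <- sapply(mapped, `[[`, "key")
values  <- sapply(mapped, `[[`, "value")
grouped <- split(values, keys)

# Reduce: aggregate the values for each key (here, by summing the counts).
reduced <- sapply(grouped, sum)
reduced   # counts of "ham" and "spam" across all blocks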

Web spam detection

Web spamming emerged along with search engine technologies: spammers pursue a higher rank by deceiving the search engine's relevancy algorithm rather than by improving their own websites. Spamming consists of deliberate actions that trigger unjustifiably favorable relevance or importance for a specific web page, in contrast to the page's true value or merit. A spam page ultimately receives a substantial score from other spam pages, boosting its rank in search engine results by deliberately manipulating the search engine's indexes, and traffic is driven to the spammed pages. As a direct result of web spam, the information quality of the Web is degraded, the user experience is manipulated, and the security risk for users increases due to the exploitation of user privacy.

One classic example, known as a link farm, is illustrated in the following diagram: a densely connected set of pages is created with the goal of cheating a link-based ranking algorithm, a practice also called collusion:

[Figure: a link farm of densely interconnected spam pages]

There are three major categories of spam from a business point of view:

  • Page Spoofing
  • Browser-Based Attacks
  • Search Engine Manipulations

There are three major categories of spam from a technology point of view:

  • Link spam: This consists of the creation of the link structure, often a tight-knit community of links, targeted at affecting the outcome of a link-based ranking algorithm. Possible technologies include honey pot, anchor-text spam, blog/wiki spam, link exchange, link farm, and expired domains.
  • Content spam: This crafts the contents of web pages. One example is inserting unrelated keywords into the web page content for higher ranking on search engines. Possible technologies include hidden text (size and color), repetition, keyword stuffing/dilution, and language-model-based technologies (phrase stealing and dumping).
  • Cloaking: This serves content to search engine crawlers that differs from the version shown to human visitors.

Link-based spam detection usually relies on automatic classifiers or on detecting anomalous behavior of a link-based ranking algorithm. Content spam can be detected through classifier or language-model disagreement. Cloaking detection is more direct; one approach is to compare the page that was indexed with the page that visitors actually see.

How can a decision tree be applied to web spam detection? Statistical analysis shows that the links and contents of spam pages differ markedly from those of normal pages. Properties such as the trustworthiness of the page, neutrality (facts), and bias are valuable for detecting spam, which makes web spam detection a good example for illustrating the C4.5 algorithm.

Some domain-related knowledge can be applied to the classification solution. One observed phenomenon is that bad pages tend to link to one another. The links between web pages and websites are usually not set at random but follow certain patterns, and these patterns can be detected by classifiers.

As for the attributes of each data point, the features can be divided into two groups: link-based features and content-based features:

  • Link-based features: These include degree-based measures such as the in-degree and out-degree of the web hosts; PageRank-based features, where PageRank is an algorithm that computes a score for every page; and TrustRank-based features, which measure the trustworthiness of a web page relative to a benchmark set of trustworthy pages.
  • Content-based features: These include the number of words in the page, the number of words in the title, average word length, amount of anchor text, fraction of anchor text, fraction of visible text, fraction of the page drawn from globally popular words, fraction of globally popular words, compressibility, corpus precision and corpus recall, query precision and query recall, independent trigram or n-gram likelihood, and the entropy of trigrams or n-grams.

All these features are collected for each data point to build a preprocessed dataset, which can then be used by classification algorithms, especially decision tree algorithms such as C4.5, acting as web spam classifiers that distinguish spam from normal pages. Among the classification solutions compared, C4.5 achieves the best performance.
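A minimal sketch of this workflow is shown below. It assumes a data frame of host-level features with a binary spam label (the column names and the randomly generated values are hypothetical) and uses the J48 learner from the RWeka package, an open-source implementation of C4.5:

library(RWeka)   # provides J48, an open-source C4.5 implementation (requires Java)

# Hypothetical preprocessed dataset: one row per host, link-based and
# content-based features plus a spam/normal label
set.seed(1)
n <- 200
web.features <- data.frame(
  in.degree        = rpois(n, 20),
  out.degree       = rpois(n, 30),
  trust.rank       = runif(n),
  words.in.page    = rpois(n, 500),
  anchor.text.frac = runif(n),
  compressibility  = runif(n, 1, 4),
  label            = factor(sample(c("spam", "normal"), n, replace = TRUE))
)

# Grow a C4.5 decision tree that predicts the label from all features
spam.tree <- J48(label ~ ., data = web.features)
summary(spam.tree)                        # training-set confusion matrix
predict(spam.tree, web.features[1:5, ])   # predicted classes for a few hosts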

Web key resource page judgment using CART

Classification and Regression Trees (CART) is one of the most popular decision tree algorithms. It is a binary recursive partitioning algorithm that can be used to process continuous and nominal attributes.

There are three main steps in the CART algorithm: first, construct the maximum (binary) tree; second, choose the right size of tree by pruning; and last, classify new data using the resulting tree.

Compared with other algorithms, CART has several important characteristics:

  • Binary decision tree (a binary recursive partitioning process)
  • The source dataset can have continuous or nominal attributes
  • No stopping rule (unless no possible splits are available)
  • Tree pruning with cost-complexity pruning
  • Nonparametric
  • No variables to be selected in advance
  • Missing values are handled with an adaptive strategy (surrogate splits)
  • Outliers are easily handled
  • No assumptions
  • Computationally fast
  • At each split point, only one variable is used
  • Only one optimal tree is taken as the result tree, which is formed from a sequence of nested pruned candidate trees generated by CART
  • Automatically handling the missing value in the source dataset

The weighted Gini index equation is defined as follows:

Gini(D) = 1 - sum_j p_j^2

Gini_split(D) = (|D_1| / |D|) * Gini(D_1) + (|D_2| / |D|) * Gini(D_2)

Here, p_j is the fraction of instances in D that belong to class j, and D_1 and D_2 are the two partitions produced by the candidate binary split.

Unlike the Gini index, which is minimized, the CART goodness-of-split measure is proportional to the quality of the split point: the higher the value, the better.

Φ(s, t) = 2 * p_L * p_R * sum_j | p(j | t_L) - p(j | t_R) |

Here, p_L and p_R are the proportions of instances at node t sent to the left and right child by split s, and p(j | t_L) and p(j | t_R) are the proportions of class j in the two children.
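To make these formulas concrete, here is a small R sketch (the labels and the split are toy values) that computes the Gini index of a node and the weighted Gini index of a candidate binary split:

# Gini index of a set of class labels
Gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini index of splitting `labels` by a logical condition `goes.left`
WeightedGini <- function(labels, goes.left) {
  n  <- length(labels)
  nl <- sum(goes.left)
  (nl / n) * Gini(labels[goes.left]) + ((n - nl) / n) * Gini(labels[!goes.left])
}

# Example: evaluate one candidate split on a toy set of labels
labels    <- factor(c("spam", "spam", "normal", "normal", "normal"))
goes.left <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
WeightedGini(labels, goes.left)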

The CART algorithm

The complete set of splitting rules is too lengthy to include in this section; the following parts are omitted:

  • Separate handling of continuous and categorical splitters
  • Special handling for categorical splitters with multiple levels
  • Missing value handling
  • Tree pruning
  • Tree selection

In the simplified tree-growing step, the algorithm recursively chooses the best binary split for the data at the current node, partitions the data into two children, and keeps growing until the node is pure, too small, or no admissible split remains.
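A compact, runnable sketch of this growing step for categorical attributes is given below; it is illustrative only (no pruning, no surrogate splits, and the data frame layout with a named class column is an assumption), and it picks binary splits by the weighted Gini index:

# Gini index of a set of class labels
GiniIndex <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Recursively grow a binary tree over a data frame of categorical attributes;
# class.col names the column holding the class label
GrowTree <- function(data, class.col, min.size = 5) {
  labels <- data[[class.col]]

  # Stop and return a leaf when the node is pure or too small
  if (length(unique(labels)) == 1 || nrow(data) < min.size) {
    return(list(leaf = TRUE, class = names(which.max(table(labels)))))
  }

  # Scan all (attribute, value) pairs for the binary split "value v versus the
  # rest" with the smallest weighted Gini index
  best <- NULL
  for (attr in setdiff(names(data), class.col)) {
    for (v in unique(data[[attr]])) {
      goes.left <- data[[attr]] == v
      if (all(goes.left) || !any(goes.left)) next
      g <- mean(goes.left) * GiniIndex(labels[goes.left]) +
           mean(!goes.left) * GiniIndex(labels[!goes.left])
      if (is.null(best) || g < best$gini) {
        best <- list(attr = attr, value = v, gini = g, goes.left = goes.left)
      }
    }
  }
  if (is.null(best)) {
    return(list(leaf = TRUE, class = names(which.max(table(labels)))))
  }

  # Split the data and grow the two children recursively
  list(leaf  = FALSE, attr = best$attr, value = best$value,
       left  = GrowTree(data[best$goes.left, , drop = FALSE],  class.col, min.size),
       right = GrowTree(data[!best$goes.left, , drop = FALSE], class.col, min.size))
}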

The simplified pruning step then applies cost-complexity (weakest-link) pruning: starting from the maximum tree, it repeatedly collapses the internal node t with the smallest value of

g(t) = ( R(t) - R(T_t) ) / ( |leaves(T_t)| - 1 )

where R(t) is the training error of node t treated as a leaf, R(T_t) is the training error of the subtree rooted at t, and |leaves(T_t)| is the number of leaves of that subtree. This produces a nested sequence of candidate subtrees, from which the final tree is selected, typically by cross-validated error.

The R implementation

Please look up the R code file ch_02_cart.R in the code bundle for the previously mentioned algorithms. One example of applying the CART algorithm is given in the following section.

Web key resource page judgment

Web key resource page judgment arises from the domain of web information retrieval and web search engines; the original concept comes from the authority value, the hub value, and the HITS algorithm. When querying IR systems or search engines, finding important and relevant information within an overwhelmingly growing volume of information is a challenging task. Better judgment leads to smaller index storage and more informative query results.

A key resource page is a high-quality web page that carries much more information on a selected topic than an ordinary web page on the same topic. To measure the importance of a certain web page, feature selection is the first design requirement.

The link-based features used in current search technologies cannot resolve such issues at an acceptable accuracy rate. To improve the accuracy rate, more global information across many data instances can be adopted in addition to the attributes or features related to a single data instance, that is, the local attributes.

Experimental results show that a key web page should contain in-site out-links with anchor text to other pages. Non-content attributes, such as attributes related to web page links and the content structure of pages, can be applied to judge key resource pages. The possible attributes are listed as follows:

  • In-degree or in-links: This denotes the number of links pointing to the page. Observation shows that key pages tend to have a higher number of in-links, that is, more links from other sites to that page, which to a certain extent means more recommendations.
  • URL length or the depth of a page's URL: Four types of URL are defined: root, subroot, path, and filename, mapping to four length levels, 1 to 4 respectively. The lower the level (the shorter the URL), the higher the prior probability that the page is a key resource page.
  • In-site out-link anchor text rate: This refers to the ratio of the length of the anchor text to the length of the document or page content.
  • In-site out-link number: This refers to the number of in-site out-links embedded in a page.
  • Document length (in words): This is computed after filtering out predefined non-usable characters from the document. This attribute can help predict the relevance of a page because its distribution is non-uniform.

With the attributes just mentioned, the uniform sampling problem can be bypassed to a certain extent. A dataset built from these attributes can readily be used by decision tree induction algorithms such as CART, as sketched below.
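A minimal sketch of this step with the rpart package, which implements CART-style trees, is given below; the feature columns and the key/ordinary labels are hypothetical, randomly generated values:

library(rpart)

set.seed(7)
n <- 400
pages <- data.frame(
  in.degree        = rpois(n, 15),
  url.depth        = sample(1:4, n, replace = TRUE),
  anchor.text.rate = runif(n),
  out.link.count   = rpois(n, 25),
  doc.length       = rpois(n, 800),
  key.page         = factor(sample(c("key", "ordinary"), n, replace = TRUE,
                                   prob = c(0.3, 0.7)))
)

# Grow a CART-style classification tree and prune it back with the complexity
# parameter that minimizes the cross-validated error in the cp table
key.tree <- rpart(key.page ~ ., data = pages, method = "class")
best.cp  <- key.tree$cptable[which.min(key.tree$cptable[, "xerror"]), "CP"]
pruned   <- prune(key.tree, cp = best.cp)

predict(pruned, pages[1:5, ], type = "class")   # judge a few pages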

Trojan traffic identification method and Bayes classification

Bayes classification is a probabilistic classification algorithm based on Bayes' theorem. It predicts, for each instance, the class that maximizes the posterior probability. The main risk of Bayes classification is that it needs enough data to estimate the joint probability density reliably.

Given a dataset D of size n, each instance or point x belonging to D has dimension m and carries a class label drawn from the set of classes {c_1, ..., c_k}. To predict the class ŷ of any x, we use the following formula:

ŷ = argmax_{c_i} P(c_i | x)

Based on Bayes' theorem, the posterior P(c_i | x) can be written in terms of the likelihood P(x | c_i) and the prior P(c_i):

P(c_i | x) = P(x | c_i) * P(c_i) / P(x)

Since P(x) is the same for every class, we get the following equation for predicting ŷ for x:

ŷ = argmax_{c_i} P(x | c_i) * P(c_i)

Estimating

With these definitions, to predict a class, the prior probability and the likelihood need to be estimated.

Prior probability estimation

Given the dataset D, if the number of instances in D labeled with class c_i is n_i and the size of D is n, the prior probability of class c_i is estimated as follows:

P̂(c_i) = n_i / n
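In R, this estimate is simply the relative class frequency. A toy sketch (the data frame and column name are hypothetical):

# Toy labeled training data; in practice the labels come from the dataset D
train <- data.frame(class = factor(c("trojan", "normal", "normal",
                                     "trojan", "normal")))

# Estimated prior probability of each class: n_i / n
prior.p <- table(train$class) / nrow(train)
prior.p
#> normal trojan
#>    0.6    0.4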

Likelihood estimation

For numeric attributes (assuming all attributes are numeric), the estimation works as follows. One presumption is declared: within each class c_i, the points are normally distributed around some mean μ_i with a corresponding covariance matrix Σ_i, so the likelihood is the multivariate normal density:

P(x | c_i) = f_i(x) = (2π)^(-m/2) * |Σ_i|^(-1/2) * exp( -(x - μ_i)ᵀ Σ_i⁻¹ (x - μ_i) / 2 )

The sample mean and sample covariance matrix of the set D_i of instances labeled with class c_i (of size n_i) are used to estimate μ_i and Σ_i:

μ̂_i = (1 / n_i) * sum_{x_j in D_i} x_j

Σ̂_i = (1 / n_i) * sum_{x_j in D_i} (x_j - μ̂_i)(x_j - μ̂_i)ᵀ
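A base R sketch of these estimates for a single class, assuming its instances are stored as the rows of a matrix X.i (a hypothetical name), followed by the Gaussian likelihood of a new point:

# Instances of one class: one row per instance, one column per numeric attribute
X.i <- matrix(c(1.0, 2.1,
                0.9, 1.8,
                1.2, 2.4), ncol = 2, byrow = TRUE)

mu.i    <- colMeans(X.i)                         # estimated class mean
n.i     <- nrow(X.i)
sigma.i <- crossprod(sweep(X.i, 2, mu.i)) / n.i  # ML covariance (divides by n_i, unlike cov())

# Gaussian likelihood of a new point x under this class
x   <- c(1.1, 2.0)
m   <- length(x)
lik <- (2 * pi)^(-m / 2) * det(sigma.i)^(-1 / 2) *
  exp(-0.5 * t(x - mu.i) %*% solve(sigma.i) %*% (x - mu.i))
lik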

Categorical attributes can be handled in a similar way, with the minor difference that the likelihood is estimated from the observed frequencies of the attribute values within each class rather than from a fitted normal density.

The Bayes classification

The Bayes classification algorithm has two phases: a training phase that estimates the prior probability and the mean and covariance matrix of every class from the training data, and a testing phase that assigns a new point to the class with the maximum posterior probability.

The R implementation

The R code for the Bayes classification is listed as follows:

BayesClassifier <- function(data, classes) {
    # Training phase: split the data by class and estimate the model parameters;
    # helper functions such as SplitData and GetPriorProbability are assumed to
    # be defined elsewhere
    data.subsets <- SplitData(data, classes)
    cards <- GetCardinality(data.subsets)
    prior.p <- GetPriorProbability(cards)
    means <- GetMeans(data.subsets, cards)
    cov.m <- GetCovarianceMatrix(data.subsets, cards, means)

    # R functions cannot modify their arguments in place, so the estimates are
    # collected into a list that represents the trained model
    bayes.model <- list(
        cardinality = cards,
        prior       = prior.p,
        means       = means,
        covariance  = cov.m
    )
    return(bayes.model)
}

TestClassifier <- function(x) {
    # Testing phase: train the model and label x with the class that has the
    # maximum posterior probability
    data <- GetTrainingData()
    classes <- GetClasses()
    bayes.model <- BayesClassifier(data, classes)

    y <- GetLabelForMaxPostProbability(bayes.model, x)
    return(y)
}

One example of applying the Bayes classification algorithm is given in the following section.

Trojan traffic identification method

A Trojan horse is a malicious program that surreptitiously performs its operation under the guise of a legitimate program. It has a specific pattern and unique malicious behaviors (such as its traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes on dynamic ports, impersonate software, and redirect traffic of affected services to other systems, making them available to attackers who can hijack connections, intercept valuable data, and inject fake information or phishing content.

Depending on their purpose, Trojans come in many versatile designs, each with a characteristic traffic behavior. With the ability to identify Trojan traffic, further processing can be performed to protect information. As a result, detecting Trojan traffic is one of the main tasks in detecting Trojans on a system. The behavior of Trojans is an outlier compared to that of normal software, so classification algorithms such as the Bayesian classification algorithm can be applied to detect these outliers. Here is a diagram showing the Trojan traffic behavior:

(Diagram: Trojan traffic behavior)

The malicious traffic behaviors include, but are not limited to, spoofing source IP addresses and short- and long-term scanning of address/port flows that serves as reconnaissance for subsequent attacks. Known Trojan traffic behaviors are used as the positive training instances, and normal traffic behaviors are used as the negative instances in the training dataset. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, the ARP cache, the intranet router table, socket connections, the process image, system port behavior, updates to opened files, updates to remote files, shell history, packet TCP/IP header information, the identification field (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, source port, target IP, target port, number of flows, number of packets, number of bytes, the timestamp at certain checkpoints, and the class label for the type of detection. DNS traffic also plays an important role in Trojan detection, since Trojan traffic has a certain relation with DNS traffic.
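As a minimal sketch of this application, the following example builds a toy flow-level dataset with hypothetical columns and uses naiveBayes from the e1071 package as an off-the-shelf Bayesian classifier; note that naiveBayes assumes conditional independence of the attributes, a simplification of the full-covariance model described above:

library(e1071)

# Hypothetical flow records: traffic statistics plus a trojan/normal label
set.seed(42)
n <- 300
flows <- data.frame(
  n.flows   = rpois(n, 10),
  n.packets = rpois(n, 200),
  n.bytes   = rpois(n, 5000),
  dst.port  = sample(c(80, 443, 4444, 31337), n, replace = TRUE),
  label     = factor(sample(c("trojan", "normal"), n, replace = TRUE,
                            prob = c(0.2, 0.8)))
)

# Train the Bayes classifier on the labeled traffic records
traffic.model <- naiveBayes(label ~ ., data = flows)

# Classify new flow records by their maximum posterior probability
predict(traffic.model, flows[1:5, ])
predict(traffic.model, flows[1:5, ], type = "raw")   # posterior probabilities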

(Diagram: a possible structure of the Bayes-based Trojan traffic classification system)

The traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This has led to the introduction of mining technologies for the classification of Trojan traffic, and the Bayesian classifier is one of the better solutions among them. The preceding diagram shows one possible structure.

Estimating

With new definitions to predict a class, the prior probability and its likelihood needs to be estimated.

Prior probability estimation

Given the dataset D, if the number of instances in D labeled with class Prior probability estimation is Prior probability estimation and the size of D is n, we get the estimation for the prior probability of the class Prior probability estimation as follows:

Prior probability estimation

Likelihood estimation

For numeric attributes, assuming all attributes are numeric, here is the estimation equation. One presumption is declared: each class Likelihood estimation is normally distributed around some mean Likelihood estimation with the corresponding covariance matrix Likelihood estimation. Likelihood estimation is used to estimate Likelihood estimation, Likelihood estimation for Likelihood estimation:

Likelihood estimation
Likelihood estimation
Likelihood estimation
Likelihood estimation

For categorical attributes, it can also be dealt with similarly but with minor difference.

The Bayes classification

The pseudocode of the Bayes classification algorithm is as follows:

The Bayes classification

The R implementation

The R code for the Bayes classification is listed as follows:

  1 BayesClassifier <- function(data,classes){
  2     bayes.model <- NULL
  3 
  4     data.subsets <- SplitData(data,classes)
  5     cards <- GetCardinality(data.subsets)
  6     prior.p <- GetPriorProbability(cards)
  7     means <- GetMeans(data.subsets,cards)
  8    cov.m <-GetCovarianceMatrix(data.subsets,cards,means)
  9 
 10     AddCardinality(bayes.model,cards)
 11     AddPriorProbability(bayes.model,prior.p)
 12     AddMeans(bayes.model,means)
 13     AddCovarianceMatrix(bayes.model,cov.m)
 14 
 15     return(bayes.model)
 16 }
 17 
 18 TestClassifier <- function(x){
 19     data <- GetTrainingData()
 20     classes <- GetClasses()
 21     bayes.model <- BayesClassifier(data,classes)
 22 
 23     y <- GetLabelForMaxPostProbability(bayes.model,x)
 24 
 25     return(y)
 26 }

One example is chosen to apply the Bayes classification algorithm, in the following section.

Trojan traffic identification method

A Trojan horse, which is a malicious program, surreptitiously performs its operation under the guise of a legitimate program. It has a specific pattern and unique malicious behavior (such as traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes for dynamic ports, impersonate software and redirect traffic of affected services to other systems, make them available to attackers to hijack connections, intercept valuable data, and inject fake information or phishing.

Depending on the purpose of Trojans, there are many versatile types of designs for Trojans, each with a certain traffic behavior. With the ability to identify the Trojan traffic, further processing can be performed to protect information. As a result, detecting the traffic of Trojans is one of the main tasks to detect Trojans on system. The behavior of Trojans is an outlier compared to the normal software. So the classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:

Trojan traffic identification method

The malicious traffic behaviors include but are not limited to spoofing the source IP addresses and (short and long) term scanning the flow of the address/port that serves as the survey for successive attacks. Known Trojan traffic behaviors are used as the positive training data instances. The normal traffic behaviors are used as the negative data instances in the training dataset. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, ARP cache, intranet router table, socket connections, process image, system ports behavior, opened files updates, remote files updates, shell history, packet TCP/IP headers information, identification fields (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at certain checkpoint, and the class label for the type of detection. The DNS traffic plays an important role in the Trojans' detection too; the traffics of Trojans has certain a relation with DNS traffic.

Trojan traffic identification method

The traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This led to the introduction of mining technologies for the classification of Trojan traffic. The Bayesian classifier is one of the better solutions among others. The preceding diagram is one such possible structure.

Prior probability estimation

Given the dataset D, if the number of instances in D labeled with class Prior probability estimation is Prior probability estimation and the size of D is n, we get the estimation for the prior probability of the class Prior probability estimation as follows:

Prior probability estimation

Likelihood estimation

For numeric attributes, assuming all attributes are numeric, here is the estimation equation. One presumption is declared: each class Likelihood estimation is normally distributed around some mean Likelihood estimation with the corresponding covariance matrix Likelihood estimation. Likelihood estimation is used to estimate Likelihood estimation, Likelihood estimation for Likelihood estimation:

Likelihood estimation
Likelihood estimation
Likelihood estimation
Likelihood estimation

For categorical attributes, it can also be dealt with similarly but with minor difference.

The Bayes classification

The pseudocode of the Bayes classification algorithm is as follows:

The Bayes classification
The R implementation

The R code for the Bayes classification is listed as follows:

  1 BayesClassifier <- function(data,classes){
  2     bayes.model <- NULL
  3 
  4     data.subsets <- SplitData(data,classes)
  5     cards <- GetCardinality(data.subsets)
  6     prior.p <- GetPriorProbability(cards)
  7     means <- GetMeans(data.subsets,cards)
  8    cov.m <-GetCovarianceMatrix(data.subsets,cards,means)
  9 
 10     AddCardinality(bayes.model,cards)
 11     AddPriorProbability(bayes.model,prior.p)
 12     AddMeans(bayes.model,means)
 13     AddCovarianceMatrix(bayes.model,cov.m)
 14 
 15     return(bayes.model)
 16 }
 17 
 18 TestClassifier <- function(x){
 19     data <- GetTrainingData()
 20     classes <- GetClasses()
 21     bayes.model <- BayesClassifier(data,classes)
 22 
 23     y <- GetLabelForMaxPostProbability(bayes.model,x)
 24 
 25     return(y)
 26 }

One example is chosen to apply the Bayes classification algorithm, in the following section.

Trojan traffic identification method

A Trojan horse, which is a malicious program, surreptitiously performs its operation under the guise of a legitimate program. It has a specific pattern and unique malicious behavior (such as traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes for dynamic ports, impersonate software and redirect traffic of affected services to other systems, make them available to attackers to hijack connections, intercept valuable data, and inject fake information or phishing.

Depending on the purpose of Trojans, there are many versatile types of designs for Trojans, each with a certain traffic behavior. With the ability to identify the Trojan traffic, further processing can be performed to protect information. As a result, detecting the traffic of Trojans is one of the main tasks to detect Trojans on system. The behavior of Trojans is an outlier compared to the normal software. So the classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:

Trojan traffic identification method

The malicious traffic behaviors include but are not limited to spoofing the source IP addresses and (short and long) term scanning the flow of the address/port that serves as the survey for successive attacks. Known Trojan traffic behaviors are used as the positive training data instances. The normal traffic behaviors are used as the negative data instances in the training dataset. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, ARP cache, intranet router table, socket connections, process image, system ports behavior, opened files updates, remote files updates, shell history, packet TCP/IP headers information, identification fields (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at certain checkpoint, and the class label for the type of detection. The DNS traffic plays an important role in the Trojans' detection too; the traffics of Trojans has certain a relation with DNS traffic.

Trojan traffic identification method

The traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This led to the introduction of mining technologies for the classification of Trojan traffic. The Bayesian classifier is one of the better solutions among others. The preceding diagram is one such possible structure.

Likelihood estimation

For numeric attributes, assuming all attributes are numeric, here is the estimation equation. One presumption is declared: each class Likelihood estimation is normally distributed around some mean Likelihood estimation with the corresponding covariance matrix Likelihood estimation. Likelihood estimation is used to estimate Likelihood estimation, Likelihood estimation for Likelihood estimation:

Likelihood estimation
Likelihood estimation
Likelihood estimation
Likelihood estimation

For categorical attributes, it can also be dealt with similarly but with minor difference.

The Bayes classification

The pseudocode of the Bayes classification algorithm is as follows:

The Bayes classification
The R implementation

The R code for the Bayes classification is listed as follows:

  1 BayesClassifier <- function(data,classes){
  2     bayes.model <- NULL
  3 
  4     data.subsets <- SplitData(data,classes)
  5     cards <- GetCardinality(data.subsets)
  6     prior.p <- GetPriorProbability(cards)
  7     means <- GetMeans(data.subsets,cards)
  8    cov.m <-GetCovarianceMatrix(data.subsets,cards,means)
  9 
 10     AddCardinality(bayes.model,cards)
 11     AddPriorProbability(bayes.model,prior.p)
 12     AddMeans(bayes.model,means)
 13     AddCovarianceMatrix(bayes.model,cov.m)
 14 
 15     return(bayes.model)
 16 }
 17 
 18 TestClassifier <- function(x){
 19     data <- GetTrainingData()
 20     classes <- GetClasses()
 21     bayes.model <- BayesClassifier(data,classes)
 22 
 23     y <- GetLabelForMaxPostProbability(bayes.model,x)
 24 
 25     return(y)
 26 }

One example is chosen to apply the Bayes classification algorithm, in the following section.

Trojan traffic identification method

A Trojan horse, which is a malicious program, surreptitiously performs its operation under the guise of a legitimate program. It has a specific pattern and unique malicious behavior (such as traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes for dynamic ports, impersonate software and redirect traffic of affected services to other systems, make them available to attackers to hijack connections, intercept valuable data, and inject fake information or phishing.

Depending on the purpose of Trojans, there are many versatile types of designs for Trojans, each with a certain traffic behavior. With the ability to identify the Trojan traffic, further processing can be performed to protect information. As a result, detecting the traffic of Trojans is one of the main tasks to detect Trojans on system. The behavior of Trojans is an outlier compared to the normal software. So the classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:

Trojan traffic identification method

The malicious traffic behaviors include but are not limited to spoofing the source IP addresses and (short and long) term scanning the flow of the address/port that serves as the survey for successive attacks. Known Trojan traffic behaviors are used as the positive training data instances. The normal traffic behaviors are used as the negative data instances in the training dataset. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, ARP cache, intranet router table, socket connections, process image, system ports behavior, opened files updates, remote files updates, shell history, packet TCP/IP headers information, identification fields (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at certain checkpoint, and the class label for the type of detection. The DNS traffic plays an important role in the Trojans' detection too; the traffics of Trojans has certain a relation with DNS traffic.

[Figure: a possible structure for Trojan traffic identification]

Traditional technologies for detecting Trojans often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This has led to the introduction of mining technologies for the classification of Trojan traffic, among which the Bayesian classifier is one of the better solutions. The preceding diagram shows one possible structure.
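As a small illustration of how such a dataset could feed the classifier above (the attribute names and values are hypothetical), the prior probabilities and per-class means used by the skeleton can be computed directly in base R:

# Hypothetical flow records: counts of flows/packets/bytes plus a class label
traffic <- data.frame(
  num.flows   = c(2, 3, 40, 35, 4, 50),
  num.packets = c(20, 35, 900, 700, 30, 1200),
  num.bytes   = c(1500, 2500, 90000, 70000, 2000, 150000),
  label       = c("normal", "normal", "trojan", "trojan", "normal", "trojan")
)

# Prior probability of each class
prior.p <- prop.table(table(traffic$label))

# Per-class mean of every numeric attribute
means <- aggregate(cbind(num.flows, num.packets, num.bytes) ~ label,
                   data = traffic, FUN = mean)
prior.p
means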

Identify spam e-mail and Naïve Bayes classification

The Naïve Bayes classification assumes that all attributes are conditionally independent given the class; this simplifies the Bayes classification because it no longer needs to model the correlations between attributes. The likelihood can then be defined with the following equation:

P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

Some of the characteristics of the Naïve Bayes classification are as follows:

  • Robust to isolated noise
  • Robust to irrelevant attributes
  • Its performance might degrade due to correlated attributes in the input dataset
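To make the conditional independence assumption concrete, here is a minimal hand-computed sketch (with hypothetical attributes x1 and x2) of the posterior implied by the product of per-attribute likelihoods:

# Toy training data: two categorical attributes and a class label
d <- data.frame(
  x1    = c("a", "a", "b", "b", "a", "b"),
  x2    = c("y", "n", "y", "y", "n", "n"),
  class = c("pos", "pos", "pos", "neg", "neg", "neg")
)

prior <- prop.table(table(d$class))                    # P(Ci)
p.x1  <- prop.table(table(d$class, d$x1), margin = 1)  # P(x1 | Ci)
p.x2  <- prop.table(table(d$class, d$x2), margin = 1)  # P(x2 | Ci)

# Unnormalized posterior for a new instance (x1 = "a", x2 = "y")
post <- prior * p.x1[, "a"] * p.x2[, "y"]
post / sum(post)                                       # normalized posterior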

The Naïve Bayes classification

The pseudocode of the Naïve Bayes classification algorithm, with minor differences from the Bayes classification algorithm, is as follows:

[Figure: pseudocode of the Naïve Bayes classification algorithm]

The R implementation

The R code for the Naïve Bayes classification is listed as follows:

NaiveBayesClassifier <- function(data, classes) {
    naive.bayes.model <- NULL

    # Split the training data by class and collect the per-class statistics;
    # with the independence assumption, only per-attribute variances are needed
    # (the helper functions are abstract placeholders)
    data.subsets <- SplitData(data, classes)
    cards <- GetCardinality(data.subsets)
    prior.p <- GetPriorProbability(cards)
    means <- GetMeans(data.subsets, cards)
    variances.m <- GetVariancesMatrix(data.subsets, cards, means)

    AddCardinality(naive.bayes.model, cards)
    AddPriorProbability(naive.bayes.model, prior.p)
    AddMeans(naive.bayes.model, means)
    AddVariancesMatrix(naive.bayes.model, variances.m)

    return(naive.bayes.model)
}

TestClassifier <- function(x) {
    data <- GetTrainingData()
    classes <- GetClasses()
    naive.bayes.model <- NaiveBayesClassifier(data, classes)

    # Label the instance x with the class of maximum posterior probability
    y <- GetLabelForMaxPostProbability(naive.bayes.model, x)

    return(y)
}

An example application of the Naïve Bayes classification algorithm is given in the following section.

Identify spam e-mail

E-mail spam is one of the major issues on the Internet. It refers to irrelevant, inappropriate, and unsolicited e-mails sent to uninterested recipients, for purposes such as advertisement and promotion, spreading malware, and so on.

Note

Unsolicited, unwanted e-mail that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient is the formal definition of e-mail spam from the spam track at the Text REtrieval Conference (TREC).

The increase in e-mail users, business e-mail campaigns, and suspicious usage of e-mail has resulted in a massive volume of spam e-mails, which in turn necessitates high-efficiency solutions to detect e-mail spam:

[Figure: identifying spam e-mail]

E-mail spam filters are automated tools that recognize spam and prevent its further delivery. The classifier serves as the spam detector here. One solution is to combine the outputs of several e-mail spam classifiers to achieve improved classification effectiveness and robustness.

Spam e-mail can be judged from its content, title, and so on. As a result, attributes of the e-mails, such as the subject, content, sender address, IP address, time-related attributes, in-count/out-count, and average communication interaction, can be selected into the attribute set of each data instance in the dataset. Example attributes include the occurrence of HTML form tags, IP-based URLs, the age of linked-to domains, nonmatching URLs, HTML e-mail, the number of links in the e-mail body, and so on. The candidate attributes include both discrete and continuous types.

The training dataset for the Naïve Bayes classifier will be composed of the labeled spam e-mails and legitimate e-mails.
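As a hedged, minimal sketch of this setup (the feature names and values below are hypothetical, and the e1071 package is assumed to be installed), a Naïve Bayes spam detector can be trained on a toy labeled dataset:

library(e1071)  # provides naiveBayes()

# Toy labeled e-mails: a few of the content-based attributes discussed above
emails <- data.frame(
  has.html.form = factor(c("yes", "no",  "yes", "no",  "no",  "yes")),
  ip.based.url  = factor(c("yes", "no",  "yes", "no",  "no",  "no")),
  many.links    = factor(c("yes", "no",  "no",  "no",  "yes", "yes")),
  label         = factor(c("spam", "ham", "spam", "ham", "ham", "spam"))
)

spam.model <- naiveBayes(label ~ ., data = emails)

# Classify a new, unseen e-mail
new.email <- data.frame(
  has.html.form = factor("yes", levels = c("no", "yes")),
  ip.based.url  = factor("no",  levels = c("no", "yes")),
  many.links    = factor("yes", levels = c("no", "yes"))
)
predict(spam.model, new.email)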

Rule-based classification of player types in computer games and rule-based classification

Compared to other classification algorithms, the model learned by a rule-based classifier is a set of IF-THEN rules. The rule set can be transformed from a decision tree or generated directly by the algorithm described below. An IF-THEN rule has the following format:

          IF condition_holds_true THEN make_a_conclusion

An alternative format is as follows:

          (condition 1) AND (condition 2) AND … AND (condition n) => conclusion

For a given instance or record in the source dataset, if the rule antecedent holds true, the rule is said to cover the instance, and the rule is satisfied.

Given a rule R, the coverage and accuracy are defined as follows:

coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers

Here, n_covers is the number of instances covered by rule R, n_correct is the number of covered instances that R classifies correctly, and |D| is the number of instances in the dataset D.
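As a small sketch of these two measures (using a hypothetical rule and toy data), coverage and accuracy can be computed directly by filtering the dataset with the rule antecedent:

# Toy dataset and a hypothetical rule: IF x > 5 THEN class = "pos"
d <- data.frame(x     = c(2, 7, 9, 4, 6, 8),
                class = c("neg", "pos", "pos", "neg", "neg", "pos"))

covered <- d[d$x > 5, ]                  # instances satisfying the antecedent

coverage <- nrow(covered) / nrow(d)      # fraction of all instances covered by R
accuracy <- mean(covered$class == "pos") # fraction of covered instances R classifies correctly
coverage; accuracy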

Transformation from decision tree to decision rules

It is convenient to transform a decision tree into a set of decision rules for further processing. A rule can be written for every path from the root to a leaf in the decision tree. The left-hand side of the rule, the rule antecedent, is built from the conjunction of the node labels and the labels of the arcs along the path, and the rule consequent is given by the leaf node. One example of extracting classification rules from a decision tree is illustrated in the following diagram:

[Figure: transformation from a decision tree to decision rules]

One important remaining question is the pruning of the resulting rule set.
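As a hedged sketch of this transformation (assuming the rpart package is available), the root-to-leaf paths of a fitted tree can be printed as rule antecedents, with the class predicted at each leaf as the consequent:

library(rpart)

# Fit a small decision tree on the built-in iris data
tree <- rpart(Species ~ ., data = iris)

# Leaf node numbers, then the split conditions along each root-to-leaf path;
# each printed path is the antecedent of one IF-THEN rule
leaves <- as.numeric(rownames(tree$frame))[tree$frame$var == "<leaf>"]
path.rpart(tree, nodes = leaves)

# The consequent of each rule is the class predicted at the corresponding leaf
levels(iris$Species)[tree$frame$yval[tree$frame$var == "<leaf>"]]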

Rule-based classification

Rules are learned sequentially, one at a time. Here is the pseudocode of the algorithm used to build a rule-based classifier. The LearnOneRule function follows a greedy strategy: its target is to cover as many as possible of the positive instances in the source dataset while covering none, or as few as possible, of the negative instances. All instances in the source dataset belonging to a specific class are defined as positive, and those belonging to other classes are considered negative. An initial rule r is generated and kept refined until the stopping condition is met.

Sequential covering algorithm

The pseudocode of the generic sequential covering algorithm is as follows. The input parameters are the dataset of class-labeled tuples and the attribute set with all of its possible values. The output is a set of IF-THEN rules:

[Figure: pseudocode of the sequential covering algorithm]

The RIPPER algorithm

Repeated Incremental Pruning to Produce Error Reduction (RIPPER) is a direct rule-based classifier. Its rule sets are relatively easy to interpret, and it is among the most practical algorithms for class-imbalanced problems.

When growing a rule, the algorithm starts from an empty rule and adds conjuncts that maximize (or improve) an information gain measure, FOIL's information gain. It stops when the rule no longer covers any negative examples. The resulting rule is immediately pruned with incremental reduced-error pruning: any final sequence of conditions is removed if doing so maximizes the pruning measure v, which is calculated as follows:

v = (p − n) / (p + n)

The sequential covering algorithm is used to build the rule set; the new description length (DL) is computed each time a new rule is added to the rule set. The rule set is then optimized.

Here, p is the number of positive examples covered by the rule and n is the number of negative examples covered by the rule; P denotes the total number of positive examples of the class, and N the total number of negative examples of the class. In terms of these quantities, the measures used while growing and pruning rules can be written as follows:

FOIL's information gain: FOIL_Gain = p × (log2(p / (p + n)) − log2(P / (P + N)))
Rule-value (pruning) metric: v = (p − n) / (p + n)
Rule accuracy on the pruning set: (p + (N − n)) / (P + N)
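As a small sketch, one common form of FOIL's information gain can be written directly in terms of p, n, P, and N:

# FOIL information gain of a rule covering p positive and n negative examples,
# relative to a class with P positive and N negative examples overall
foil.gain <- function(p, n, P, N) {
  p * (log2(p / (p + n)) - log2(P / (P + N)))
}

foil.gain(p = 30, n = 5, P = 100, N = 400)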

The pseudocode of the RIPPER algorithm is as follows:

[Figure: pseudocode of the RIPPER algorithm]

The R implementation

The R code for the rule-based classification is listed as follows:

SequentialCovering <- function(data, x, classes) {
    rule.set <- NULL

    classes.size <- GetCount(classes)
    idx <- 0
    # Learn rules for one class at a time
    while (idx < classes.size) {
        idx <- idx + 1
        one.class <- GetAt(classes, idx)
        repeat {
            # Greedily learn one rule for the current class, then remove the
            # instances it covers before learning the next rule
            # (the helper functions are abstract placeholders)
            one.rule <- LearnOneRule(data, x, one.class)
            data <- FilterData(data, one.rule)
            AddRule(rule.set, one.rule)
            if (CheckTermination(data, x, classes, rule.set)) {
                break
            }
        }
    }
    return(rule.set)
}
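The helper functions in the skeleton above are again placeholders. For a runnable alternative, the RIPPER algorithm itself is available as JRip in the RWeka package (this assumes RWeka and a working Java installation); the built-in iris data is used only for illustration:

library(RWeka)  # provides JRip(), Weka's implementation of RIPPER

rule.model <- JRip(Species ~ ., data = iris)
print(rule.model)                  # the learned IF-THEN rule set

predict(rule.model, iris[1:5, ])   # classify a few instances with the rule set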

An example application of the rule-based classification algorithm is given in the following section.

Rule-based classification of player types in computer games

Throughout a game's development and within the game context, improving the player experience is a continual task. The classification of player types is one major part of this task, and it in turn drives further improvements, including game design.

One popular temperament-based player typology is the DGD player typology, which is illustrated in the following diagram. Given this model, game players can be labeled with appropriate types, player behavior in the game can be explained, new game designs can be informed, and so forth.

[Figure: the DGD player typology]

Based on the player behaviors or models, we can train a decision tree model on the dataset and then derive the rule set from the trained decision tree. The dataset comes from the game logs together with some predefined domain knowledge.
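As a hedged sketch of this pipeline (every attribute name, value, and player label below is hypothetical), per-player features derived from game logs can be used to train a decision tree, from which rules can then be read off as in the transformation subsection above:

library(rpart)

# Hypothetical per-player features summarized from game logs
players <- data.frame(
  hours.played     = c(2, 40, 15, 60, 5, 35),
  deaths.per.hour  = c(8, 1, 3, 0.5, 10, 2),
  quests.completed = c(1, 30, 10, 50, 2, 25),
  player.type      = factor(c("casual", "hardcore", "casual",
                              "hardcore", "casual", "hardcore"))
)

# Train a decision tree on the labeled players; its root-to-leaf paths give the rules
type.tree <- rpart(player.type ~ ., data = players,
                   control = rpart.control(minsplit = 2, xval = 0))
print(type.tree)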

Time for action

Here are some practices for you to check what you've learned so far:

  • Running the R code of the ID3 algorithm step by step on a small dataset to trace the values of the important factors at each step
  • Preparing the dataset related to web logs and creating an application that detects web attacks using ID3
  • Implementing an R code to generate decision rules from a decision tree
  • What is Gain Ratio?

Summary

In this chapter, we learned the following facts:

  • Classification assigns instances to one of a set of predefined categories
  • Decision tree induction learns a decision tree from a source dataset of (instance, class label) pairs in a supervised learning setting
  • ID3 is a decision tree induction algorithm
  • C4.5 is an extension of ID3
  • CART is another decision tree induction algorithm
  • Bayes classification is a statistical classification algorithm
  • Naïve Bayes classification is a simplified version of Bayes classification that assumes the attributes are conditionally independent
  • Rule-based classification applies a rule set that can be built directly, for example with the sequential covering algorithm, or indirectly, by transforming a decision tree

In the next chapter, you will learn about more advanced classification algorithms, including Bayesian belief networks, SVM, the k-nearest neighbors algorithm, and so on.