R: Mining spatial, text, web, and social media data

By: Nathan H. Danneman, Richard Heimann, Pradeepta Mishra, Bater Makhabel

Overview of this book

Data mining is the first step to understanding data and making sense of heaps of data. Properly mined data forms the basis of all data analysis and computing performed on it. This Learning Path will take you from the very basics of data mining to advanced data mining techniques, and will end with a specialized branch of data mining: social media mining.

You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, associations, and correlations while working with R programs. You will discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on RHadoop projects.

Once you are comfortable with data mining with R, you will move on to implementing your knowledge with the help of end-to-end data mining projects. You will learn how to apply different mining concepts to various statistical and data applications in a wide range of fields. At this stage, you will be able to complete complex data mining cases and handle any issues you might encounter during projects.

After this, you will gain hands-on experience of generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially generated data while being given a theoretical background to accurately interpret your findings. You will be shown R code and examples of data that can be used as a springboard as you get the chance to undertake your own analyses of business, social, or political data.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Learning Data Mining with R by Bater Makhabel
  • R Data Mining Blueprints by Pradeepta Mishra
  • Social Media Mining with R by Nathan Danneman and Richard Heimann

Chapter 7. Outlier Detection

In this chapter, you will learn how to write R code to detect outliers in real-world cases. Generally speaking, outliers arise for various reasons, such as the dataset being contaminated with data from different classes, or errors in the measurement system.

By their very nature, outliers differ dramatically from the usual data in the original dataset. A variety of solutions have been developed to detect them, including model-based methods, proximity-based methods, density-based methods, and so on.

In this chapter, we will cover the following topics:

  • Credit card fraud detection and statistical methods
  • Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods
  • Intrusion detection and density-based methods
  • Intrusion detection and clustering-based methods
  • Monitoring the performance of the web server and classification-based methods
  • Detecting novelty in text, topic detection, and mining contextual outliers
  • Collective outliers on spatial data
  • Outlier detection in high-dimensional data

Here is a diagram illustrating a classification of outlier detection methods:

(Figure: a classification of outlier detection methods.)

The output of an outlier detection system can be categorized into two groups: one is the labeled result and the other is the scored result (or an ordered list).

Credit card fraud detection and statistical methods

One major approach to detecting outliers is the model-based, or statistical, method. An outlier is defined as an object that does not belong to the model used to represent the original dataset; in other words, the model does not generate the outlier.

There are many candidate models for a specific dataset, such as the Gaussian and Poisson distributions. If the wrong model is used to detect outliers, a normal data point may be wrongly recognized as an outlier. In addition to applying a single distribution model, a mixture of distribution models is practical too.
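
As a minimal illustration of the model-based idea (this is a sketch, not the book's code bundle), the following fits a single Gaussian to a numeric variable and flags the points whose density under the fitted model falls into the lowest 1% of densities; the cut-off quantile is an arbitrary assumption.

# Minimal model-based outlier sketch: fit a Gaussian and flag low-likelihood points.
set.seed(42)
x <- c(rnorm(200, mean = 10, sd = 2), 25, -4)   # synthetic data with two planted outliers

mu    <- mean(x)
sigma <- sd(x)
dens  <- dnorm(x, mean = mu, sd = sigma)        # likelihood of each point under the model

cutoff   <- quantile(dens, 0.01)                # illustrative threshold: lowest 1% of densities
outliers <- which(dens <= cutoff)
x[outliers]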

The log-likelihood function is used to estimate the parameters of the model:

(Equation: the log-likelihood function used for parameter estimation.)

The likelihood-based outlier detection algorithm

The summarized pseudocode of the likelihood-based outlier detection algorithm is as follows:

(Figure: pseudocode of the likelihood-based outlier detection algorithm.)
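
The following is a minimal sketch of the likelihood-based idea, not the book's implementation: the data is modeled as a mixture of a majority distribution M (assumed Gaussian here) and an anomalous distribution A (assumed uniform over the observed range), and a point is moved from M to A when doing so increases the total log-likelihood by more than a threshold c. The contamination rate lambda, the threshold c, and the choice of distributions are all illustrative assumptions.

# Likelihood-based outlier detection: a simplified sketch.
# Majority distribution M ~ Gaussian, anomaly distribution A ~ uniform (assumptions).
likelihood_outliers <- function(x, lambda = 0.05, c = 1) {
  range_x <- diff(range(x))
  total_ll <- function(M, A) {
    ll_M <- if (length(M) > 1) sum(dnorm(M, mean(M), sd(M), log = TRUE)) else 0
    ll_A <- length(A) * log(1 / range_x)          # uniform density over the data range
    length(M) * log(1 - lambda) + length(A) * log(lambda) + ll_M + ll_A
  }
  M <- x; A <- numeric(0); out <- logical(length(x))
  for (i in seq_along(x)) {
    M_new <- M[-match(x[i], M)]                    # tentatively move x[i] from M to A
    A_new <- c(A, x[i])
    if (total_ll(M_new, A_new) - total_ll(M, A) > c) {
      M <- M_new; A <- A_new; out[i] <- TRUE       # keep the move: x[i] is an outlier
    }
  }
  which(out)
}

set.seed(1)
x <- c(rnorm(100), 8, 9)                           # two planted outliers
likelihood_outliers(x)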

The R implementation

Look up the R code file, ch_07_lboutlier_detection.R, from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_07_lboutlier_detection.R")

Credit card fraud detection

Fraud denotes criminal activity that occurs in various commercial settings, such as credit card, banking, or insurance companies. For credit card fraud detection, two major applications are covered: fraudulent applications for credit cards and fraudulent usage of credit cards. Fraud represents behavior that is anomalous relative to a user's average credit card usage, that is, to that user's transaction records.

Statistically, this kind of outlier often indicates credit card theft, since the usage deviates from the cardholder's normal behavior. Some examples of outliers in this case are a high rate of purchases, very high payments, and so on.

The location of the payment, the user, and the context are possible attributes in the dataset. Clustering algorithms are possible solutions.

Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods

The two major types of proximity-based methods are distance-based and density-based outlier detection algorithms.

The NL algorithm

The summarized pseudocode of the NL (nested-loop) algorithm is as follows:

(Figure: pseudocode of the NL algorithm.)
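
The following is a minimal in-memory sketch of the nested-loop idea (the real NL algorithm processes the data block by block to respect a memory limit): a point is reported as a distance-based outlier when more than a fraction p of the dataset lies farther away than the distance threshold D. The parameter names p and D and their values are assumptions for illustration.

# Simplified nested-loop DB(p, D)-outlier detection (in-memory sketch).
nl_outliers <- function(X, p = 0.95, D = 2) {
  X <- as.matrix(X)
  n <- nrow(X)
  max_neighbours <- floor(n * (1 - p))       # a point may have at most this many D-neighbours
  flagged <- logical(n)
  for (i in 1:n) {
    count <- 0
    for (j in 1:n) {
      if (i == j) next
      if (sqrt(sum((X[i, ] - X[j, ])^2)) <= D) {
        count <- count + 1
        if (count > max_neighbours) break    # early exit: i cannot be an outlier
      }
    }
    flagged[i] <- count <= max_neighbours
  }
  which(flagged)
}

set.seed(7)
X <- rbind(matrix(rnorm(100), ncol = 2), c(8, 8))
nl_outliers(X, p = 0.95, D = 1.5)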

The FindAllOutsM algorithm

The following is the summarized pseudocode of the FindAllOutsM algorithm:

(Figure: pseudocode of the FindAllOutsM algorithm.)

The FindAllOutsD algorithm

The summarized pseudocode of the FindAllOutsD algorithm is as follows:

(Figure: pseudocode of the FindAllOutsD algorithm.)

The distance-based algorithm

The summarized pseudocode of the distance-based outlier detection algorithm is as follows, given a dataset D, the size of the input dataset n, a distance threshold r (r > 0), and a fraction parameter:

(Figure: pseudocode of the distance-based outlier detection algorithm.)

A distance-based outlier is defined as a data point, o, that satisfies the following condition:

(Equation: the distance-based outlier condition.)

Let's now look at the pseudocode for a variety of distance-based outlier detection algorithms, summarized in the following figure. The input parameters are k, n, and D, which represent the number of neighbors, the number of outliers to be identified, and the input dataset, respectively. A few supporting functions are also defined: Nearest(o, S, k) returns the k objects in S nearest to o, Maxdist(o, S) returns the maximum distance between o and the points in S, and TopOutlier(S, n) returns the top n outliers in S according to the distance to their kth nearest neighbor.

(Figure: pseudocode of the kNN-distance-based outlier detection algorithms.)
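
A minimal sketch of this scheme, not the book's implementation: each point is scored by the distance to its kth nearest neighbor, and the n highest-scoring points are reported. It plays the role of the Nearest and TopOutlier helpers in one pass over a full distance matrix, which is fine for small datasets.

# Score each point by its distance to the k-th nearest neighbour; report the top n.
knn_dist_outliers <- function(X, k = 5, n = 3) {
  d <- as.matrix(dist(X))                             # pairwise Euclidean distances
  diag(d) <- Inf                                      # ignore self-distances
  score <- apply(d, 1, function(row) sort(row)[k])    # distance to the k-th nearest neighbour
  head(order(score, decreasing = TRUE), n)            # indices of the top n outliers
}

set.seed(11)
X <- rbind(matrix(rnorm(120), ncol = 2), c(5, 5), c(-6, 4))
knn_dist_outliers(X, k = 5, n = 2)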

The Dolphin algorithm

The Dolphin algorithm is a distance-based outlier detection algorithm. Its summarized pseudocode is as follows:

(Figure: pseudocode of the Dolphin algorithm.)

The R implementation

Look up the R code file, ch_07_proximity_based.R, from the code bundle for the preceding algorithms. The code can be tested with the following command:

> source("ch_07_proximity_based.R")

Activity monitoring and the detection of mobile fraud

The purpose of outlier detection is to find patterns in the source dataset that do not conform to standard behavior. The dataset here consists of calling records, and the patterns of interest exist within those records.

Many specialized algorithms have been developed for each specific domain. Misuse of a mobile phone is termed mobile fraud. The subject under study is the calling activity, that is, the call records. The related attributes include, but are not limited to, call duration, calling city, day of the call, and the usage ratios of various services. A minimal feature-based sketch follows.
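
As an illustration (with entirely hypothetical column names and synthetic data), per-user features such as the average call duration, the number of distinct calling cities, and the call count can be assembled into a matrix and scored with a proximity-based detector, here the knn_dist_outliers() function sketched earlier.

# Hypothetical call records: one row per call.
set.seed(19)
calls <- data.frame(
  user     = sample(paste0("u", 1:50), 2000, replace = TRUE),
  duration = rexp(2000, rate = 1 / 3),                   # call duration in minutes
  city     = sample(paste0("city", 1:10), 2000, replace = TRUE)
)

# Per-user features: average duration, number of distinct cities, call count.
features <- do.call(rbind, lapply(split(calls, calls$user), function(df) {
  data.frame(avg_duration = mean(df$duration),
             n_cities     = length(unique(df$city)),
             n_calls      = nrow(df))
}))

# Score users with the kNN-distance function defined earlier (on scaled features).
suspect <- knn_dist_outliers(scale(features), k = 5, n = 3)
rownames(features)[suspect]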

Intrusion detection and density-based methods

Here, outliers are formally defined based on concepts such as LOF, LRD, and so on. Generally speaking, an outlier is a data point that deviates so much from the others that it appears not to have been generated by the same distribution function as the rest.

Given a dataset, D, a DB(x, y)-outlier, p, is defined like this:

(Equation: the DB(x, y)-outlier definition.)

The k-distance of the data point p denotes the distance between p and its kth nearest data point, o, which is a member of D:

(Equations: the k-distance definition.)

The k-distance neighborhood of the object p is defined as follows, q being a k-nearest neighbor of p:

(Equation: the k-distance neighborhood.)

The following formula gives the reachability distance of an object, p, with respect to an object, o:

(Equation: the reachability distance.)

The Local Reachability Density (LRD) of a data object, o, is defined like this:

(Equation: the Local Reachability Density.)

The Local Outlier Factor (LOF) is defined as follows; it measures the degree of outlierness:

(Equation: the Local Outlier Factor.)

A property of LOF(p) is shown in the following equations:

(Equations: a property of LOF(p).)

These equations are illustrated as follows:

(Figure: illustration of the k-distance, reachability distance, LRD, and LOF definitions.)
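
The following is a compact, self-contained LOF sketch in base R, written from the definitions above rather than taken from the book's code bundle; for simplicity it uses exactly k neighbors and ignores distance ties.

# LOF sketch: local reachability density and local outlier factor from the definitions above.
lof_scores <- function(X, k = 5) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- as.matrix(dist(X))                     # pairwise Euclidean distances
  diag(d) <- Inf                              # exclude self-distances
  knn_idx <- t(apply(d, 1, order))[, 1:k, drop = FALSE]     # k nearest neighbours of each point
  kdist   <- apply(d, 1, function(row) sort(row)[k])        # k-distance of each point
  lrd <- numeric(n)
  for (p in 1:n) {
    o <- knn_idx[p, ]
    reach <- pmax(kdist[o], d[p, o])          # reachability distance of p w.r.t. each neighbour
    lrd[p] <- 1 / mean(reach)                 # local reachability density
  }
  lof <- numeric(n)
  for (p in 1:n) {
    o <- knn_idx[p, ]
    lof[p] <- mean(lrd[o]) / lrd[p]           # average density ratio: the LOF score
  }
  lof
}

set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2), c(6, 6))    # one obvious outlier
scores <- lof_scores(X, k = 5)
order(scores, decreasing = TRUE)[1:3]                # indices of the top-scoring points

For real work, a packaged implementation such as the lof() function in the dbscan package (if installed) is usually preferable to a hand-rolled version.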

The OPTICS-OF algorithm

The input parameters for the OPTICS-OF algorithm are:

  • The input dataset
  • A first parameter of the algorithm
  • A second parameter of the algorithm

The output of the algorithm is the CBLOF value for all records.

The summarized pseudocode of the OPTICS-OF algorithm is as follows:

(Figure: pseudocode of the OPTICS-OF algorithm.)

The High Contrast Subspace algorithm

The summarized pseudocode of the High Contrast Subspace (HiCS) algorithm is as follows, where the input parameters are S, M, and a third parameter; the output is the contrast of the subspace S:

(Figure: pseudocode of the HiCS algorithm.)

The R implementation

Look up the R code file, ch_07_density_based.R, from the code bundle for the previously mentioned algorithms. The code can be tested using the following command:

> source("ch_07_density_based.R")

Intrusion detection

Any malicious activity against systems, networks, and servers can be treated as intrusion, and finding such activities is called intrusion detection.

The characteristics of situations where intrusion detection is applied include a high volume of data, a lack of labeled data (which could otherwise serve as training data for a specific solution), time-series data, and the false-alarm rate of the input dataset.

Intrusion detection systems are of two types: host-based and network-based intrusion detection systems. A popular architecture for intrusion detection based on data mining is illustrated in the following diagram:

(Figure: a data-mining-based intrusion detection architecture.)

The core algorithms applied in an outlier detection system are usually semi-supervised or unsupervised according to the characteristics of intrusion detection.

Intrusion detection and clustering-based methods

The strategy of outlier detection techniques based on clustering algorithms focuses on the relationship between data objects and clusters.

Hierarchical clustering to detect outliers

Outlier detection using the hierarchical clustering algorithm is based on the k-Nearest Neighbor graph. The input parameters include the input dataset, DATA, of size n with each data point having k variables, the distance measure function (d), a hierarchical algorithm (h), a threshold (t), and the number of clusters (nc).

(Figure: pseudocode of hierarchical clustering based outlier detection.)
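
A minimal sketch of one common variant, not the book's exact procedure: cluster the data hierarchically, cut the tree into nc clusters, and flag the points that end up in clusters whose size falls at or below a threshold t. The linkage method and parameter values are illustrative assumptions.

# Hierarchical-clustering-based outlier sketch: points in very small clusters are flagged.
hclust_outliers <- function(DATA, nc = 5, t = 3, method = "average") {
  d  <- dist(DATA)                       # distance measure function
  h  <- hclust(d, method = method)       # hierarchical clustering algorithm
  cl <- cutree(h, k = nc)                # cut the tree into nc clusters
  small <- names(which(table(cl) <= t))  # clusters with at most t members
  which(cl %in% as.integer(small))
}

set.seed(3)
DATA <- rbind(matrix(rnorm(200), ncol = 2), c(7, 7), c(7.2, 6.8))
hclust_outliers(DATA, nc = 4, t = 2)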

The k-means-based algorithm

The process of outlier detection based on the k-means algorithm is illustrated in the following diagram:

(Figure: the k-means-based outlier detection process.)

The summarized steps of outlier detection using the k-means algorithm are as follows (a minimal sketch in R follows this list):

  • Phase 1 (Data preparation):
    1. The target observations and attributes should be aligned to improve the accuracy of the result and the performance of the k-means clustering algorithm.
    2. If the original dataset has missing data, it must be handled; missing values are filled in with the maximum-likelihood estimates produced by the EM algorithm.
  • Phase 2 (The outlier detection procedure):
    1. The value of k must be determined in order to run the k-means clustering algorithm. A proper k value is chosen by referring to the Cubic Clustering Criterion.
    2. The k-means clustering algorithm runs with the chosen k value. On completion, the expert checks the external and internal outliers in the clustering results. If eliminating the detected outliers is sufficient, the procedure stops here; if the remaining groups need to be recalculated, the k-means clustering algorithm is run again without the detected outliers.
  • Phase 3 (Review and validation):
    1. The result of the previous phase is only a candidate set; by bringing in domain knowledge, we can identify the true outliers.
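
Here is a minimal sketch of the detection step in Phase 2, under the assumption that points far from their own cluster center are outlier candidates; the choice of k and the 97.5% distance quantile used as the cut-off are illustrative assumptions, not the book's values.

# k-means-based outlier candidates: points unusually far from their cluster centre.
set.seed(5)
X  <- rbind(matrix(rnorm(300), ncol = 3), c(6, 6, 6))    # synthetic data, one planted outlier
km <- kmeans(X, centers = 3, nstart = 20)

# Distance of every point to the centre of the cluster it was assigned to.
centre_dist <- sqrt(rowSums((X - km$centers[km$cluster, ])^2))

cutoff     <- quantile(centre_dist, 0.975)               # illustrative threshold
candidates <- which(centre_dist > cutoff)
candidates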

The ODIN algorithm

Outlier detection using the indegree number (the ODIN algorithm) is based on the k-Nearest Neighbor graph: a point with a low indegree in the graph is flagged as an outlier.

(Figure: pseudocode of the ODIN algorithm.)
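
A minimal sketch of the ODIN idea, not the book's implementation: build the directed kNN graph, count how many points include each point among their k nearest neighbors (the indegree), and flag points whose indegree falls at or below a threshold. The values of k and the threshold are illustrative assumptions.

# ODIN sketch: a low indegree in the kNN graph marks an outlier.
odin_outliers <- function(X, k = 5, threshold = 1) {
  d <- as.matrix(dist(X))
  diag(d) <- Inf
  knn <- t(apply(d, 1, order))[, 1:k, drop = FALSE]       # k nearest neighbours of every point
  indegree <- tabulate(as.vector(knn), nbins = nrow(X))   # how often each point is someone's neighbour
  which(indegree <= threshold)
}

set.seed(9)
X <- rbind(matrix(rnorm(160), ncol = 2), c(6, -6))
odin_outliers(X, k = 5, threshold = 0)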

The R implementation

Look up the R code file, ch_07_clustering_based.R, from the code bundle for the previously mentioned algorithms. The code can be tested with the following command:

> source("ch_07_clustering_based.R")

Monitoring the performance of the web server and classification-based methods

Classification algorithms can be used to detect outliers. The ordinary strategy is to train a one-class model on only the normal data points in the training dataset. Once the model is set up, any data point that is not accepted by the model is marked as an outlier.

The OCSVM algorithm

The OCSVM (One-Class SVM) algorithm projects the input data into a high-dimensional feature space and iteratively finds the maximum-margin hyperplane. The hyperplane, defined in a Gaussian reproducing kernel Hilbert space, best separates the training data from the origin. The solution of OCSVM can be represented by the solution of the following optimization problem, subject to the constraints shown below:

(Equations: the OCSVM optimization problem and its constraints.)
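
In practice, a one-class SVM is usually fitted with an existing library rather than by solving the optimization problem by hand. The following sketch assumes the e1071 package is installed; the nu and gamma values are illustrative, not the book's settings.

# One-class SVM sketch using the e1071 package (assumed installed).
library(e1071)

set.seed(13)
train <- matrix(rnorm(400), ncol = 2)                  # "normal" training data only
test  <- rbind(matrix(rnorm(20), ncol = 2), c(6, 6))   # mostly normal, one planted outlier

fit <- svm(train, type = "one-classification",
           kernel = "radial", nu = 0.05, gamma = 0.5)

inlier <- predict(fit, test)      # TRUE for points accepted by the one-class model
which(!inlier)                    # indices flagged as outliers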

The one-class nearest neighbor algorithm

This algorithm is based on the k-Nearest Neighbor algorithm, with a couple of additional formulas.

The local density is denoted as follows:

(Equation: the local density.)

The distance between the test object, x, and its nearest neighbor in the training set is defined like this:

(Equation: the first nearest-neighbor distance.)

The distance between this nearest neighbor and its own nearest neighbor in the training set is defined as follows:

(Equation: the second nearest-neighbor distance.)

A data object is marked as an outlier once the first of these distances becomes large relative to the second, that is, once the ratio of the two distances exceeds a threshold.
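
A minimal sketch of this rule, with the acceptance threshold of 1 as an illustrative assumption: a test point is rejected when its distance to the nearest training point exceeds the distance from that training point to its own nearest training neighbor.

# One-class nearest-neighbour rule: compare two nearest-neighbour distances.
ocnn_outlier <- function(x, train, threshold = 1) {
  d_to_train <- sqrt(colSums((t(train) - x)^2))        # distances from x to every training point
  nn1   <- which.min(d_to_train)                       # nearest training neighbour of x
  d1    <- d_to_train[nn1]
  d_nn1 <- sqrt(colSums((t(train) - train[nn1, ])^2))  # distances from that neighbour
  d_nn1[nn1] <- Inf                                    # exclude itself
  d2    <- min(d_nn1)                                  # its own nearest-neighbour distance
  (d1 / d2) > threshold                                # TRUE: x is rejected as an outlier
}

set.seed(17)
train <- matrix(rnorm(200), ncol = 2)
ocnn_outlier(c(0.1, -0.2), train)   # expected: FALSE (normal)
ocnn_outlier(c(5, 5), train)        # expected: TRUE (outlier)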

The R implementation

Look up the R code file, ch_07_classification_based.R, from the code bundle for the previously mentioned algorithms. The code can be tested with the following command:

> source("ch_07_classification_based.R")

Monitoring the performance of the web server

Web server performance measurements are very important both for the business and for operating system management. These measurements can take the form of CPU usage, network bandwidth, storage, and so on.

The dataset comes from various sources such as benchmark data, logs, and so on. The types of outliers that appear during the monitoring of the web server are point outliers, contextual outliers, and collective outliers.

Detecting novelty in text, topic detection, and mining contextual outliers

If each data instance in the training dataset is related to a specific context attribute, then a data point that deviates dramatically from its context is termed an outlier. There are many applications of this idea.
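
As a minimal illustration of a contextual outlier (this is not the CAD algorithm itself), a value can be scored against only the other observations that share its context; here the context is a categorical variable and the score is a within-group z-score, both illustrative assumptions.

# Contextual outlier sketch: a value is judged against its own context group only.
set.seed(21)
obs <- data.frame(
  context = rep(c("weekday", "weekend"), each = 100),
  value   = c(rnorm(100, mean = 50, sd = 5),     # weekday behaviour
              rnorm(100, mean = 20, sd = 5))     # weekend behaviour
)
obs$value[150] <- 55    # normal for a weekday, anomalous for a weekend

# Within-context z-scores: large magnitudes indicate contextual outliers.
z <- ave(obs$value, obs$context, FUN = function(v) (v - mean(v)) / sd(v))
which(abs(z) > 3)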

The conditional anomaly detection (CAD) algorithm

The summarized pseudocode of the CAD algorithm is as follows:

(Figure: pseudocode of the CAD algorithm.)

The following is the summarized pseudocode of the GMM-CAD-Full algorithm:

(Figure: pseudocode of the GMM-CAD-Full algorithm.)

The summarized pseudocode of the GMM-CAD-Split algorithm is as follows:

(Figure: pseudocode of the GMM-CAD-Split algorithm.)

The R implementation

Look up the R code file, ch_07_contextual_based.R, from the code bundle for the previously mentioned algorithms. The code can be tested with the following command:

> source("ch_07_contextual_based.R")

Detecting novelty in text and topic detection

One application of outlier detection is finding novel topics in a collection of documents or newspaper articles. A major case is opinion detection, that is, finding an outlying opinion among many opinions.

With the growth of social media, many events happen every day, whereas earlier collections covered only special events or ideas of interest to the related researchers or companies.

The characteristics associated with this growth in collections are varied data sources, documents in different formats, high-dimensional attributes, and sparse source data.

Collective outliers on spatial data

Given a dataset, if a collection of related data instances is anomalous with respect to the entire dataset, it is defined as a collective outlier.

The route outlier detection (ROD) algorithm

The summarized pseudocode of the ROD algorithm is as follows. The input parameters include the multidimensional attribute space, the attribute dataset (D), the distance measure function (F), the neighbor depth (ND), the spatial graph (G = (V, E)), and the confidence interval (CI):

(Figure: pseudocode of the ROD algorithm.)

The R implementation

Look up the R code file, ch_07_rod.R, from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_07_rod.R")

Characteristics of collective outliers

Collective outliers denote a collection of data that is abnormal in contrast with the input dataset. Their major characteristic is that only the collection of data appearing together constitutes an outlier; an individual data point in that collection, taken on its own, is usually not an outlier. Another characteristic of a collective outlier is that it can also be a contextual outlier.

Collective outliers may be a sequence of data, spatial data, and so on.

Outlier detection in high-dimensional data

Outlier detection in high-dimensional data has some characteristics that make it different from other outlier detection problems.

The brute-force algorithm

The summarized pseudocode of the brute-force algorithm is as follows:

(Figure: pseudocode of the brute-force algorithm.)
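
A minimal sketch of the brute-force idea, under the assumption that an outlier score (here the distance to the kth nearest neighbor) is computed in every low-dimensional subspace and each point keeps its worst score; the subspace-size limit and the choice of score are illustrative, and the cost grows combinatorially with the number of attributes.

# Brute-force subspace search: score every point in every small attribute subspace.
brute_force_scores <- function(X, k = 5, max_dim = 2) {
  X <- as.matrix(X)
  best <- rep(0, nrow(X))
  for (size in 1:max_dim) {
    for (cols in as.data.frame(combn(ncol(X), size))) {
      d <- as.matrix(dist(X[, cols, drop = FALSE]))
      diag(d) <- Inf
      score <- apply(d, 1, function(row) sort(row)[k])   # kNN distance in this subspace
      best  <- pmax(best, score)                          # keep each point's worst subspace score
    }
  }
  best
}

set.seed(23)
X <- cbind(matrix(rnorm(300), ncol = 3), runif(100))     # 100 points, 4 attributes
X[1, 2] <- 8                                              # outlying only in attribute 2
order(brute_force_scores(X), decreasing = TRUE)[1:3]

Note that the scores from different subspaces are not normalized here; a real implementation would need to make them comparable before taking the maximum.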

The HilOut algorithm

The following is the summarized pseudocode of the HilOut algorithm:

(Figure: pseudocode of the HilOut algorithm.)

The R implementation

Look up the R file, hil_out.R, from the code bundle for the HilOut algorithm.

Look up the R code file, ch_07_hilout.R, from the code bundle for the previously mentioned algorithms. The code can be tested with the following command:

> source("ch_07_hilout.R")

Time for action

Here are some practice questions for you to check your understanding of the concepts:

  • What is an outlier?
  • List as many types of outliers as possible and your categorization measures.
  • List as many areas of application of outlier detection as possible.

Summary

In this chapter, we looked at:

  • Statistical methods based on probabilistic distribution functions, where normal data points are those generated by the model and all other points are defined as outliers.
  • Proximity-based methods.
  • Density-based methods.
  • Clustering-based methods.
  • Classification-based methods.
  • Mining contextual outliers.
  • Collective outliers.
  • Outlier detection in high-dimensional data.

The next chapter will cover the major topics related to outlier detection algorithms and examples of them, building on the previous chapters, but with a major difference in our viewpoint.