In this chapter, you will learn how to write R code to detect outliers in real-world cases. Generally speaking, outliers arise for various reasons, such as a dataset being contaminated with data from a different class, or errors in the data measurement system.
By their characteristics, outliers differ dramatically from the usual data in the original dataset. Versatile solutions have been developed to detect them, including model-based (statistical) methods, proximity-based methods, density-based methods, and so on.
In this chapter, we will cover the following topics:
- Credit card fraud detection and statistical methods
- Activity monitoring—the detection of mobile phone fraud and proximity-based methods
- Intrusion detection and density-based methods
- Intrusion detection and clustering-based methods
- Monitoring the performance of the web server and classification-based methods
- Detecting novelty in text, topic detection, and mining contextual outliers
- Collective outliers on spatial data
- Outlier detection in high-dimensional data
Here is a diagram illustrating a classification of outlier detection methods:
The output of an outlier detection system can be categorized into two groups: one is the labeled result and the other is the scored result (or an ordered list).
One major solution for detecting outliers is the model-based, or statistical, method. An outlier is defined as an object that does not belong to the model used to represent the original dataset; in other words, that model does not generate the outlier.
There are many models to choose from for a specific dataset, such as the Gaussian and Poisson distributions. If the wrong model is used to detect outliers, a normal data point may wrongly be recognized as an outlier. In addition to single-distribution models, mixtures of distributions are practical too.
The log-likelihood function is used to estimate the parameters of a model:
Look up the R code file, ch_07_lboutlier_detection.R, in the code bundle for the previously mentioned algorithm. The code can be tested with the following command:
> source("ch_07_lboutlier_detection.R")
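As a minimal illustration of the statistical approach, the following sketch fits a single Gaussian model by maximum likelihood and flags points that are unlikely under it. The three-standard-deviation threshold is a common but arbitrary choice, and the function name is ours, not from the code bundle.

```r
# A minimal sketch of model-based outlier detection, assuming a
# univariate Gaussian model fitted by maximum likelihood.
detect_gaussian_outliers <- function(x, threshold = 3) {
  mu <- mean(x)      # maximum-likelihood estimate of the mean
  sigma <- sd(x)     # estimate of the standard deviation
  # A point is flagged when it is unlikely under the fitted model
  z <- abs(x - mu) / sigma
  which(z > threshold)
}

set.seed(1)
x <- c(rnorm(100), 10)        # 100 normal points, one obvious outlier
detect_gaussian_outliers(x)   # flags the appended point (index 101)
```

Note that the outlier itself inflates the estimated standard deviation; with heavily contaminated data, robust estimates (median and MAD) are a common refinement.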
Fraud denotes criminal activity in various commercial settings, such as credit card, banking, or insurance companies. Credit card fraud detection covers two major applications: fraudulent applications for credit cards and fraudulent usage of credit cards. Fraud shows up as behavior anomalous with respect to a user's average credit card usage, that is, the user's transaction records.
Statistically, this kind of outlier often indicates credit card theft, since the usage deviates from the cardholder's normal pattern. Some examples of outliers in this case are a high rate of purchases, very high payments, and so on.
The location of payment, the user, and the context are possible attributes in the dataset, and clustering algorithms are possible solutions.
The two major proximity-based approaches are distance-based and density-based outlier detection algorithms.
The distance-based algorithm
The summarized pseudocode of the distance-based outlier detection algorithm is as follows, given a dataset, D, the size of the input dataset, n, a distance threshold, r (r > 0), and a fraction threshold, π:
An outlier is defined as a data point, o, subject to this formula:
Let's now learn the pseudocode for a variety of distance-based outlier detection algorithms, which are summarized in the following list. The input parameters are k, n, and D, which represent the number of neighbors, the number of outliers to be identified, and the input dataset, respectively. A few supporting functions are also defined: Nearest(o, S, k) returns the k nearest objects in S to o, Maxdist(o, S) returns the maximum distance between o and the points of S, and TopOutlier(S, n) returns the top n outliers in S according to the distance to their kth nearest neighbor.
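Under these definitions, a data point can be scored by the distance to its kth nearest neighbor, as TopOutlier(S, n) does. The following is a minimal sketch in R, assuming Euclidean distance; the toy dataset and parameter values are illustrative only.

```r
# A hedged sketch of TopOutlier(S, n): rank points by the distance
# to their k-th nearest neighbor and return the top n indices.
top_outliers <- function(D, k, n) {
  dm <- as.matrix(dist(D))   # pairwise Euclidean distances
  diag(dm) <- Inf            # a point is not its own neighbor
  # distance from each point to its k-th nearest neighbor
  kth_dist <- apply(dm, 1, function(row) sort(row)[k])
  order(kth_dist, decreasing = TRUE)[seq_len(n)]
}

set.seed(2)
D <- rbind(matrix(rnorm(40), ncol = 2), c(6, 6))  # 20 points + 1 far point
top_outliers(D, k = 3, n = 1)  # the isolated point ranks first
```

The O(n²) distance matrix is fine for illustration; the algorithms in this section exist precisely to avoid that cost on large datasets.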
The FindAllOutsM algorithm
The following is the summarized pseudocode of the FindAllOutsM algorithm:
The Dolphin algorithm
The Dolphin algorithm is a distance-based outlier detection algorithm. Its summarized pseudocode is as follows:
The R implementation
Look up the R code file, ch_07_proximity_based.R, in the code bundle for the preceding algorithms. The code can be tested with the following command:
> source("ch_07_proximity_based.R")
Activity monitoring and the detection of mobile fraud
The purpose of outlier detection here is to find patterns in the source dataset that do not conform to standard behavior. The dataset consists of calling records, and the patterns exist in those calling records.
Many specialized algorithms have been developed for specific domains. Misuse of a mobile phone is termed mobile fraud. The subject under research is the calling activity, or call records, and the related attributes include, but are not limited to, call duration, calling city, call day, and the ratios of various services.
Here is a formal definition of outliers based on concepts such as LOF and LRD. Generally speaking, an outlier is a data point biased so far from the others that it seems not to have been generated from the same distribution function.
Given a dataset, D, a DB(x, y)-outlier, p, is defined like this:
The k-distance of the data point p denotes the distance between p and a data point, o, which is a member of D:
The k-distance neighborhood of the object p is defined as follows, q being a k-nearest neighbor of p:
The following formula gives the reachability distance of an object, p, with respect to an object, o:
The Local Reachability Density (LRD) of a data object, o, is defined like this:
The Local Outlier Factor (LOF) is defined as follows, and it measures the degree of outlierness:
A property of LOF(p) is defined as shown in the following equation:
These equations are illustrated as follows:
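As a concrete complement to these formulas, here is a minimal hand-rolled LOF sketch in R, assuming Euclidean distance and no tied k-distances (ties would enlarge the neighborhood in the full definition). In practice, a packaged implementation such as lofactor() from the DMwR package is preferable.

```r
# A minimal LOF sketch following the LRD/LOF formulas above.
lof_scores <- function(D, k) {
  dm <- as.matrix(dist(D))
  diag(dm) <- Inf                    # a point is not its own neighbor
  n <- nrow(dm)
  nn <- t(apply(dm, 1, order))[, 1:k, drop = FALSE]  # k nearest neighbors
  kdist <- sapply(1:n, function(i) dm[i, nn[i, k]])  # k-distance per point
  # Local Reachability Density: k over the summed reachability distances
  lrd <- sapply(1:n, function(i) {
    reach <- pmax(kdist[nn[i, ]], dm[i, nn[i, ]])
    k / sum(reach)
  })
  # LOF: average ratio of the neighbors' LRD to the point's own LRD
  sapply(1:n, function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

set.seed(3)
D <- rbind(matrix(rnorm(40), ncol = 2), c(5, 5))  # 20 clustered points + 1 far away
scores <- lof_scores(D, k = 5)
which.max(scores)  # the isolated point gets the largest LOF
```

Scores near 1 indicate density comparable to the neighbors; values well above 1 indicate outliers, which matches the LOF property stated above.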
The OPTICS-OF algorithm
The input parameters for the bagging algorithm are:
- The dataset
- A parameter
- Another parameter
The output of the algorithm is the value of CBLOF for all the records.
The summarized pseudocode of the OPTICS-OF algorithm is as follows:
The High Contrast Subspace algorithm
The summarized pseudocode of the High Contrast Subspace (HiCS) algorithm is as follows, where the input parameters are S, M, and a further parameter; the output is a contrast, |S|.
The R implementation
Look up the R code file, ch_07_density_based.R, in the code bundle for the previously mentioned algorithms. The code can be tested using the following command:
> source("ch_07_density_based.R")
Intrusion detection
Any malicious activity against systems, networks, or servers can be treated as an intrusion, and finding such activities is called intrusion detection.
Situations that call for intrusion detection are characterized by a high volume of data, a lack of labeled data in the dataset (which could serve as training data for some specific solutions), time series data, and the need to keep the false alarm rate low.
Intrusion detection systems are of two types: host-based and network-based. A popular architecture for intrusion detection based on data mining is illustrated in the following diagram:
Given these characteristics of intrusion detection, the core algorithms applied in an outlier detection system are usually semi-supervised or unsupervised.
The strategy of clustering-based outlier detection technologies focuses on the relation between data objects and clusters.
Hierarchical clustering to detect outliers
Outlier detection using the hierarchical clustering algorithm is based on the k-nearest neighbor graph. The input parameters include the input dataset, DATA, of size n, with each data point having k variables; the distance measure function, d; a hierarchical algorithm, h; a threshold, t; and the cluster number, nc.
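A minimal sketch of this procedure in R, assuming that after cutting the dendrogram into nc clusters, points that fall into clusters smaller than the threshold t are reported as outliers; this reading of the parameters is an assumption for illustration.

```r
# Hierarchical-clustering-based detection: points in very small
# clusters after cutting the tree are treated as outliers.
hclust_outliers <- function(DATA, d = "euclidean", h = "average",
                            nc = 3, t = 2) {
  hc <- hclust(dist(DATA, method = d), method = h)  # build the dendrogram
  labels <- cutree(hc, k = nc)                      # cut into nc clusters
  small <- names(which(table(labels) < t))          # clusters below size t
  which(labels %in% as.integer(small))
}

set.seed(4)
DATA <- rbind(matrix(rnorm(40), ncol = 2),            # cluster near (0, 0)
              matrix(rnorm(40, mean = 8), ncol = 2),  # cluster near (8, 8)
              c(30, 30))                              # a far-away singleton
hclust_outliers(DATA, nc = 3, t = 2)  # the singleton is reported
```

Any linkage supported by hclust() (single, complete, average, and so on) can be passed as h; the choice changes which points end up isolated.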
The k-means-based algorithm
The process of outlier detection based on the k-means algorithm is illustrated in the following diagram:
The summarized pseudocode of outlier detection using the k-means algorithm is as follows:
- Phase 1 (Data preparation):
- The target observations and attributes should be aligned to improve the accuracy of the result and the performance of the k-means clustering algorithm.
- If the original dataset has missing data, it must be handled; the maximum-likelihood values estimated by the EM algorithm are imputed for the missing entries.
- Phase 2 (The outlier detection procedure):
- The value of k should be determined before running the k-means clustering algorithm. A proper value of k can be chosen by referring to the Cubic Clustering Criterion.
- The k-means clustering algorithm runs with the chosen value of k. On completion, the expert checks the external and internal outliers in the clustering results. If eliminating the outliers from the other groups is meaningful, he/she stops the procedure here. If the other groups need to be recalculated, he/she runs the k-means clustering algorithm again without the detected outliers.
- Phase 3 (Review and validation):
- The result of the previous phase is only a set of candidates. By applying domain knowledge, the true outliers can be identified.
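The detection step of Phase 2 can be sketched as follows, under the simplifying assumption that outlierness is scored by the distance of each point to its assigned cluster centroid; choosing k via the Cubic Clustering Criterion and the expert review of Phase 3 are left out.

```r
# k-means-based detection: score each point by its distance to the
# centroid of its own cluster and report the n_out farthest points.
kmeans_outliers <- function(D, k, n_out) {
  km <- kmeans(D, centers = k, nstart = 10)
  # distance of every point to the centroid of its assigned cluster
  d2c <- sqrt(rowSums((D - km$centers[km$cluster, ])^2))
  order(d2c, decreasing = TRUE)[seq_len(n_out)]
}

set.seed(5)
D <- rbind(matrix(rnorm(40), ncol = 2),             # cluster near (0, 0)
           matrix(rnorm(40, mean = 10), ncol = 2),  # cluster near (10, 10)
           c(5, 20))                                # far from both clusters
kmeans_outliers(D, k = 2, n_out = 1)  # the far point is the top candidate
```

Rerunning k-means without the detected candidates, as the pseudocode suggests, stabilizes the centroids and can surface further outliers.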
Classification algorithms can also be used to detect outliers. The ordinary strategy is to train a one-class model on only the normal data points in the training dataset. Once the model is set up, any data point that is not accepted by the model is marked as an outlier.
The OCSVM algorithm
The OCSVM (One-Class SVM) algorithm projects input data into a high-dimensional feature space. Along with this process, it iteratively finds the maximum-margin hyperplane, defined in a Gaussian reproducing kernel Hilbert space, that best separates the training data from the origin. The solution of OCSVM can be represented by the solution of the following optimization problem, subject to its constraints:
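A minimal sketch of one-class classification in R using the svm() function from the e1071 package (an interface to libsvm), assuming that package is installed; the nu value and the toy data are illustrative assumptions.

```r
# One-class SVM with a Gaussian (RBF) kernel; nu upper-bounds the
# fraction of training points treated as outliers.
library(e1071)

set.seed(6)
train <- matrix(rnorm(200), ncol = 2)  # normal data only
model <- svm(train, type = "one-classification",
             kernel = "radial", nu = 0.05)

newdata <- rbind(c(0, 0),   # inside the training distribution
                 c(6, 6))   # far outside it
predict(model, newdata)     # TRUE = accepted as normal, FALSE = outlier
```

Because the model is trained on normal data only, no labeled outliers are needed, which is exactly the one-class strategy described above.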
The one-class nearest neighbor algorithm
This algorithm is based on the k-nearest neighbor algorithm, with a couple of additional formulas.
The local density is denoted as follows:
The distance between the test object, x, and its nearest neighbor in the training set is defined like this:
The distance between this nearest neighbor and its own nearest neighbor in the training set is defined as follows:
A data object is marked as an outlier once the first of these distances sufficiently exceeds the second.
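The rule above can be sketched as follows in R; comparing the two distances as a ratio with a threshold of 1 is an illustrative assumption, not the book's exact (elided) formula.

```r
# One-class nearest neighbor rule: compare the distance from a test
# object to its nearest training object with that training object's
# own nearest-neighbor distance.
ocnn_outlier <- function(train, x, threshold = 1) {
  d_to_train <- sqrt(rowSums((train - matrix(x, nrow(train), 2,
                                             byrow = TRUE))^2))
  j <- which.min(d_to_train)            # nearest training object
  dm <- as.matrix(dist(train))
  diag(dm) <- Inf
  ratio <- d_to_train[j] / min(dm[j, ]) # the two distances compared
  ratio > threshold
}

set.seed(7)
train <- matrix(rnorm(100), ncol = 2)
ocnn_outlier(train, train[1, ])  # FALSE: coincides with a training object
ocnn_outlier(train, c(8, 8))     # TRUE: far from every training object
```

The appeal of this rule is that the acceptance region adapts to the local density of the training data rather than using a single global radius.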
The R implementation
Look up the R code file, ch_07_classification_based.R, in the code bundle for the previously mentioned algorithms. The code can be tested with the following command:
> source("ch_07_classification_based.R")
Monitoring the performance of the web server
Web server performance measurements are very important to the business and to operating system management. These measurements can take the form of CPU usage, network bandwidth, storage, and so on.
The dataset comes from various sources, such as benchmark data and logs. The types of outliers that appear while monitoring the web server are point outliers, contextual outliers, and collective outliers.
If each data instance in the training dataset is related to a specific context attribute, then a dramatic deviation of a data point from its context is termed a contextual outlier. There are many applications of this assumption.
The conditional anomaly detection (CAD) algorithm
The summarized pseudocode of the CAD algorithm is as follows:
The summarized pseudocode of the GMM-CAD-Full algorithm is as follows:
The summarized pseudocode of the GMM-CAD-Split algorithm is as follows:
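As a much-simplified, hypothetical illustration of the conditional idea behind CAD (not the full algorithm, which fits Gaussian mixtures by EM), the following sketch uses kmeans as a crude stand-in for the mixture over the environmental attribute, and scores the indicator attribute by its Gaussian log-likelihood within each environmental component; all data and parameters are invented for illustration:

```r
## Hypothetical data: env is the environmental (context) attribute,
## ind is the indicator (behavioral) attribute
set.seed(2)
env <- c(rnorm(100, 0), rnorm(100, 10))
ind <- c(rnorm(100, 5), rnorm(100, 50))
ind[1] <- 48   # normal globally, but anomalous given its environment

## kmeans stands in for the EM-fitted mixture of the real algorithm
km    <- kmeans(env, centers = 2)
score <- numeric(length(ind))
for (k in 1:2) {
  idx <- km$cluster == k
  ## Gaussian log-likelihood of the indicator within this component
  score[idx] <- dnorm(ind[idx], mean(ind[idx]), sd(ind[idx]), log = TRUE)
}
order(score)[1:3]   # lowest conditional log-likelihood = most anomalous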
The R implementation
Look up the file of R codes, ch_07_contextual_based.R
, from the bundle of R codes for the previously mentioned algorithms. The codes can be tested with the following command:
> source("ch_07_contextual_based.R")
Detecting novelty in text and topic detection
One application of outlier detection is finding novel topics in a collection of documents or newspaper articles. A major detection task is opinion detection, since an outlying opinion stands out among many others.
With the rise of social media, many events happen every day; earlier, collections covered only special events or ideas of interest to related researchers or companies.
The growth of these collections is characterized by varied data sources, documents in different formats, high-dimensional attributes, and sparse source data.
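As a minimal, hypothetical sketch of novelty detection in text (the documents and the similarity threshold of 0.2 are invented for illustration), the following snippet turns documents into term-frequency vectors and flags a document whose cosine similarity to every earlier document stays below the threshold:

```r
## Hypothetical document collection; the last one introduces a novel topic
docs <- c("server cpu load high",
          "cpu load and network bandwidth",
          "network bandwidth of the server",
          "new product launch announced today")

## Build a term-frequency matrix: one row per document
vocab <- unique(unlist(strsplit(docs, " ")))
tf <- t(sapply(docs, function(d)
  tabulate(match(strsplit(d, " ")[[1]], vocab), nbins = length(vocab))))

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
for (i in 2:length(docs)) {
  sims <- sapply(seq_len(i - 1), function(j) cosine(tf[i, ], tf[j, ]))
  if (max(sims) < 0.2) cat("document", i, "looks novel\n")
}
```

The first three documents share vocabulary and thus resemble each other; the fourth shares no terms with any earlier document, so its maximum similarity is zero and it is reported as novel.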
Given a dataset, if a collection of related data instances is anomalous with respect to the entire dataset, it is defined as a collective outlier.
The route outlier detection (ROD) algorithm
The summarized pseudocode of the ROD algorithm is as follows. The input parameters include the multidimensional attribute space, the attribute dataset (D), the distance measure function (F), the depth of neighbor (ND), the spatial graph (G = (V, E)), and the confidence interval (CI):
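As a much-simplified, hypothetical sketch of the ROD idea (not the book's implementation), the following snippet works on a toy spatial graph G = (V, E): each node, standing for a road segment, carries one attribute value, and a node is flagged when its value falls outside a confidence interval built from its direct neighbors; the adjacency list, the depth ND = 1, and the 95% CI are all illustrative choices:

```r
## Toy spatial graph as an adjacency list; x holds the segment attribute
adj <- list(c(2, 3), c(1, 3, 4), c(1, 2, 4), c(2, 3, 5), c(4, 6), c(5))
x   <- c(100, 105, 98, 40, 102, 99)   # segment 4 deviates from its neighbours

flagged <- sapply(seq_along(x), function(v) {
  nb <- x[adj[[v]]]
  if (length(nb) < 2) return(FALSE)   # too few neighbours to build a CI
  ci <- mean(nb) + c(-1, 1) * qnorm(0.975) * sd(nb)   # 95% confidence interval
  x[v] < ci[1] || x[v] > ci[2]
})
which(flagged)   # spatial outlier candidates
```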
The R implementation
Look up the file of R codes, ch_07_rod.R
, from the bundle of R codes for the previously mentioned algorithm. The codes can be tested with the following command:
> source("ch_07_rod.R")
Characteristics of collective outliers
Collective outliers denote a collection of data instances that, as a group, contrasts abnormally with the input dataset. Their major characteristic is that only the collection appearing together is anomalous; an individual data point from that collection, taken on its own, is usually not an outlier. Another characteristic of a collective outlier is that it can also be a contextual outlier.
Collective outliers may appear in sequence data, spatial data, and so on.
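As a minimal, hypothetical sketch of a collective outlier in sequence data (the sequence and window size are invented for illustration), each value below is common on its own, but a long run of identical values is anomalous as a group; a sliding window flags spans with zero variance:

```r
## Hypothetical sequence of common values with one injected flat run
set.seed(3)
x <- sample(1:5, 60, replace = TRUE)
x[30:39] <- 3   # collective outlier: a flat run of an otherwise common value

## Standard deviation over every sliding window of width w
w <- 10
win_sd <- sapply(1:(length(x) - w + 1), function(i) sd(x[i:(i + w - 1)]))
which(win_sd == 0)   # window start(s) covering the flat run
```

The value 3 is unremarkable anywhere in the sequence; only the run of ten consecutive 3s, taken together, is an outlier.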
Outlier detection in high-dimensional data has some characteristics that make it different from other outlier detection problems.
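One such characteristic can be illustrated with a quick, hypothetical experiment: as dimensionality grows, the contrast between the nearest and farthest pairwise distances shrinks, which weakens distance-based outlier scores in high-dimensional data. The point counts and dimensions below are arbitrary:

```r
## Distance concentration: the relative spread of pairwise distances
## shrinks as the number of dimensions grows
set.seed(4)
for (d in c(2, 10, 100, 1000)) {
  pts <- matrix(runif(100 * d), ncol = d)
  dst <- as.vector(dist(pts))
  cat(sprintf("dim %4d: (max - min) / min distance = %.2f\n",
              d, (max(dst) - min(dst)) / min(dst)))
}
```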
Here are some practice questions to check your understanding of the concepts:
- What is an outlier?
- List as many types of outliers as possible and your categorization measures.
- List as many areas of application of outlier detection as possible.
In this chapter, we looked at:
- Statistical methods based on probabilistic distribution functions, where normal data points are those generated by the model and all other points are defined as outliers.
- Proximity-based methods.
- Density-based methods.
- Clustering-based methods.
- Classification-based methods.
- Mining contextual outliers.
- Collective outliers.
- Outlier detection in high-dimensional data.
The next chapter will cover the major topics related to outlier detection algorithms, together with examples that build on the previous chapters, approaching all of this from a markedly different viewpoint.