In this chapter, you will learn how to implement popular clustering algorithms in R. Tools to evaluate and benchmark the results are also provided.
In this chapter, we will cover the following topics:
- Customer categorization analysis of e-commerce and DBSCAN
- Clustering web pages and OPTICS
- Visitor analysis in the browser cache and DENCLUE
- Recommendation system and STING
- Web sentiment analysis and CLIQUE
- Opinion mining and WAVE CLUSTER
- User search intent and the EM algorithm
- Customer purchase data analysis and clustering high-dimensional data
- SNS and clustering graph and network data
By defining a density measure on the data-point space, clusters can be modeled as regions of sufficiently high density in the data space.
Density Based Spatial Clustering of Applications with Noise (DBSCAN) is one of the most popular density-based clustering algorithms. The major characteristics of DBSCAN are:
- Good at dealing with large, noisy datasets
- Able to discover clusters of various shapes
DBSCAN is based on the classification of the data points in the dataset as core points, border points, and noise points, supported by density relations between points: directly density-reachable, density-reachable, and density-connected. Before we provide a detailed description of DBSCAN, let's illustrate these ideas.
A point is defined as a core point if the number of data points within the predefined distance, Eps (also written ε), is at least the predefined parameter, MinPts. The space within distance Eps of a point is called its Eps-neighborhood (or ε-neighborhood). An object, o, is noise only if there is no cluster that contains o. A border object is any object that belongs to a cluster but is not a core object.
Given a core object, p, and an object, q, the object q is directly density-reachable from p if q lies within the Eps-neighborhood of p.
Given a core object, p, and an object, q, the object q is density-reachable from p if there is a chain of objects p = p1, p2, ..., pn = q such that each pi+1 is directly density-reachable from pi.
Two data objects, p and q, are density-connected if there exists an object, o, from which both p and q are density-reachable.
The density-based cluster denotes a set of density-connected data objects that are maximal according to density reachability.
Here is an example illustrated in the following diagram:
The summarized pseudocode for the DBSCAN algorithm is given below, with the input and output parameters defined as follows.
The input parameters for the DBSCAN algorithm are as follows:
- D, which is the training tuples dataset
- k, which is the neighbor list size
- Eps, which is the radius parameter that denotes the neighborhood area of a data point
- MinPts, which is the minimum number (the neighborhood density threshold) of points that must exist in the Eps-neighborhood
The output of the algorithm is a set of density-based clusters.
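The procedure sketched by the pseudocode can also be written out directly. The following is a minimal illustration in Python, assuming Euclidean distance and a naive O(n²) neighborhood query rather than a spatial index:

```python
from collections import deque

def dbscan(D, eps, min_pts):
    """Cluster dataset D (a list of numeric tuples); returns one label per
    point, with -1 meaning noise."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def neighbors(i):
        return [j for j in range(len(D)) if dist(D[i], D[j]) <= eps]

    labels = [None] * len(D)              # None = not yet visited
    cluster_id = 0
    for i in range(len(D)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # not a core point; may become border later
            labels[i] = -1
            continue
        labels[i] = cluster_id            # i is a core point: start a new cluster
        queue = deque(seeds)
        while queue:                      # expand via density-reachability
            j = queue.popleft()
            if labels[j] == -1:           # noise reached from a core point = border
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is itself a core point
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels
```

Two tight groups of points come out as two clusters, while an isolated point is labeled -1 (noise).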
The OPTICS algorithm
Ordering Points To Identify the Clustering Structure (OPTICS) extends the DBSCAN algorithm. It is based on the observation that density-based clusters with respect to a higher density are completely contained in density-connected sets with respect to a lower density.
To construct density-based clusters with different densities simultaneously, the objects are processed in a specific order when expanding a cluster: objects that are density-reachable with respect to the smallest ε value are processed first, so that clusters of higher density are finished first.
Two major concepts are introduced to illustrate the OPTICS algorithm: the core-distance and the reachability-distance of an object, p. An example is illustrated in the following image, which shows the core-distance of o and the reachability-distances of two points from o, for MinPts = 4.
The core-distance of an object, p, is the smallest radius ε' such that the ε'-neighborhood of p contains at least MinPts data objects; if no such radius within ε exists, it is undefined.
Given two data objects, p and q, the reachability-distance of object, p, from q represents the smallest radius value that makes p density-reachable from q if q is a core data object; otherwise, this is undefined.
OPTICS outputs an ordering of all data objects in a given dataset, in the order in which they are processed. For each object, the core-distance and a suitable reachability-distance are calculated and output together.
OPTICS randomly selects an object from the input dataset as the current data object, p. For p, the ε-neighborhood is fetched, the core-distance is calculated, and the reachability-distance is initialized as undefined.
Given p as a core data object, OPTICS recalculates the reachability-distance for any object, q (in the neighborhood of p), from p and inserts q into OrderSeeds if q has not been processed.
Once the augmented cluster ordering of the input dataset has been generated with respect to ε and MinPts, the density-based clustering for any clustering-distance ε' ≤ ε can be performed by walking through this order.
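The two distance concepts can be sketched directly from their definitions. The following Python fragment is illustrative only; it counts a point inside its own neighborhood and uses a brute-force distance scan:

```python
def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def core_distance(D, p, eps, min_pts):
    """Distance to the MinPts-th nearest neighbour of D[p], or None
    (undefined) if fewer than MinPts points lie within eps."""
    dists = sorted(euclid(D[p], x) for x in D)   # includes distance 0 to itself
    within = [d for d in dists if d <= eps]
    return dists[min_pts - 1] if len(within) >= min_pts else None

def reachability_distance(D, p, q, eps, min_pts):
    """Smallest radius that makes D[p] directly density-reachable from the
    core object D[q]; undefined (None) if q is not a core object."""
    cd = core_distance(D, q, eps, min_pts)
    return None if cd is None else max(cd, euclid(D[p], D[q]))
```

For a tight group of three points, the core-distance is the distance to the third-nearest neighbor, and the reachability-distance of a close neighbor is bounded below by that core-distance.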
The summarized pseudocode for the OPTICS algorithm, with a couple of supporting functions, is as follows:
The R implementation
Please take a look at the R code file ch_06_optics.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:
> source("ch_06_optics.R")
Clustering web pages
Clustering web pages can be used to group related texts/articles and serve as a preprocessing step for supervised learning; it enables automated categorization.
Web pages are versatile, with varying structure (well formed, or badly formed with little structure) and content.
The Yahoo! industry web page data (CMU WebKB Project, 1998) is used as the test data here.
DENsity-based CLUstEring (DENCLUE) is a density-based clustering algorithm built on density-distribution functions.
Before a detailed explanation on the DENCLUE algorithm, some concepts need to be introduced; they are influence function, density function, gradient, and density attractor.
The influence function of a specific data object describes that object's impact on its neighborhood; in principle it can be an arbitrary function, and the Gaussian kernel centered at the data point is usually used.
The density function at a point, x, is defined as the sum of the influence functions of all the data objects at this data point.
A point is defined as a density attractor if it is a local maximum of the density function; it is found by a hill-climbing procedure that follows the gradient of the density function.
The gradient of the density function at a point is the sum of the gradients of the influence functions of all the data objects, and it points in the direction of the steepest density increase.
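For a one-dimensional illustration, the Gaussian influence function, the resulting density, its gradient, and a simple hill-climb to a density attractor can be sketched as follows; the step size, iteration count, and bandwidth sigma are arbitrary choices:

```python
import math

def density(x, data, sigma=1.0):
    """Density at x: sum of Gaussian influence functions of all data objects."""
    return sum(math.exp(-((x - xi) ** 2) / (2 * sigma ** 2)) for xi in data)

def gradient(x, data, sigma=1.0):
    """Gradient of the density function (1-D Gaussian kernel case)."""
    return sum((xi - x) / sigma ** 2 *
               math.exp(-((x - xi) ** 2) / (2 * sigma ** 2)) for xi in data)

def density_attractor(x, data, sigma=1.0, step=0.1, iters=200):
    """Hill-climb from x along the gradient to a local maximum of the density."""
    for _ in range(iters):
        x += step * gradient(x, data, sigma)
    return x
```

Points starting near either of two well-separated groups climb to that group's local density maximum, which is the attractor they are assigned to.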
DENCLUE first defines a density function over the data-point space. All local maxima of the density function are found, and each data point is assigned to the nearest local maximum, maximizing the density related to it. Each group of data points bound to a local maximum forms a cluster. As a postprocessing step, a cluster is discarded if the density at its local maximum is lower than a user-predefined value; clusters are merged if there exists a path between them on which every point has a density higher than that value.
The R implementation
Please take a look at the R code file ch_06_denclue.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:
> source("ch_06_denclue.R")
Visitor analysis in the browser cache
Browser-cache analysis lets a website owner show each visitor the best-matched content; at the same time, it touches on the visitors' privacy. The data instances in this context are browser caches, sessions, cookies, various logs, and so on.
The possible factors included in such data instances are the web address; the IP address, which indicates where the visitor comes from; the time the visitor spent on a specific page; the pages the user visited; the sequence of the visited pages; the date and time of every visit; and so on. A log can be specific to one website or span several websites. A more detailed description is given in the following table:
Visitor analysis is basically history sniffing, which is used for user-behavior analysis.
STatistical Information Grid (STING) is a grid-based clustering algorithm. The dataset is recursively divided into a hierarchy structure. The whole input dataset serves as the root node in the hierarchy structure. Each cell/unit in a layer is composed of a couple of cells/units in the lower layer. An example is shown in the following diagram:
To support the query for a dataset, the statistical information of each unit is calculated in advance for further processing; this information is also called statistics parameters.
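As a sketch, the statistics parameters of a cell, and their bottom-up aggregation into a parent cell, might look like this in Python; the parameter set here (count, mean, min, max) is a simplified subset:

```python
def cell_stats(points):
    """Statistics parameters for one grid cell holding 1-D values."""
    n = len(points)
    return {"count": n,
            "mean": sum(points) / n if n else 0.0,
            "min": min(points) if n else None,
            "max": max(points) if n else None}

def merge_stats(children):
    """Aggregate the statistics of child cells into their parent cell,
    so queries on the parent need not touch the raw data."""
    present = [c for c in children if c["count"]]
    n = sum(c["count"] for c in children)
    return {"count": n,
            "mean": sum(c["mean"] * c["count"] for c in present) / n if n else 0.0,
            "min": min(c["min"] for c in present) if present else None,
            "max": max(c["max"] for c in present) if present else None}
```

The parent's mean is the count-weighted mean of its children, so each layer can be computed once from the layer below it.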
The characteristics of the STING algorithm include (but are not limited to) the following:
- A query-independent structure
- Intrinsic parallelizability
- Efficiency
The R implementation
Please take a look at the R code file ch_06_sting.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:
> source("ch_06_sting.R")
Recommendation systems
Drawing on statistical, data-mining, and knowledge-discovery techniques, recommendation systems are used by most e-commerce sites to make it easy for consumers to find products to purchase. Their three main parts, representation of input data, neighborhood formation, and recommendation generation, are shown in the following diagram:
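The neighborhood-formation and recommendation-generation parts can be sketched with cosine similarity over user rating vectors; the representation (0 denoting an unrated item) and the function names are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two user rating vectors (0 = unrated)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(target, others, k=2):
    """Predict the target user's unrated items from the k most similar users,
    as a similarity-weighted average of their ratings."""
    neigh = sorted(others, key=lambda o: cosine(target, o), reverse=True)[:k]
    preds = {}
    for i, r in enumerate(target):
        if r == 0:  # unrated item
            num = sum(cosine(target, o) * o[i] for o in neigh if o[i])
            den = sum(cosine(target, o) for o in neigh if o[i])
            preds[i] = num / den if den else 0.0
    return preds
```

A user who agrees with two of three other users on the rated items receives a prediction between those two users' ratings for the missing item.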
The CLIQUE algorithm
CLustering In QUEst (CLIQUE) is a bottom-up, grid-based clustering algorithm. The idea behind it is the Apriori property, that is, the monotonicity of dense units with respect to dimensionality: if a set of data points, S, is a cluster in a k-dimensional projection of the space, then S is also contained in a cluster in any (k-1)-dimensional projection of this space.
The algorithm proceeds in passes. The one-dimensional dense units are produced by one pass through the data. The candidate k-dimensional units are generated from the (k-1)-dimensional dense units determined in the (k-1)th pass, using the candidate-generation procedure.
The characteristics of the CLIQUE algorithm are as follows:
- Efficient for high-dimensional datasets
- Interpretability of results
- Scalability and usability
The CLIQUE algorithm contains three steps to cluster a dataset. First, a group of subspaces is selected to cluster the dataset. Then, clustering is independently executed in every subspace. Finally, a concise summary of every cluster is produced in the form of a disjunctive normal form (DNF) expression.
The summarized pseudocode for the CLIQUE algorithm is as follows:
- Identification of the subspaces that contain clusters.
- Identification of the clusters.
- Generation of a minimal description for each cluster.
The candidate-generation algorithm is illustrated as follows:
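An Apriori-style sketch of that candidate generation in Python follows; representing a unit as a sorted tuple of (dimension, interval) pairs is an assumption made for illustration:

```python
from itertools import combinations

def candidate_units(dense_units):
    """Join (k-1)-dimensional dense units into candidate k-dimensional units,
    pruning any candidate with a non-dense (k-1)-dimensional projection
    (the monotonicity / Apriori property)."""
    dense = set(dense_units)
    k = len(next(iter(dense))) + 1
    candidates = set()
    for u, v in combinations(dense, 2):
        merged = tuple(sorted(set(u) | set(v)))
        if len(merged) != k:
            continue                                  # units must differ in one dim
        if len({dim for dim, _ in merged}) != k:
            continue                                  # one interval per dimension
        if all(sub in dense for sub in combinations(merged, k - 1)):
            candidates.add(merged)                    # all projections are dense
    return candidates
```

From three one-dimensional dense units in three different dimensions, all three pairwise two-dimensional combinations survive as candidates; each candidate would still have to be checked for density against the data.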
Here is the algorithm to find the connected components of the graph; this is equivalent to finding clusters:
The R implementation
Please take a look at the R code file ch_06_clique.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:
> source("ch_06_clique.R")
Web sentiment analysis
Web sentiment analysis is used to identify the idea or thought behind a text, for example, in the sentiment analysis of microblogs such as Twitter. One simple approach compares a post against a predefined list of labeled words to judge its sentiment. Another example is judging a movie review as thumbs up or thumbs down.
Web sentiment analyses are used in news article analyses of biases pertaining to specific views, newsgroup evaluation, and so on.
The WAVE cluster algorithm is a grid-based clustering algorithm. It relies on the relation between a spatial dataset and multidimensional signals. The idea is that the clusters in a multidimensional spatial dataset become more distinguishable after a wavelet transformation, that is, after applying wavelets to the (possibly preprocessed) input dataset. The dense regions separated by sparse areas in the transformed result represent clusters.
The characteristics of the WAVE cluster algorithm are as follows:
- Efficient for large datasets
- Efficient at finding clusters of various shapes
- Insensitive to noise and outliers
- Insensitive to the input order of the dataset
- Multiresolution, which is introduced by the wavelet transforms
- Applicable to any numerical dataset
The WAVE cluster algorithm performs a few steps. First, it creates a grid and assigns each data object from the input dataset to a unit in the grid. Second, it transforms the data to a new space by applying wavelet transform functions. Third, it finds the connected components in the new space. Finally, it maps the cluster label of each component back to the related data objects in the original data space.
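Let's sketch these steps in Python. A plain averaging filter stands in here for the wavelet low-pass transform, and the cell size and density threshold are illustrative choices:

```python
from collections import deque

def wave_cluster(points, cell=1.0, threshold=0.5):
    """Quantize 2-D points onto a grid, smooth the cell counts (a crude
    stand-in for the wavelet low-pass filter), and label connected dense
    cells as clusters. Returns (cell -> cluster id, number of clusters)."""
    grid = {}
    for x, y in points:
        key = (int(x // cell), int(y // cell))
        grid[key] = grid.get(key, 0) + 1
    # smoothing: average each occupied cell with its 4-neighbourhood
    offsets = ((0, 1), (0, -1), (1, 0), (-1, 0))
    smooth = {}
    for (i, j), c in grid.items():
        neigh = [grid.get((i + di, j + dj), 0) for di, dj in offsets]
        smooth[(i, j)] = (c + sum(neigh)) / 5.0
    dense = {k for k, v in smooth.items() if v > threshold}
    # connected components of dense cells = clusters
    labels, cid = {}, 0
    for start in dense:
        if start in labels:
            continue
        labels[start] = cid
        queue = deque([start])
        while queue:
            i, j = queue.popleft()
            for di, dj in offsets:
                nb = (i + di, j + dj)
                if nb in dense and nb not in labels:
                    labels[nb] = cid
                    queue.append(nb)
        cid += 1
    return labels, cid
```

Two dense groups of points far apart yield two components, while a lone point's cell falls below the threshold and is dropped as noise.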
The R implementation
Please take a look at the R code file ch_06_wave.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:
> source("ch_06_wave.R")
Opinion mining
An entity possesses a number of features, which might be explicit or implicit. If a person or organization expresses an opinion, that person or organization is an opinion holder. An opinion on a feature is a positive or negative view, attitude, emotion, or appraisal of that feature by the opinion holder. Whether the opinion on a feature is positive, negative, or neutral is its opinion orientation.
Opinion mining means mining the opinion on certain features of the object or entity under research. The simplest case is to judge the opinion as positive or negative.
An opinion-orientation algorithm can be outlined as follows:
- Identify opinion words and phrases
- Handle negation
- Handle but clauses
- Aggregate the opinions
The opinion orientation on a feature, f, in a sentence, s, can be aggregated as score(f) = Σ so(w) / dist(w, f), summed over every opinion word, w, in s, where dist(w, f) is the distance between w and f, and so(w) is the opinion orientation of w.
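Let's sketch this aggregation in Python. The whitespace tokenization and the tiny lexicon are illustrative, and the negation and but-clause handling from the steps above is omitted:

```python
def opinion_orientation(sentence, feature, lexicon):
    """Score a feature in a tokenized sentence as the sum of so(w)/dist(w, f)
    over the opinion words w found in the sentence."""
    fi = sentence.index(feature)
    score = 0.0
    for i, w in enumerate(sentence):
        if w in lexicon and i != fi:
            score += lexicon[w] / abs(i - fi)   # nearer opinion words weigh more
    return score
```

With so(great) = +1 and so(terrible) = -1, a feature closer to "great" scores positive and one closer to "terrible" scores negative.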
The Expectation Maximization (EM) algorithm is a probabilistic-model-based clustering algorithm that depends on the mixture model in which the data is modeled by a mixture of simple models. The parameters related to these models are estimated by Maximum Likelihood Estimation (MLE).
Mixture models assume that the data results from a combination of several simple probability distribution functions. Given K distribution functions, where the jth distribution has the parameter θj and Θ = (θ1, ..., θK) is the set of parameters of all the distributions, the mixture density is p(x | Θ) = Σ wj pj(x | θj), with the sum running over j = 1, ..., K and the mixture weights wj summing to one.
The EM algorithm proceeds in the following way. In the first step, an initial set of model parameters is selected. The second step, the expectation step, calculates the probability that each data object belongs to each distribution:
The previous equation represents the probability of each data object belonging to each distribution. Maximization is the third step: using the result of the expectation step, the parameter estimates are updated to the values that maximize the expected likelihood:
The expectation step and the maximization step are performed repeatedly until the ending condition is met, that is, until the changes in the parameter estimates fall below a certain threshold.
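The two alternating steps can be sketched for a one-dimensional, two-component Gaussian mixture. Fixing the variance at 1 and initializing the means at the data extremes are simplifying assumptions:

```python
import math

def em_gmm(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture (equal, fixed variance)."""
    mu = [min(data), max(data)]           # crude initialization of the means
    w, var = [0.5, 0.5], 1.0
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var)) for j in (0, 1)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights and means from the responsibilities
        n = [sum(r[j] for r in resp) for j in (0, 1)]
        w = [n[j] / len(data) for j in (0, 1)]
        mu = [sum(r[j] * x for r, x in zip(resp, data)) / n[j] for j in (0, 1)]
    return mu, w
```

On data drawn from two well-separated groups, the estimated means converge to the group means and the weights to the group proportions.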
The R implementation
Please take a look at the R code file ch_06_em.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:
> source("ch_06_em.R")
The user search intent
Determining the user's search intent is an important yet difficult problem, given the sparse data available about the searcher and the queries.
User intent has a wide range of applications: clustering query refinements, building user-intention profiles, and inducing web search intent. Given web search engine queries, finding the user's intention is a key requirement.
The sequence of clicks on the search results is a good basis for determining the user's interests and preferences.
Web search personalization is another important application of user search intent. It is related to the user's context and intention; once the user's intention is taken into account, more effective and relevant information can be provided.
For high-dimensional data-space clustering, two major issues occur: efficiency and quality. New algorithms are needed to deal with this type of dataset, and two popular strategies are applied. One is the subspace-clustering strategy, which finds clusters in subspaces of the original data space. The other is the dimensionality-reduction strategy, in which a lower-dimensional data space is created for further clustering.
The MAFIA algorithm
MAFIA is an efficient and scalable subspace-clustering algorithm for high-dimensional and large datasets.
The summarized pseudocode for the MAFIA algorithm is as follows:
The summarized pseudocode for the parallel MAFIA algorithm is as follows:
The SURFING algorithm
The summarized pseudocode for the SURFING algorithm is as follows. It selects interesting features from the original attributes of the dataset.
Clustering graph and network data has wide application in modern life, for example, in social networking. More challenges crop up as the demand grows: high computational cost, sophisticated graph structure, and high dimensionality and sparsity are the major concerns. With suitable transformations, these problems can be recast as graph-cut problems.
Structural Clustering Algorithm for Networks (SCAN) is an algorithm that searches for well-connected components in a graph as clusters.
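A compact sketch of the idea in Python follows, using the structural similarity σ(u, v) = |Γ(u) ∩ Γ(v)| / √(|Γ(u)| · |Γ(v)|) over closed neighborhoods (a vertex's neighbors plus the vertex itself); the ε and μ defaults are illustrative:

```python
import math

def structural_similarity(adj, u, v):
    """sigma(u, v) over closed neighbourhoods of vertices u and v."""
    gu, gv = adj[u] | {u}, adj[v] | {v}
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))

def scan(adj, eps=0.7, mu=2):
    """Grow clusters from core vertices, i.e. vertices whose eps-neighbourhood
    (neighbours with similarity >= eps) has at least mu members."""
    def eps_neigh(u):
        return {v for v in adj[u] | {u}
                if structural_similarity(adj, u, v) >= eps}

    labels, cid = {}, 0
    for u in adj:
        if u in labels or len(eps_neigh(u)) < mu:
            continue
        labels[u] = cid
        stack = [u]
        while stack:
            x = stack.pop()
            if len(eps_neigh(x)) < mu:    # non-core members join, don't expand
                continue
            for y in eps_neigh(x):
                if y not in labels:
                    labels[y] = cid
                    stack.append(y)
        cid += 1
    return labels
```

On two triangles joined by a single bridge edge, the bridge similarity falls below ε, so each triangle comes out as its own cluster.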
The R implementation
Please take a look at the R code file ch_06_scan.R from the bundle of R codes for the previously mentioned algorithms. The code can be tested with the following command:
> source("ch_06_scan.R")
Social networking service (SNS)
Social networks have become among the most popular online communication methods nowadays. SNS analysis has become important because of requirements such as security, business, and content control.
The foundation of SNS analysis is graph theory, especially for SNS mining tasks such as finding social communities and detecting abuse of SNS for malicious purposes.
Clustering is a natural application for SNS, used to find communities (community detection). Random walk is another key technique for SNS analysis and is also used to find communities.
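The random-walk idea can be sketched as follows: vertices in the same community tend to have similar short-random-walk probability distributions, a notion loosely in the spirit of the Walktrap approach (this is not the chapter's exact method). The sketch below (in Python rather than R) compares t-step walk distributions and greedily groups vertices whose distributions are close; the toy graph, walk length `t`, and tolerance `tol` are assumptions.

```python
import numpy as np

def walk_profiles(A, t=3):
    """Row-stochastic t-step transition probabilities of a random walk."""
    P = A / A.sum(axis=1, keepdims=True)
    return np.linalg.matrix_power(P, t)

def communities(A, t=3, tol=0.5):
    """Group vertices whose t-step walk distributions are L1-close."""
    Pt = walk_profiles(A, t)
    n = len(Pt)
    label, cid = [-1] * n, 0
    for i in range(n):
        if label[i] != -1:
            continue
        label[i] = cid
        for j in range(i + 1, n):
            if label[j] == -1 and np.abs(Pt[i] - Pt[j]).sum() < tol:
                label[j] = cid
        cid += 1
    return label

# Two triangles (vertices 0-2 and 3-5) connected by a single edge 2-3.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], float)
print(communities(A))    # -> [0, 0, 0, 1, 1, 1]
```

Short walks starting inside a community mostly stay inside it, which is why the two triangles separate cleanly; in R, the igraph package provides a full implementation of this family of methods.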
Here are some questions to help you check whether you have understood the concepts:
- What is the DBSCAN algorithm?
- What is the SCAN algorithm?
- What is the STING algorithm?
- What is the OPTICS algorithm?
- What is the constraint-based cluster method?
In this chapter, we covered the following topics:
- DBSCAN depends on a density-based description of clusters. By searching for and measuring the density of data points, regions of high density indicate likely clusters, while low-density points are treated as outliers or noise.
- OPTICS produces a cluster ordering that consists of the order of the data objects together with their corresponding reachability-distances and core-distances.
- You learned that DENCLUE is a clustering algorithm based on a set of specific density-distribution functions and can find arbitrarily shaped clusters. It first segments the dataset into cubes and identifies the local density-distribution function. You also learned that a hill-climbing algorithm retrieves the local maximum for each item in the related cube, from which a cluster is built.
- We saw that STING is based on a grid-like data structure that segments the embedding spatial area of the input data points into rectangular units. It is mainly used for spatial datasets.
- You learned that CLIQUE is a grid-based clustering algorithm that finds subspaces of high-dimensional data, and it can find dense units in high-dimensional data too.
- You saw that WaveCluster is a grid-based clustering algorithm based on wavelet transformations. It supports multiresolution analysis and is efficient for large datasets.
- You learned that EM is a probabilistic-model-based clustering algorithm, in which each data point belongs to a cluster with a certain probability. It is based on the assumption that the attributes of the data points take values that follow a mixture (linear combination) of simple distributions.
- We covered clustering high-dimensional data with subspace-clustering algorithms such as MAFIA and SURFING.
- We covered clustering graph and network data with the SCAN algorithm and its application to community detection in social networks.
In the next chapter, we will cover the major topics related to outlier detection and algorithms, and look at some of their examples.