R: Mining spatial, text, web, and social media data

By: Nathan H. Danneman, Richard Heimann, Pradeepta Mishra, Bater Makhabel

Overview of this book

Data mining is the first step to understanding data and making sense of large volumes of it. Properly mined data forms the basis of all subsequent analysis and computation. This Learning Path will take you from the very basics of data mining to advanced techniques, ending with a specialized branch of data mining: social media mining.

You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, associations, and correlations while working with R programs. You will discover how to write code for various prediction models, stream data, and time-series data, and you will be introduced to solutions written in R based on RHadoop projects.

Once you are comfortable with data mining in R, you will move on to end-to-end data mining projects, applying different mining concepts to statistical and data applications in a wide range of fields. At this stage, you will be able to complete complex data mining cases and handle any issues you might encounter during projects.

After this, you will gain hands-on experience generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially generated data, along with the theoretical background needed to interpret your findings accurately. You will be shown R code and example data that can serve as a springboard for your own analyses of business, social, or political data.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Learning Data Mining with R by Bater Makhabel
  • R Data Mining Blueprints by Pradeepta Mishra
  • Social Media Mining with R by Nathan Danneman and Richard Heimann
Table of Contents (6 chapters)

Chapter 6. Advanced Cluster Analysis

In this chapter, you will learn how to implement the top clustering algorithms with R. Tools to evaluate, benchmark, and measure the results are also provided.

In this chapter, we will cover the following topics:

  • Customer categorization analysis of e-commerce and DBSCAN
  • Clustering web pages and OPTICS
  • Visitor analysis in the browser cache and DENCLUE
  • Recommendation system and STING
  • Web sentiment analysis and CLIQUE
  • Opinion mining and WAVE CLUSTER
  • User search intent and the EM algorithm
  • Customer purchase data analysis and clustering high-dimensional data
  • SNS and clustering graph and network data

Customer categorization analysis of e-commerce and DBSCAN

By defining density measures over the data point space, clusters can be modeled as regions of the data space that exceed a certain density.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is one of the most popular density-based clustering algorithms. Its major characteristics are:

  • It deals well with large datasets containing noise
  • It can find clusters of various shapes

DBSCAN classifies the data points in the dataset as core points, border points, and noise points, supported by density relations between points: directly density-reachable, density-reachable, and density-connected. Before we provide a detailed description of DBSCAN, let's define these ideas.

A point is defined as a core point if the number of data points within the distance Eps of it is at least the predefined parameter MinPts. The space within distance Eps of a point, p, is called its Eps-neighborhood, N_Eps(p). An object, o, is noise only if there is no cluster that contains o. A border object is any object that belongs to a cluster but is not a core object.

Given a core object, p, and an object, q, the object q is directly density-reachable from p if q lies within the Eps-neighborhood of p, that is, q ∈ N_Eps(p).

Given a core object, p, and an object, q, the object q is density-reachable from p if there is a chain of objects p_1 = p, p_2, ..., p_n = q such that each p_(i+1) is directly density-reachable from p_i.

Two data objects, p and q, are density-connected if there exists an object, o, such that both p and q are density-reachable from o.

The density-based cluster denotes a set of density-connected data objects that are maximal according to density reachability.

An example is illustrated in the following diagram:

[Diagram: core, border, and noise points with respect to Eps and MinPts]

The DBSCAN algorithm

The summarized pseudocode for the DBSCAN algorithm is as follows, with the input/output parameters defined first.

The input parameters for the DBSCAN algorithm are as follows:

  • D, the training tuples dataset
  • k, the neighbor list size
  • Eps, the radius parameter that defines the neighborhood area of a data point
  • MinPts, the minimum number of points (the neighborhood density threshold) that must exist in the Eps-neighborhood

The output of the algorithm is a set of density-based clusters.

[Pseudocode: the DBSCAN algorithm]
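The procedure above can be sketched in base R. This is a minimal, illustrative implementation under stated assumptions: the Euclidean distance, the function name `dbscan_simple`, and the toy two-blob dataset are mine, not the book's bundled code:

```r
# Minimal DBSCAN sketch: labels are 0 (unvisited), -1 (noise), or a cluster id
dbscan_simple <- function(X, eps, minPts) {
  D <- as.matrix(dist(X))          # pairwise Euclidean distances
  labels <- rep(0L, nrow(X))
  cl <- 0L
  for (i in seq_len(nrow(X))) {
    if (labels[i] != 0L) next
    neigh <- which(D[i, ] <= eps)
    if (length(neigh) < minPts) { labels[i] <- -1L; next }   # not a core point
    cl <- cl + 1L
    labels[i] <- cl
    seeds <- setdiff(neigh, i)
    while (length(seeds) > 0L) {
      j <- seeds[1L]; seeds <- seeds[-1L]
      if (labels[j] == -1L) labels[j] <- cl                  # noise becomes a border point
      if (labels[j] != 0L) next
      labels[j] <- cl
      neigh_j <- which(D[j, ] <= eps)
      if (length(neigh_j) >= minPts)                         # j is a core point: expand it
        seeds <- union(seeds, setdiff(neigh_j, which(labels != 0L)))
    }
  }
  labels
}

set.seed(7)
X <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
           matrix(rnorm(60, 3, 0.3), ncol = 2))
table(dbscan_simple(X, eps = 0.6, minPts = 4))
```

In practice, the CRAN `dbscan` package provides an optimized `dbscan()` with the same Eps/MinPts parameters.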

Customer categorization analysis of e-commerce

Customers of e-commerce sites can be categorized by psychographic and culturally specific purchasing behavior. The resulting customer categories let the store owner respond to customers efficiently and effectively. The general analysis process for e-commerce is illustrated as follows:

[Diagram: the general e-commerce customer categorization analysis process]

Clustering web pages and OPTICS

Ordering Points To Identify the Clustering Structure (OPTICS) extends the DBSCAN algorithm. It is based on the observation that density-based clusters with respect to a higher density are completely contained in density-connected sets with respect to a lower density.

To construct density-based clusters with different densities simultaneously, the objects are processed in a specific order when expanding a cluster: objects that are density-reachable with respect to the lowest Eps value are processed first, so that clusters of higher density are finished first.

Two major concepts are introduced to illustrate the OPTICS algorithm: the core-distance of an object, p, and the reachability-distance of an object, p, from another object. An example is illustrated in the following image, which shows the core-distance of o and the reachability-distances of p1 and p2 from o, for MinPts = 4:

[Diagram: core-distance(o) and the reachability-distances r(p1, o) and r(p2, o), for MinPts = 4]

The core-distance of an object, p, is denoted as the minimal value, Eps', such that the Eps'-neighborhood of p contains at least MinPts data objects; otherwise, it is undefined.

Given two data objects, p and q, the reachability-distance of object, p, from q represents the smallest radius value that makes p density-reachable from q if q is a core data object; otherwise, this is undefined.

OPTICS outputs an ordering of all data objects in the given dataset, namely the order in which they are processed. For each object, the core-distance and a suitable reachability-distance are calculated and output together.

The OPTICS algorithm

OPTICS randomly selects an object from the input dataset as the current data object, p. For p, the Eps-neighborhood is fetched, the core-distance is calculated, and the reachability-distance is initially undefined.

Given p as a core data object, OPTICS recalculates the reachability-distance for any object, q (in the neighborhood of p), from p and inserts q into OrderSeeds if q has not been processed.

Once the augmented cluster ordering of the input dataset is generated with respect to Eps and MinPts, density-based clustering for any clustering-distance Eps' ≤ Eps can be performed from this order.

The summarized pseudocode for the OPTICS algorithm, with a couple of supporting functions, is as follows:

[Pseudocode: the OPTICS algorithm and its supporting functions]
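The two distances that drive OPTICS can be sketched directly from their definitions in base R. The function names and the convention that a point's Eps-neighborhood includes the point itself are illustrative assumptions:

```r
# Core-distance: the smallest radius whose neighborhood holds at least minPts objects
core_distance <- function(D, i, eps, minPts) {
  d <- sort(D[i, D[i, ] <= eps])   # distances within the Eps-neighborhood (self included, d = 0)
  if (length(d) < minPts) NA else d[minPts]
}

# Reachability-distance of p from q: max(core-distance(q), d(p, q)), defined only if q is core
reach_distance <- function(D, p, q, eps, minPts) {
  cd <- core_distance(D, q, eps, minPts)
  if (is.na(cd)) NA else max(cd, D[p, q])
}

set.seed(42)
X <- matrix(rnorm(40), ncol = 2)
D <- as.matrix(dist(X))
core_distance(D, 1, eps = 10, minPts = 4)
reach_distance(D, 2, 1, eps = 10, minPts = 4)
```

The CRAN `dbscan` package also offers an `optics()` function that computes the full augmented ordering.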

The R implementation

Please take a look at the R code file ch_06_optics.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_06_optics.R")

Clustering web pages

Clustering web pages can be used to group related texts/articles and can serve as a preprocessing step for supervised learning; it enables automated categorization.

Web pages are heterogeneous, with different structures (well formed, badly formed, or without structure) and different content.

The Yahoo! industry web page data (CMU WebKB project, 1998) is used as the test data here.
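One common way to cluster pages is to convert them to a bag-of-words term matrix first. The following toy sketch does this with base R and k-means; the four mini "documents" and the choice of k-means are illustrative assumptions, not the book's treatment of the Yahoo! data:

```r
# Build a document-term matrix from toy documents, then cluster with k-means
docs <- c("stocks market trading shares",
          "market shares stocks rally",
          "football match goal score",
          "goal score football league")
words <- strsplit(docs, " ")
vocab <- unique(unlist(words))
dtm <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
set.seed(1)
km <- kmeans(dtm, centers = 2, nstart = 5)
km$cluster   # documents grouped by vocabulary overlap
```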


Visitor analysis in the browser cache and DENCLUE

DENsity-based CLUstEring (DENCLUE) is a density-based clustering algorithm that relies on density-distribution functions.

Before a detailed explanation of the DENCLUE algorithm, some concepts need to be introduced: the influence function, the density function, the gradient, and the density attractor.

The influence function of a specific data object can be any function; the Gaussian kernel centered at the data point is the usual choice:

f_Gauss(x, y) = e^(-d(x, y)^2 / (2 * sigma^2))

The density function at a point, x, is defined as the sum of the influence functions of all the data objects at this data point:

f^D(x) = sum over all data objects x_i of f_Gauss(x, x_i)

A point is defined as a density attractor if it is a local maximum of the density function. It is computed by hill climbing from a starting point, x^0:

x^(j+1) = x^j + delta * grad f^D(x^j) / |grad f^D(x^j)|

The gradient of the density function is defined by the following equation:

grad f^D(x) = sum over all data objects x_i of (x_i - x) * f_Gauss(x, x_i)

DENCLUE first defines a density function over the data point space. All local maxima of the density function are then found, and each data point is assigned to the nearest local maximum, the one that maximizes the density related to it. Each group of data points bound to a local maximum is defined as a cluster. As a postprocessing step, a cluster is discarded if the density of its bound local maximum is lower than a user-predefined threshold. Two clusters are merged if there exists a path between them such that each point on the path has a density value higher than the threshold.
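The density function and the hill-climbing rule above can be sketched in a few lines of R for one dimension; the value of sigma, the step size delta, and the toy points are illustrative assumptions:

```r
sigma <- 0.5
pts   <- c(1.0, 1.2, 1.1, 4.0, 4.2, 3.9)   # two dense groups on a line

density_at  <- function(x) sum(exp(-(x - pts)^2 / (2 * sigma^2)))   # sum of influences
gradient_at <- function(x) sum((pts - x) * exp(-(x - pts)^2 / (2 * sigma^2)))

# Hill climbing from a starting point toward a density attractor
hill_climb <- function(x, delta = 0.05, steps = 200) {
  for (s in seq_len(steps)) {
    g <- gradient_at(x)
    if (abs(g) < 1e-8) break
    x <- x + delta * g / abs(g)
  }
  x
}
hill_climb(1.5)   # ends near the attractor of the first group (about 1.1)
```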

The DENCLUE algorithm

The summarized pseudocode for the DENCLUE algorithm is as follows:

[Pseudocode: the DENCLUE algorithm]

The R implementation

Please take a look at the R code file ch_06_denclue.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_06_denclue.R")

Visitor analysis in the browser cache

Browser-cache analysis lets the website owner show the best-matched content to visitors; at the same time, it raises questions of privacy protection. The data instances in this context are browser caches, sessions, cookies, various logs, and so on.

The possible factors included in such data instances are the web address; the IP address (which denotes where the visitor comes from); the duration for which the visitor stayed on a specific page; the pages the user visited; the sequence of the visited pages; the date and time of every visit; and so on. The log can be specific to a certain website or span various websites. A more detailed description is given in the following table:

Hit

This refers to each element of a web page downloaded to a viewer's web browser (such as Internet Explorer, Mozilla, or Netscape). Hits do not correspond in any direct fashion to the number of pages viewed or number of visitors to a site. For example, if a viewer downloads a web page with three graphics, the web logfile will show four hits: one for the web page and one for each of the three graphics.

Unique Visitors

The actual number of viewers to the website that came from a unique IP address (see IP address in this table).

New/Return Visitors

The number of first-time visitors to the site compared to returning visitors.

Page views

The number of times a specified web page has been viewed; shows exactly what content people are (or are not) viewing at a website. Every time a visitor hits the page refresh button, another page view is logged.

Page views per visitor

The number of page views divided by the number of visitors; measures how many pages viewers look at each time they visit a website.

IP address

A numeric identifier for a computer. (The format of an IP address is a 32-bit numeric address written as four numbers separated by periods; each number can be zero to 255. For example, 1.160.10.240 could be an IP address.) The IP address can be used to determine a viewer's origin (that is, by country); it also can be used to determine the particular computer network a website's visitors are coming from.

Visitor location

The geographic location of the visitor.

Visitor language

The language setting on the visitor's computer.

Referring pages/sites (URLs)

Indicates how visitors get to a website (that is, whether they type the URL, or web address, directly into a web browser or if they click through from a link at another site).

Keywords

If the referring URL is a search engine, the keywords (search string) that the visitor used can be determined.

Browser type

The type of browser software a visitor is using (that is, Netscape, Mozilla, Internet Explorer, and so on)

Operating system version

The specific operating system the site visitor uses.

Screen resolution

The display settings for the visitor's computer.

Java or Flash-enabled

Whether or not the visitor's computer allows Java (a programming language for applications on the Web) and/or Flash (a software tool that allows web pages to be displayed with animation, or motion).

Connection speed

Whether visitors are accessing the website from a slower dial-up connection, high-speed broadband, or a T1 line.

Errors

The number of errors recorded by the server, such as a "404-file not found" error; can be used to identify broken links and other problems at the website.

Visit duration

Average time spent on the site (length the visitor stays on the site before leaving). Sites that retain visitors longer are referred to as "sticky" sites.

Visitor paths/navigation

How visitors navigate the website, by specific pages, most common entry pages (the first page accessed by a visitor at a website) and exit points (the page from which a visitor exits a Website), and so on. For example, if a large number of visitors leave the site after looking at a particular page, the analyst might infer that they either found the information they needed, or alternatively, there might be a problem with that page (is it the page where shipping and handling fees are posted, which maybe are large enough to turn visitors away?).

Bounce rate

The percentage of visitors who leave the site after the first page; calculated by the number of visitors who visit only a single page divided by the number of total visits. The bounce rate is sometimes used as another indicator of "stickiness."

Visitor analysis of this kind is essentially history sniffing, which is used for user-behavior analysis.
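Several metrics from the table can be computed directly from a page-view log. A minimal sketch follows, in which the `weblog` data frame and its column names are illustrative assumptions:

```r
# Toy web log: one row per page view
weblog <- data.frame(
  visitor = c("a", "a", "b", "c", "c", "c", "d"),
  page    = c("/home", "/buy", "/home", "/home", "/news", "/buy", "/home"),
  stringsAsFactors = FALSE
)

unique_visitors   <- length(unique(weblog$visitor))                  # 4
page_views        <- nrow(weblog)                                    # 7
views_per_visitor <- page_views / unique_visitors                    # 1.75
views_by_visitor  <- table(weblog$visitor)
bounce_rate       <- sum(views_by_visitor == 1) / unique_visitors    # single-page visitors: 0.5
c(views_per_visitor = views_per_visitor, bounce_rate = bounce_rate)
```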

The DENCLUE algorithm

The summarized pseudocodes for the DENCLUE algorithm is as follows:

The DENCLUE algorithm

The R implementation

Please take a look at the R codes file ch_06_denclue.R from the bundle of R codes for previously mentioned algorithm. The codes can be tested with the following command:

> source("ch_06_denclue.R")

Visitor analysis in the browser cache

The browser-cache analysis provides the website owner with the convenience that shows the best matched part to the visitors, and at the same time, it is related to their privacy protection. The data instances in this context are browser caches, sessions, cookies, various logs, and so on.

The possible factors included in certain data instances can be the Web address, IP address (denotes the position where the visitor comes from), the duration for which the visitor stayed on a specific page, the pages the user visited, the sequence of the visited pages, the date and time of every visit, and so on. The log can be specific to a certain website or to various websites. A more detailed description is given in the following table:

Hit

This refers to each element of a web page downloaded to a viewer's web browser (such as Internet Explorer, Mozilla, or Netscape). Hits do not correspond in any direct fashion to the number of pages viewed or number of visitors to a site. For example, if a viewer downloads a web page with three graphics, the web logfile will show four hits: one for the web page and one for each of the three graphics.

Unique Visitors

The actual number of viewers to the website that came from a unique IP address (see IP address in this table).

New/Return Visitors

The number of first-time visitors to the site compared to returning visitors.

Page views

The number of times a specified web page has been viewed; shows exactly what content people are (or are not) viewing at a website. Every time a visitor hits the page refresh button, another page view is logged.

Page views per visitor

The number of page views divided by the number of visitors; measures how many pages viewers look at each time they visit a website.

IP address

A numeric identifier for a computer. (The format of an IP address is a 32-bit numeric address written as four numbers separated by periods; each number can be zero to 255. For example, 1.160.10.240 could be an IP address.) The IP address can be used to determine a viewer's origin (that is, by country); it also can be used to determine the particular computer network a website's visitors are coming from.

Visitor location

The geographic location of the visitor.

Visitor language

The language setting on the visitor's computer.

Referring pages/sites (URLs)

Indicates how visitors get to a website (that is, whether they type the URL, or web address, directly into a web browser or if they click through from a link at another site).

Keywords

If the referring URL is a search engine, the keywords (search string) that the visitor used can be determined.

Browser type

The type of browser software a visitor is using (that is, Netscape, Mozilla, Internet Explorer, and so on)

Operating system version

The specific operating system the site visitor uses.

Screen resolution

The display settings for the visitor's computer.

Java or Flash-enabled

Whether or not the visitor's computer allows Java (a programming language for applications on the Web) and/or Flash (a software tool that allows web pages to be displayed with animation, or motion).

Connection speed

Whether visitors are accessing the website from a slower dial-up connection, high-speed broadband, or Tl.

Errors

The number of errors recorded by the server, such as a "404-file not found" error; can be used to identify broken links and other problems at the website.

Visit duration

Average time spent on the site (length the visitor stays on the site before leaving). Sites that retain visitors longer are referred to as "sticky" sites.

Visitor paths/navigation

How visitors navigate the website, by specific pages, most common entry pages (the first page accessed by a visitor at a website) and exit points (the page from which a visitor exits a Website), and so on. For example, if a large number of visitors leave the site after looking at a particular page, the analyst might infer that they either found the information they needed, or alternatively, there might be a problem with that page (is it the page where shipping and handling fees are posted, which maybe are large enough to turn visitors away?).

Bounce rate

The percentage of visitors who leave the site after the first page; calculated by the number of visitors who visit only a single page divided by the number of total visits. The bounce rate is sometimes used as another indicator of "stickiness."

The analysis of a visitor is basically history sniffing, which is used for user-behavior analysis.

The R implementation

Please take a look at the R codes file ch_06_denclue.R from the bundle of R codes for previously mentioned algorithm. The codes can be tested with the following command:

> source("ch_06_denclue.R")

Visitor analysis in the browser cache

The browser-cache analysis provides the website owner with the convenience that shows the best matched part to the visitors, and at the same time, it is related to their privacy protection. The data instances in this context are browser caches, sessions, cookies, various logs, and so on.

The possible factors included in certain data instances can be the Web address, IP address (denotes the position where the visitor comes from), the duration for which the visitor stayed on a specific page, the pages the user visited, the sequence of the visited pages, the date and time of every visit, and so on. The log can be specific to a certain website or to various websites. A more detailed description is given in the following table:

Hit

This refers to each element of a web page downloaded to a viewer's web browser (such as Internet Explorer, Mozilla, or Netscape). Hits do not correspond in any direct fashion to the number of pages viewed or number of visitors to a site. For example, if a viewer downloads a web page with three graphics, the web logfile will show four hits: one for the web page and one for each of the three graphics.

Unique Visitors

The actual number of viewers to the website that came from a unique IP address (see IP address in this table).

New/Return Visitors

The number of first-time visitors to the site compared to returning visitors.

Page views

The number of times a specified web page has been viewed; shows exactly what content people are (or are not) viewing at a website. Every time a visitor hits the page refresh button, another page view is logged.

Page views per visitor

The number of page views divided by the number of visitors; measures how many pages viewers look at each time they visit a website.

IP address

A numeric identifier for a computer. (The format of an IP address is a 32-bit numeric address written as four numbers separated by periods; each number can be zero to 255. For example, 1.160.10.240 could be an IP address.) The IP address can be used to determine a viewer's origin (that is, by country); it also can be used to determine the particular computer network a website's visitors are coming from.

Visitor location

The geographic location of the visitor.

Visitor language

The language setting on the visitor's computer.

Referring pages/sites (URLs)

Indicates how visitors get to a website (that is, whether they type the URL, or web address, directly into a web browser or if they click through from a link at another site).

Keywords

If the referring URL is a search engine, the keywords (search string) that the visitor used can be determined.

Browser type

The type of browser software a visitor is using (that is, Netscape, Mozilla, Internet Explorer, and so on)

Operating system version

The specific operating system the site visitor uses.

Screen resolution

The display settings for the visitor's computer.

Java or Flash-enabled

Whether or not the visitor's computer allows Java (a programming language for applications on the Web) and/or Flash (a software tool that allows web pages to be displayed with animation, or motion).

Connection speed

Whether visitors are accessing the website from a slower dial-up connection, high-speed broadband, or Tl.

Errors

The number of errors recorded by the server, such as a "404-file not found" error; can be used to identify broken links and other problems at the website.

Visit duration

Average time spent on the site (length the visitor stays on the site before leaving). Sites that retain visitors longer are referred to as "sticky" sites.

Visitor paths/navigation

How visitors navigate the website, by specific pages, most common entry pages (the first page accessed by a visitor at a website) and exit points (the page from which a visitor exits a Website), and so on. For example, if a large number of visitors leave the site after looking at a particular page, the analyst might infer that they either found the information they needed, or alternatively, there might be a problem with that page (is it the page where shipping and handling fees are posted, which maybe are large enough to turn visitors away?).

Bounce rate

The percentage of visitors who leave the site after the first page; calculated by the number of visitors who visit only a single page divided by the number of total visits. The bounce rate is sometimes used as another indicator of "stickiness."

The analysis of a visitor is basically history sniffing, which is used for user-behavior analysis.

Visitor analysis in the browser cache

The browser-cache analysis provides the website owner with the convenience that shows the best matched part to the visitors, and at the same time, it is related to their privacy protection. The data instances in this context are browser caches, sessions, cookies, various logs, and so on.

The possible factors included in certain data instances can be the Web address, IP address (denotes the position where the visitor comes from), the duration for which the visitor stayed on a specific page, the pages the user visited, the sequence of the visited pages, the date and time of every visit, and so on. The log can be specific to a certain website or to various websites. A more detailed description is given in the following table:

Hit

This refers to each element of a web page downloaded to a viewer's web browser (such as Internet Explorer, Mozilla, or Netscape). Hits do not correspond in any direct fashion to the number of pages viewed or number of visitors to a site. For example, if a viewer downloads a web page with three graphics, the web logfile will show four hits: one for the web page and one for each of the three graphics.

Unique Visitors

The actual number of viewers to the website that came from a unique IP address (see IP address in this table).

New/Return Visitors

The number of first-time visitors to the site compared to returning visitors.

Page views

The number of times a specified web page has been viewed; shows exactly what content people are (or are not) viewing at a website. Every time a visitor hits the page refresh button, another page view is logged.

Page views per visitor

The number of page views divided by the number of visitors; measures how many pages viewers look at each time they visit a website.

IP address

A numeric identifier for a computer. (The format of an IP address is a 32-bit numeric address written as four numbers separated by periods; each number can be zero to 255. For example, 1.160.10.240 could be an IP address.) The IP address can be used to determine a viewer's origin (that is, by country); it also can be used to determine the particular computer network a website's visitors are coming from.

Visitor location

The geographic location of the visitor.

Visitor language

The language setting on the visitor's computer.

Referring pages/sites (URLs)

Indicates how visitors get to a website (that is, whether they type the URL, or web address, directly into a web browser or if they click through from a link at another site).

Keywords

If the referring URL is a search engine, the keywords (search string) that the visitor used can be determined.

Browser type

The type of browser software a visitor is using (that is, Netscape, Mozilla, Internet Explorer, and so on)

Operating system version

The specific operating system the site visitor uses.

Screen resolution

The display settings for the visitor's computer.

Java or Flash-enabled

Whether or not the visitor's computer allows Java (a programming language for applications on the Web) and/or Flash (a software tool that allows web pages to be displayed with animation, or motion).

Connection speed

Whether visitors are accessing the website from a slower dial-up connection, high-speed broadband, or Tl.

Errors

The number of errors recorded by the server, such as a "404-file not found" error; can be used to identify broken links and other problems at the website.

Visit duration

Average time spent on the site (length the visitor stays on the site before leaving). Sites that retain visitors longer are referred to as "sticky" sites.

Visitor paths/navigation

How visitors navigate the website, by specific pages, most common entry pages (the first page accessed by a visitor at a website) and exit points (the page from which a visitor exits a Website), and so on. For example, if a large number of visitors leave the site after looking at a particular page, the analyst might infer that they either found the information they needed, or alternatively, there might be a problem with that page (is it the page where shipping and handling fees are posted, which maybe are large enough to turn visitors away?).

Bounce rate

The percentage of visitors who leave the site after the first page; calculated by the number of visitors who visit only a single page divided by the number of total visits. The bounce rate is sometimes used as another indicator of "stickiness."
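
As a quick illustration of this calculation, here is a toy Python sketch (the visit log, recorded as pages viewed per visit, is invented for the example):

```python
# Hypothetical visit log: number of pages viewed in each visit.
visits = [1, 5, 1, 3, 1, 2, 1, 1]

# Bounce rate = single-page visits / total visits.
single_page = sum(1 for pages in visits if pages == 1)
bounce_rate = single_page / len(visits)

print(round(bounce_rate, 3))  # 5 of 8 visits bounced: 0.625
```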

This kind of visitor analysis is essentially a form of history sniffing, and it is used for user-behavior analysis.

Recommendation system and STING

STatistical Information Grid (STING) is a grid-based clustering algorithm. The dataset is recursively divided into a hierarchical structure. The whole input dataset serves as the root node of the hierarchy. Each cell/unit in a layer is composed of several cells/units in the layer below. An example is shown in the following diagram:

[Figure: the hierarchical grid structure used by STING]

To support queries on a dataset, the statistical information of each unit is calculated in advance for further processing; this information is also called the statistical parameters.

The characteristics of the STING algorithm include (but are not limited to) the following:

  • A query-independent structure
  • Intrinsic parallelizability
  • Efficiency

The STING algorithm

The summarized pseudocode for the STING algorithm is as follows:

[Figure: pseudocode for the STING algorithm]
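
The core STING idea, precomputing per-cell statistics so that density queries never touch the raw points, can be sketched as follows (a toy Python illustration with an invented point set, not the book's ch_06_sting.R code):

```python
# Points are assigned to grid cells; each cell precomputes its
# statistical parameters (here just a count and coordinate sums),
# and queries are then answered from those statistics alone.
points = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.85, 0.95), (0.5, 0.5)]
GRID = 2  # a 2 x 2 bottom layer over the unit square

def cell_of(p):
    return (min(int(p[0] * GRID), GRID - 1), min(int(p[1] * GRID), GRID - 1))

# Precompute the statistical parameters per cell.
stats = {}
for p in points:
    c = cell_of(p)
    n, sx, sy = stats.get(c, (0, 0.0, 0.0))
    stats[c] = (n + 1, sx + p[0], sy + p[1])

def dense_cells(min_count):
    """Answer a density query from the precomputed statistics only."""
    return sorted(c for c, (n, _, _) in stats.items() if n >= min_count)

print(dense_cells(2))  # [(0, 0), (1, 1)]
```

This is what makes STING query-independent: the statistics are computed once, and any later query is resolved against them rather than against the data.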

The R implementation

Please take a look at the R code file ch_06_sting.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_06_sting.R")

Recommendation systems

Building on statistical, data-mining, and knowledge-discovery techniques, recommendation systems are used by most e-commerce sites to make it easy for consumers to find products to purchase. Their three main parts, the representation of the input data, neighborhood formation, and recommendation generation, are shown in the following diagram:

[Figure: the three main parts of a recommendation system]
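
The three parts can be sketched with a tiny user-based collaborative-filtering example in Python (the users, items, and ratings are made up for the illustration):

```python
import math

# Input representation: a user-item rating matrix (as nested dicts).
ratings = {
    "alice": {"bread": 5, "milk": 4, "jam": 1},
    "bob":   {"bread": 4, "milk": 5},
    "carol": {"jam": 5, "tea": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = math.sqrt(sum(r * r for r in u.values())) * \
          math.sqrt(sum(r * r for r in v.values()))
    return num / den if den else 0.0

def recommend(user, k=1):
    # Neighborhood formation: the k most similar users.
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    neighbors = [o for _, o in sorted(others, reverse=True)[:k]]
    # Recommendation generation: the neighbors' items unseen by the user.
    seen = set(ratings[user])
    items = {i for o in neighbors for i in ratings[o]} - seen
    return sorted(items)

print(recommend("bob"))  # ['jam']
```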

Web sentiment analysis and CLIQUE

CLustering In QUEst (CLIQUE) is a bottom-up, grid-based clustering algorithm. The idea behind this algorithm is the Apriori property, that is, the monotonicity of dense units with respect to dimensionality: if a set of data points, S, is a cluster in a k-dimensional projection of the space, then S is also contained in a cluster in any (k-1)-dimensional projection of this space.

The algorithm proceeds in passes. The one-dimensional dense units are produced by one pass through the data. The candidate k-dimensional units are generated using the candidate-generation procedure and the (k-1)-dimensional dense units determined in the (k-1)th pass.

The characteristics of the CLIQUE algorithm are as follows:

  • Efficient for high-dimensional datasets
  • Interpretability of results
  • Scalability and usability

The CLIQUE algorithm contains three steps to cluster a dataset. First, a group of subspaces is selected to cluster the dataset. Then, clustering is independently executed in every subspace. Finally, a concise summary of every cluster is produced in the form of a disjunctive normal form (DNF) expression.

The CLIQUE algorithm

The summarized pseudocode for the CLIQUE algorithm is as follows:

  1. Identification of the subspaces that contain clusters.
  2. Identification of the clusters.
  3. Generation of a minimal description for each cluster.

The candidate-generation algorithm is illustrated as follows:

[Figure: pseudocode for the candidate-generation procedure]
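
The Apriori-style candidate-generation step can be sketched in Python: two dense (k-1)-dimensional units that share k-2 dimensions are joined into a candidate k-dimensional unit, which is kept only if all of its (k-1)-dimensional projections are themselves dense (the monotonicity property). The dimension names and interval labels below are invented for the example:

```python
from itertools import combinations

def generate_candidates(dense_units):
    """Join dense (k-1)-dim units into candidate k-dim units."""
    dense = set(dense_units)  # each unit: frozenset of (dimension, interval)
    candidates = set()
    for a, b in combinations(dense, 2):
        joined = a | b
        if len(joined) != len(a) + 1:
            continue  # the two units do not share exactly k-2 dimensions
        dims = [d for d, _ in joined]
        if len(dims) != len(set(dims)):
            continue  # two different intervals on the same dimension
        # Prune: every (k-1)-dim projection must itself be dense.
        if all(frozenset(joined - {u}) in dense for u in joined):
            candidates.add(frozenset(joined))
    return candidates

# One-dimensional dense units on dimensions 0 and 1 (made-up intervals).
dense_1d = [frozenset({(0, "low")}), frozenset({(1, "high")}),
            frozenset({(1, "low")})]
print(sorted(sorted(u) for u in generate_candidates(dense_1d)))
```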

Here is the algorithm to find the connected components of the graph; this is equivalent to finding clusters:

[Figure: pseudocode for finding the connected components]
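
The cluster-finding step can be sketched as a graph search: dense grid units are the nodes, edges connect units that share a face, and a depth-first search yields the connected components, that is, the clusters. The dense units below are made up for the example:

```python
def find_clusters(dense_units):
    """Connected components over face-adjacent dense 2-D units."""
    dense = set(dense_units)

    def neighbors(u):
        x, y = u
        for v in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if v in dense:
                yield v

    seen, result = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:               # iterative depth-first search
            u = stack.pop()
            component.append(u)
            for v in neighbors(u):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        result.append(sorted(component))
    return result

# Two groups of dense units, which should become two clusters.
print(find_clusters({(0, 0), (0, 1), (1, 1), (5, 5), (5, 6)}))
```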

The R implementation

Please take a look at the R code file ch_06_clique.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_06_clique.R")

Web sentiment analysis

Web sentiment analysis is used to identify the idea or thought behind a text, for example, in the sentiment analysis of microblogs such as Twitter. One simple approach is to compare a post against a predefined list of labeled words to judge its sentiment. Another example is judging a movie review as thumbs up or thumbs down.
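
The word-list comparison can be sketched in a few lines of Python (the lexicon and its labels are made up for the example; real systems use curated lexicons and handle negation):

```python
# A small, invented labeled word list; the sign of the total score
# gives the sentiment judgment for a post.
lexicon = {"good": 1, "great": 1, "love": 1,
           "bad": -1, "awful": -1, "hate": -1}

def sentiment(text):
    score = sum(lexicon.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "thumbs up"
    return "thumbs down" if score < 0 else "neutral"

print(sentiment("I love this great movie"))      # thumbs up
print(sentiment("an awful plot and bad acting"))  # thumbs down
```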

Web sentiment analysis is also used in analyzing news articles for biases pertaining to specific views, in newsgroup evaluation, and so on.

Opinion mining and WAVE clustering

The WAVE clustering algorithm is a grid-based clustering algorithm. It relies on the relation between a spatial dataset and multidimensional signals. The idea is that the clusters in a multidimensional spatial dataset become more distinguishable after a wavelet transformation, that is, after applying wavelets to the input data or the preprocessed input dataset. The dense regions separated by sparse areas in the transformed result represent clusters.

The characteristics of the WAVE cluster algorithm are as follows:

  • Efficient for large datasets
  • Efficient for finding clusters of various shapes
  • Insensitive to noise and outliers
  • Insensitive to the input order of the dataset
  • Multiresolution, which is introduced by the wavelet transforms
  • Applicable to any numerical dataset

The WAVE cluster algorithm performs a few steps. In the first step, it creates a grid and assigns each data object from the input dataset to a unit in the grid. Then, it transforms the data into a new space by applying wavelet transform functions. Third, it finds the connected components in the new space. Finally, it maps the cluster labels of the related data objects back to the original data space.

The WAVE cluster algorithm

The summarized pseudocode for the WAVE cluster algorithm is as follows:

[Figure: pseudocode for the WAVE cluster algorithm]
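
The four steps can be sketched on a tiny 2-D grid in Python, using a single-level Haar low-pass transform (averaging 2 x 2 blocks) as the wavelet; the points and the density threshold are invented for the example:

```python
points = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6), (6, 7), (7, 6)]

SIZE = 8
grid = [[0] * SIZE for _ in range(SIZE)]
for x, y in points:                       # step 1: assign objects to units
    grid[x][y] += 1

half = SIZE // 2                          # step 2: Haar low-pass transform
low = [[(grid[2*i][2*j] + grid[2*i+1][2*j] +
         grid[2*i][2*j+1] + grid[2*i+1][2*j+1]) / 4.0
        for j in range(half)] for i in range(half)]

dense = {(i, j) for i in range(half) for j in range(half) if low[i][j] > 0.5}

labels, label = {}, 0                     # step 3: connected components
for cell in sorted(dense):
    if cell in labels:
        continue
    label += 1
    labels[cell] = label
    stack = [cell]
    while stack:
        i, j = stack.pop()
        for n in ((i-1, j), (i+1, j), (i, j-1), (i, j+1)):
            if n in dense and n not in labels:
                labels[n] = label
                stack.append(n)

# step 4: map the cluster labels back to the original data objects
assignment = {p: labels[(p[0] // 2, p[1] // 2)] for p in points}
print(assignment)
```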

The R implementation

Please take a look at the R code file ch_06_wave.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_06_wave.R")

Opinion mining

An entity possesses a number of features, which might be explicit or implicit. If a person or an organization expresses an opinion, then that person or organization is the opinion holder. An opinion on a specific feature is a positive or negative view, attitude, emotion, or appraisal of that feature by the opinion holder. Whether the opinion on a feature is positive, negative, or neutral denotes the opinion orientation.

Opinion mining means mining the opinions on certain features of the object or entity under research. The simplest case is to judge an opinion as positive or negative.

An opinion-orientation algorithm can be summarized as follows:

  • Identify opinion words and phrases
  • Handle negation
  • Handle the but clauses
  • Aggregate the opinions

The opinions in a sentence s can be aggregated into an orientation score for a feature f:

score(f) = Σ (wi ∈ s) wi.SO / d(wi, f)

Here, wi is an opinion word in s, d(wi, f) is the distance between the feature f and the opinion word wi, and wi.SO is the opinion orientation of wi.
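
A direct Python sketch of this aggregation, with distance measured in word positions and an invented two-word orientation lexicon:

```python
# Invented opinion-word orientations (wi.SO): +1 positive, -1 negative.
orientation = {"great": 1, "poor": -1}

def feature_score(words, feature):
    """score(f) = sum of wi.SO / d(wi, f) over opinion words wi."""
    f = words.index(feature)
    return sum(so / abs(i - f)
               for i, w in enumerate(words)
               if (so := orientation.get(w)) is not None and i != f)

s = "the battery life is great but the screen is poor".split()
print(round(feature_score(s, "battery"), 3))  # 1/3 - 1/8 = 0.208
```

Nearby opinion words dominate: "great" (three words from "battery") outweighs the distant "poor", so the battery's score is positive while the screen's is negative.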

User search intent and the EM algorithm

The Expectation Maximization (EM) algorithm is a probabilistic-model-based clustering algorithm that relies on a mixture model, in which the data is modeled by a mixture of simple distributions. The parameters of these models are estimated by Maximum Likelihood Estimation (MLE).

Mixture models assume that the data is generated by a combination of several simple probability distribution functions. Given K distribution functions, where the jth distribution has the parameter θj and Θ = {θ1, ..., θK} is the set of the parameters of all the distributions, the mixture density is:

P(x | Θ) = Σ (j=1..K) wj Pj(x | θj)

Here, wj is the weight of the jth distribution, and the weights sum to one.

The EM algorithm performs in the following way. In the first step, an initial group of model parameters is selected. The second step, the expectation step, calculates the probability that each data object xi belongs to the jth distribution:

P(j | xi, Θ) = wj Pj(xi | θj) / Σ (l=1..K) wl Pl(xi | θl)

The previous equation represents the probability of each data object belonging to each distribution. Maximization is the third step. Using the result of the expectation step, the parameter estimates are updated with the values that maximize the expected likelihood; for example, the weights are re-estimated as:

wj = (1/n) Σ (i=1..n) P(j | xi, Θ)

and the remaining parameters θj are updated analogously from the data weighted by these probabilities.

The expectation and maximization steps are repeated until the stopping condition is met, that is, until the changes in the parameter estimates fall below a certain threshold.
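
The loop can be sketched for a mixture of two one-dimensional Gaussians in Python (the data points and starting parameters are invented for the example; this is a toy illustration, not the book's ch_06_em.R code):

```python
import math

data = [0.0, 0.2, 0.1, 5.0, 5.2, 4.9]
w = [0.5, 0.5]            # mixture weights wj
mu = [0.0, 1.0]           # means
sigma = [1.0, 1.0]        # standard deviations

def pdf(x, m, s):
    """Gaussian density with mean m and standard deviation s."""
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

for _ in range(50):
    # Expectation: probability of each point belonging to each distribution.
    resp = []
    for x in data:
        p = [w[j] * pdf(x, mu[j], sigma[j]) for j in range(2)]
        total = sum(p)
        resp.append([pj / total for pj in p])
    # Maximization: update the parameter estimates.
    for j in range(2):
        nj = sum(r[j] for r in resp)
        w[j] = nj / len(data)
        mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
        sigma[j] = math.sqrt(
            sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj) or 1e-6

print([round(m, 1) for m in sorted(mu)])  # the two cluster means
```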

The EM algorithm

The summarized pseudocode for the EM algorithm is as follows:

[Figure: pseudocode for the EM algorithm]

The R implementation

Please take a look at the R code file ch_06_em.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_06_em.R")

The user search intent

Determining the user search intent is an important yet difficult problem, given the sparse data available about the searcher and the queries.

User intent has a wide range of applications, such as clustering query refinements, building user-intention profiles, and inducing web search intent. Given web search engine queries, finding the user's intention is also a key requirement.

The sequence of clicks on the search results can serve as a good basis for determining the user's interests and preferences.

Web search personalization is another important application of user search intention. It is related to the user's context and intention. When the user's intention is taken into account, more effective and relevant information can be provided.

Customer purchase data analysis and clustering high-dimensional data

For clustering in high-dimensional data spaces, two major issues arise: efficiency and quality. New algorithms are needed to deal with this type of dataset. Two popular strategies are applied. One is the subspace-clustering strategy, which finds clusters in subspaces of the original data space. The other is the dimensionality-reduction strategy, in which a lower-dimensional data space is created for further clustering.
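
A toy Python sketch of the dimensionality-reduction strategy: keep only the attribute with the highest variance, then cluster in that one-dimensional space by splitting at large gaps. The data and the gap threshold are invented for the example; a real implementation would use PCA or a similar projection rather than a single attribute:

```python
data = [(0.1, 10.0, 0.2), (0.2, 10.1, 0.1), (0.1, 50.0, 0.3),
        (0.3, 50.2, 0.2)]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Dimensionality reduction: keep the most informative attribute.
dims = len(data[0])
best = max(range(dims), key=lambda d: variance([p[d] for p in data]))

# Cluster the reduced 1-D data: a gap larger than 5.0 starts a new cluster.
order = sorted(range(len(data)), key=lambda i: data[i][best])
groups, current = [], [order[0]]
for prev, i in zip(order, order[1:]):
    if data[i][best] - data[prev][best] > 5.0:
        groups.append(current)
        current = []
    current.append(i)
groups.append(current)

print(best, groups)  # attribute 1 is kept; two clusters of point indices
```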

MAFIA is an efficient and scalable subspace-clustering algorithm for high-dimensional and large datasets.

The MAFIA algorithm

The summarized pseudocode for the MAFIA algorithm is as follows:

[Figure: pseudocode for the MAFIA algorithm]

The summarized pseudocode for the parallel MAFIA algorithm is as follows:

[Figure: pseudocode for the parallel MAFIA algorithm]

The SURFING algorithm

The summarized pseudocode for the SURFING algorithm is as follows. It selects interesting features from the original attributes of the dataset.

[Figure: pseudocode for the SURFING algorithm]

The R implementation

Please take a look at the R code file ch_06_surfing.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_06_surfing.R")

Customer purchase data analysis

Customer purchase data analysis has many applications, such as customer-satisfaction analysis.

For example, one application of customer purchase data analysis is finding unwanted consumption or characterizing a user's purchase behavior.

SNS and clustering graph and network data

The clustering of graph and network data has wide application in modern life, for example, in social networking. However, more challenges crop up along with these needs: high computational cost, sophisticated graphs, and high dimensionality and sparsity are the major concerns. With some special transformations, these issues can be transformed into graph-cut problems.

Structural Clustering Algorithm for Networks (SCAN) is one of the algorithms that searches for well-connected components in a graph as clusters.

[Figure: an example of graph and network data]

The SCAN algorithm

The summarized pseudocode for the SCAN algorithm is as follows:

[Figure: pseudocode for the SCAN algorithm]
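
SCAN's central notion, the structural similarity between two vertices and the cores it induces, can be sketched in Python (the graph edges and the eps and mu thresholds are invented for the example):

```python
import math

# Two triangles joined through vertex 3, plus a pendant vertex 7.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 7), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def sigma(u, v):
    """Structural similarity: shared closed neighborhood, normalized."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def is_core(u, eps=0.7, mu=2):
    """A core vertex has at least mu eps-similar neighbors."""
    return sum(1 for v in adj[u] if sigma(u, v) >= eps) >= mu

print(sorted(u for u in adj if is_core(u)))
```

Vertices inside the two triangles are cores and seed the clusters, while the pendant vertex 7 is not a core; the full algorithm grows clusters from the cores and classifies the rest as hubs or outliers.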

The R implementation

Please take a look at the R code file ch_06_scan.R from the code bundle for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_06_scan.R")

Social networking service (SNS)

Social networks have become the most popular online communication method nowadays. The analysis of SNS data has become important because of requirements related to security, business, control, and so on.

The foundation of SNS analysis is graph theory, especially for SNS mining tasks such as finding social communities and detecting the abuse of an SNS for bad purposes.

Clustering is an inherent application for SNS, used to find communities (community detection). Random walks are another key technology for SNS analysis and are also used to find communities.

Time for action

Here are some questions to check whether you have understood the concepts:

  • What is the DBSCAN algorithm?
  • What is the SCAN algorithm?
  • What is the STING algorithm?
  • What is the OPTICS algorithm?
  • What is the constraint-based cluster method?

Summary

In this chapter, we covered the following facts:

  • DBSCAN depends on a density-based description of clusters. By searching for and measuring the density of data points, high density indicates a high possibility of the existence of a cluster, while the remaining points are outliers or noise.
  • OPTICS produces a cluster ordering that consists of the order of the data objects together with the corresponding reachability and core values.
  • You learned that DENCLUE is a clustering algorithm based on a set of specific density-distribution functions and can find arbitrarily shaped clusters. It first segments the dataset into cubes and identifies the local density-distribution function. You also learned that a hill-climbing algorithm is performed to retrieve the local maximum for each item in the related cube, with which a cluster is built.
  • We saw that STING is based on a grid-like data structure and segments the embedding spatial area of the input data points into rectangular units. It is mainly used for spatial datasets.
  • You learned that CLIQUE is a grid-based clustering algorithm that finds subspaces of high-dimensional data, and it can find dense units in high-dimensional data too.
  • You know that the WAVE cluster is a grid-based clustering algorithm based on wavelet transformations. It supports multiresolution analysis and is efficient for large datasets.
  • You learned that EM is a probabilistic-model-based clustering algorithm, where each data point belongs to a cluster with a certain probability. It is based on the assumption that the attributes of the data points have values that are a linear combination of simple distributions.
  • We discussed strategies for clustering high-dimensional data.
  • We discussed clustering graph and network data.

In the next chapter, we will cover the major topics related to outlier detection and algorithms, and look at some of their examples.