Book Image

R: Mining spatial, text, web, and social media data

By : Nathan H. Danneman, Richard Heimann, Pradeepta Mishra, Bater Makhabel
Book Image

R: Mining spatial, text, web, and social media data

By: Nathan H. Danneman, Richard Heimann, Pradeepta Mishra, Bater Makhabel

Overview of this book

Data mining is the first step to understanding data and making sense of heaps of data. Properly mined data forms the basis of all data analysis and computing performed on it. This learning path will take you from the very basics of data mining to advanced data mining techniques, and will end up with a specialized branch of data mining—social media mining. You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, association, and correlation while working with R programs. You will discover how to write code for various predication models, stream data, and time-series data. You will also be introduced to solutions written in R based on R Hadoop projects. Now that you are comfortable with data mining with R, you will move on to implementing your knowledge with the help of end-to-end data mining projects. You will learn how to apply different mining concepts to various statistical and data applications in a wide range of fields. At this stage, you will be able to complete complex data mining cases and handle any issues you might encounter during projects. After this, you will gain hands-on experience of generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially-generated data while providing a theoretical background to accurately interpret your findings. You will be shown R code and examples of data that can be used as a springboard as you get the chance to undertake your own analyses of business, social, or political data. This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products: ? Learning Data Mining with R by Bater Makhabel ? R Data Mining Blueprints by Pradeepta Mishra ? Social Media Mining with R by Nathan Danneman and Richard Heimann
Table of Contents (6 chapters)

Chapter 9. Graph Mining and Network Analysis

In this chapter, you will learn the algorithms written in R for graph mining and network analysis.

In this chapter, we will cover the following topics:

  • Graph mining
  • Mining frequent subgraph patterns
  • Social network mining
  • Social influence mining

Graph mining

Grouping, messaging, dating, and many other means are the major forms of social communication or the classic social behavior in the social network. All these concepts are modeled with graphs; that is, nodes, edges, and other attributes. Graph mining is developed to mine this information, which is similar to other types of information, such as biological information, and so on.

Graph

Graph G contains nodes V and edges E and is represented with an equation, G = (V, E). As per graph mining, there are some concepts that need to be clarified. There are two types of graphs: directed graphs, which have ordered pairs of vertices in the edge set, E, and undirected graphs.

Graph mining algorithms

Although the data instances under research are very different from the other data types that we saw earlier in this book, graph-mining algorithms still include frequent pattern (subgraph) mining, classification, and clustering.

In the next section, we will look at frequent subgraph patterns mining algorithm, links mining, and clustering.

Mining frequent subgraph patterns

The subgraph pattern or graph pattern is an important application of data mining; this is used for bioinformatics, social network analysis, and so on. Frequent subgraph patterns are patterns that occur frequently in a set of graphs or in a large graph.

The gPLS algorithm

The gPLS algorithm

The GraphSig algorithm

The GraphSig algorithm

The gSpan algorithm

The summarized pseudocode for the gSpan algorithm is as follows:

The gSpan algorithm

Rightmost path extensions and their supports

Rightmost path extensions and their supports

The subgraph isomorphism enumeration algorithm

The subgraph isomorphism enumeration algorithm

The canonical checking algorithm

The canonical checking algorithm

The R implementation

Please take a look at the R codes file ch_09_gspan.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_gspan.R")

The gPLS algorithm

The gPLS algorithm

The GraphSig algorithm

The GraphSig algorithm

The gSpan algorithm

The summarized pseudocode for the gSpan algorithm is as follows:

The gSpan algorithm

Rightmost path extensions and their supports

Rightmost path extensions and their supports

The subgraph isomorphism enumeration algorithm

The subgraph isomorphism enumeration algorithm

The canonical checking algorithm

The canonical checking algorithm

The R implementation

Please take a look at the R codes file ch_09_gspan.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_gspan.R")

The GraphSig algorithm

The GraphSig algorithm

The gSpan algorithm

The summarized pseudocode for the gSpan algorithm is as follows:

The gSpan algorithm

Rightmost path extensions and their supports

Rightmost path extensions and their supports

The subgraph isomorphism enumeration algorithm

The subgraph isomorphism enumeration algorithm

The canonical checking algorithm

The canonical checking algorithm

The R implementation

Please take a look at the R codes file ch_09_gspan.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_gspan.R")

The gSpan algorithm

The summarized pseudocode for the gSpan algorithm is as follows:

The gSpan algorithm

Rightmost path extensions and their supports

Rightmost path extensions and their supports

The subgraph isomorphism enumeration algorithm

The subgraph isomorphism enumeration algorithm

The canonical checking algorithm

The canonical checking algorithm

The R implementation

Please take a look at the R codes file ch_09_gspan.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_gspan.R")

Rightmost path extensions and their supports

Rightmost path extensions and their supports

The subgraph isomorphism enumeration algorithm

The subgraph isomorphism enumeration algorithm

The canonical checking algorithm

The canonical checking algorithm

The R implementation

Please take a look at the R codes file ch_09_gspan.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_gspan.R")

The subgraph isomorphism enumeration algorithm

The subgraph isomorphism enumeration algorithm

The canonical checking algorithm

The canonical checking algorithm

The R implementation

Please take a look at the R codes file ch_09_gspan.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_gspan.R")

The canonical checking algorithm

The canonical checking algorithm

The R implementation

Please take a look at the R codes file ch_09_gspan.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_gspan.R")

The R implementation

Please take a look at the R codes file ch_09_gspan.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_gspan.R")

Social network mining

Social network is based on human interactions, from the most classical definition. The data instances collected in the social network have graph-like and temporal characteristics. There are two major strategies for data mining tasks for social networks: one is linkage-based or structure-based, and the other is content-based. The data instances collected in the social network also have two kinds of data instances: static and dynamic or times-series data, such as the tweets on Twitter. Due to the characteristics of the data instance of graphs, there are vast versatile algorithms developed to solve the challenges.

Community detection and the shingling algorithm

Community detection and the shingling algorithm
Community detection and the shingling algorithm
Community detection and the shingling algorithm
Community detection and the shingling algorithm

The node classification and iterative classification algorithms

The node classification and iterative classification algorithms

The second-order algorithm to reduce the number of iterations is as follows:

The node classification and iterative classification algorithms

The R implementation

Please take a look at the R codes file ch_09_shingling.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_shingling.R")

Community detection and the shingling algorithm

Community detection and the shingling algorithm
Community detection and the shingling algorithm
Community detection and the shingling algorithm
Community detection and the shingling algorithm

The node classification and iterative classification algorithms

The node classification and iterative classification algorithms

The second-order algorithm to reduce the number of iterations is as follows:

The node classification and iterative classification algorithms

The R implementation

Please take a look at the R codes file ch_09_shingling.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_shingling.R")

The node classification and iterative classification algorithms

The node classification and iterative classification algorithms

The second-order algorithm to reduce the number of iterations is as follows:

The node classification and iterative classification algorithms

The R implementation

Please take a look at the R codes file ch_09_shingling.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_shingling.R")

The R implementation

Please take a look at the R codes file ch_09_shingling.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_09_shingling.R")

Time for action

Here are some practice questions for you to check whether you have understood the concepts:

  • What is a graph?
  • What graph opportunities are used?
  • What is the PageRank algorithm, and what is its application in web search?

Summary

In this chapter, we looked at:

  • Graph mining. We also saw that the characteristics of graph data can be divided into frequent pattern mining, classification, and clustering
  • Mining frequent subgraph patterns is done to find the frequent patterns in a set of graphs or a single massive graph
  • Social network analysis includes a wide range of web applications with broad definitions, such as Facebook, LinkedIn, Google+, StackOverflow, and so on

In the next chapter, we will focus on the major topics related to web mining and algorithms and look at some examples based on them.