R: Mining spatial, text, web, and social media data

By: Nathan H. Danneman, Richard Heimann, Pradeepta Mishra, Bater Makhabel

Overview of this book

Data mining is the first step to understanding data and making sense of heaps of it. Properly mined data forms the basis of all the data analysis and computing performed on it. This Learning Path will take you from the very basics of data mining to advanced data-mining techniques, and will end with a specialized branch of data mining: social media mining. You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, associations, and correlations while working with R programs. You will discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on R Hadoop projects.

Once you are comfortable with data mining with R, you will move on to implementing your knowledge with the help of end-to-end data-mining projects. You will learn how to apply different mining concepts to various statistical and data applications in a wide range of fields. At this stage, you will be able to complete complex data-mining cases and handle any issues you might encounter during projects.

After this, you will gain hands-on experience of generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially generated data, along with a theoretical background for accurately interpreting your findings. You will be shown R code and examples of data that can be used as a springboard as you undertake your own analyses of business, social, or political data.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Learning Data Mining with R by Bater Makhabel
  • R Data Mining Blueprints by Pradeepta Mishra
  • Social Media Mining with R by Nathan Danneman and Richard Heimann

Chapter 10. Mining Text and Web Data

In this chapter, you will learn about algorithms written in R for text mining and web data mining.

For text mining, semi-structured and unstructured documents are the main data sources. There are a few major categories of text-mining tasks, such as clustering, document retrieval and representation, and anomaly detection. The applications of text mining include, but are not limited to, topic tracking, text summarization, and text categorization.

Web content mining, web structure mining, and web usage mining are the major categories of web mining. Web mining is also used for user-behavior modeling, personalized views, content annotation, and so on. In another respect, web mining integrates the results of traditional data-mining technologies with information from the WWW.

In this chapter, we will cover the following topics:

  • Text mining and the TM package
  • Text summarization
  • The question answering system
  • Genre categorization of web pages
  • Categorization of newspaper articles and newswires into topics
  • Web usage mining with web logs

Text mining and the TM package

Due to the characteristics of text and documents, traditional data-mining algorithms need some minor adjustments or extensions before they can be applied to text. The classical text-mining process is as follows:

[Figure: the classical text-mining process]

The popular text-clustering algorithms include the distance-based clustering algorithm, the hierarchical clustering algorithm, the partition-based clustering algorithm, and so on.

The popular text-classification algorithms include decision trees, pattern-based classification, SVM classification, Bayesian classification, and so on.

Word extraction is a popular preprocessing step: documents are cleaned and tokenized into the terms on which the mining algorithms operate; a sketch with the TM package follows.
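This is a minimal sketch, assuming the tm and SnowballC packages are installed; the two example documents are invented for illustration:

library(tm)         # text-mining framework
library(SnowballC)  # stemming support for stemDocument()

# A toy corpus; in practice, documents are read from files or a data frame
docs <- c("Text mining extracts new information from documents.",
          "Web mining analyzes the content, structure, and usage of the Web.")
corpus <- VCorpus(VectorSource(docs))

# Standard cleaning: lowercase, strip punctuation/numbers/stopwords, stem
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Term-document matrix: rows are (stemmed) terms, columns are documents
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)

The resulting term-document matrix is the usual input for the clustering and classification algorithms mentioned above.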

Text summarization

The goal of text summarization is to generate a concise and coherent summary of the major information in the input. Most summarization systems perform three major steps:

  1. Build an intermediate representation that captures the key points of the input text.
  2. Score the sentences of the input using the output of the first step.
  3. Select the top-ranked sentences to set up a final summary that represents the input documents.

One popular strategy is to remove unimportant information, clauses, or sentences and, at the same time, build classifiers to make sure that the key information is not thrown away; viewed another way, the relative importance of topics drives the summarization process. The final result is presented in a coherent way.

Summarization is a dynamic, ongoing process. First, we build summaries over the existing set of old documents, that is, multidocument summarization. Then, as new documents arrive, we summarize them against the result of the first step.

Because it is difficult to generate new sentences for a summary, one solution is extractive summarization, that is, extracting the most relevant sentences from the document or document set. As the document set grows, the time evolution of topics also becomes an important issue, for which statistical topic models such as time-series-based algorithms have been proposed.

There are two popular solutions for the intermediate representation: topic representation and indicator representation. Sentence scoring assigns each sentence a score determined by a combination of important features. Summary sentence selection then chooses the N most important sentences.

Several characteristics describe a final summary: it can be indicative or informative, an extract or an abstract, generic or query-oriented, contain background or just the news, be monolingual or cross-lingual, and cover a single document or multiple documents.

The benefit of text summarization is improved efficiency of document processing: the summary of a document helps the reader decide whether to analyze that document for a specific purpose. One example is summarizing a multilingual, large (nearly unbounded), dynamic set of documents from various sources, including the Web. Applications include the summarization of medical articles, e-mails, web pages, speech, and so on.

Topic representation

Topic representations, such as the topic signature, play an important role in document-summarization systems. Various topic representations have been proposed, such as the topic signature, the enhanced topic signature, the thematic signature, and so on.

The topic signature is defined as a set of related terms: the topic is the target concept, and the signature is a list of terms related to the topic, each with a specific weight. Each term can be a stemmed content word, a bigram, or a trigram:

$TS = \{topic, signature\} = \{t, \langle (t_1, w_1), (t_2, w_2), \dots, (t_n, w_n) \rangle\}$

The topic-term selection process is as follows. The input document set is divided into two sets: texts relevant and nonrelevant to the topic. Two hypotheses are defined:

$H_1: P(C \mid t_i) = p = P(C \mid \bar{t_i})$

$H_2: P(C \mid t_i) = p_1 \neq p_2 = P(C \mid \bar{t_i})$

The first hypothesis states that the relevance of a document to the topic C is independent of the term $t_i$, whereas the second indicates a strong relevance given the presence of the term ($p_1 \gg p_2$). Given the relevant set $R$ and the nonrelevant set $\bar{R}$, the 2 x 2 contingency table is:

$\begin{array}{c|cc} & R & \bar{R} \\ \hline t_i & O_{11} & O_{12} \\ \bar{t_i} & O_{21} & O_{22} \end{array}$

Here, $O_{11}$ is the frequency of the term $t_i$ occurring in $R$, and $O_{12}$ is the frequency of $t_i$ occurring in $\bar{R}$; $O_{21}$ is the frequency of terms other than $t_i$ occurring in $R$, and $O_{22}$ is the frequency of terms other than $t_i$ occurring in $\bar{R}$.

The likelihood of both hypotheses is calculated as follows; here, b denotes the binomial distribution:

$b(k; n, x) = \binom{n}{k} x^k (1 - x)^{n - k}$

$L(H_1) = b(O_{11};\ O_{11} + O_{12},\ p) \cdot b(O_{21};\ O_{21} + O_{22},\ p)$

$L(H_2) = b(O_{11};\ O_{11} + O_{12},\ p_1) \cdot b(O_{21};\ O_{21} + O_{22},\ p_2)$

Candidate terms are then ranked by the log-likelihood ratio $-2 \log \lambda = -2 \log \left( L(H_1) / L(H_2) \right)$, which is asymptotically $\chi^2$-distributed; the highest-scoring terms form the topic signature.

The algorithm to create the topic signature for a given topic is illustrated as follows:

[Figure: the topic-signature construction algorithm]
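As a rough illustration of this scoring, here is a small R sketch written directly from the preceding formulas; the toy counts and the chi-squared cutoff are assumptions made for the example, and edge cases such as zero counts are not handled:

# Log-likelihood ratio score of one candidate term from its 2 x 2 table.
# o11: frequency of the term in the relevant set R
# o12: frequency of the term in the nonrelevant set
# o21/o22: frequencies of all other terms in R and in the nonrelevant set
llr_score <- function(o11, o12, o21, o22) {
  p  <- (o11 + o21) / (o11 + o12 + o21 + o22)  # H1: one shared probability of falling in R
  p1 <- o11 / (o11 + o12)                      # H2: share of the term's occurrences in R
  p2 <- o21 / (o21 + o22)                      # H2: share of other terms' occurrences in R
  # log b(k; n, x), dropping the binomial coefficient (it cancels in the ratio)
  lb <- function(k, n, x) k * log(x) + (n - k) * log(1 - x)
  logL1 <- lb(o11, o11 + o12, p)  + lb(o21, o21 + o22, p)
  logL2 <- lb(o11, o11 + o12, p1) + lb(o21, o21 + o22, p2)
  -2 * (logL1 - logL2)
}

# A term that occurs mostly in relevant documents scores high
llr_score(o11 = 50, o12 = 5, o21 = 1000, o22 = 2000)
# Terms whose score exceeds a chi-squared cutoff join the topic signature
qchisq(0.999, df = 1)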

The multidocument summarization algorithm

Here is the Graph-Based Sub-topic Partition (GSPSummary) algorithm for multidocument summarization:

[Figure: the GSPSummary algorithm]

The GSPRankMethod is defined as follows:

[Figure: the GSPRankMethod definition]
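The pseudocode for GSPSummary and GSPRankMethod appears only in the figures above and is not reproduced here. As a stand-in, the following sketch shows a related and widely used graph-based idea: rank sentences by centrality in a sentence-similarity graph (in the style of LexRank/TextRank). The toy sentences, the word-overlap similarity, and the use of the igraph package are all assumptions for the example:

library(igraph)  # graph construction and PageRank

sentences <- c("Text mining finds new information in document sets.",
               "Summarization condenses a document set into a few sentences.",
               "Graph-based methods rank sentences by their centrality.",
               "Central sentences are chosen for the summary.")

# Word-overlap similarity between pairs of tokenized sentences
tokenize <- function(s) unique(tolower(unlist(strsplit(s, "[^[:alnum:]]+"))))
toks <- lapply(sentences, tokenize)
n <- length(sentences)
sim <- matrix(0, n, n)
for (i in 1:n) for (j in 1:n)
  if (i != j)
    sim[i, j] <- length(intersect(toks[[i]], toks[[j]])) /
                 log(length(toks[[i]]) + length(toks[[j]]))

# Rank sentences with PageRank over the weighted similarity graph
g <- graph_from_adjacency_matrix(sim, mode = "undirected", weighted = TRUE)
scores <- page_rank(g)$vector

# The two top-ranked sentences form the extractive summary
sentences[order(scores, decreasing = TRUE)[1:2]]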

The Maximal Marginal Relevance algorithm

Maximal Marginal Relevance (MMR), which is particularly suited to query-based and multidocument summarization, selects the most important remaining sentence in each iteration of sentence selection; each newly selected sentence must also have minimal redundancy with the sentences already selected.

$MMR = \arg\max_{s_i \in D \setminus S} \left[ \lambda\, Sim_1(s_i, q) - (1 - \lambda) \max_{s_j \in S} Sim_2(s_i, s_j) \right]$

Here, $D$ is the candidate sentence set, $S$ is the set of already selected sentences, $q$ is the query, and $\lambda \in [0, 1]$ trades relevance off against redundancy.

The summarized algorithm of MMR is as follows:

[Figure: the MMR algorithm]
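The full implementation is in ch_10_mmr.R (see the next section); as a minimal standalone sketch, the following code greedily applies the MMR formula. The Jaccard word-overlap similarity, the toy sentences, and lambda = 0.7 are assumptions made for illustration:

# Greedy MMR: pick k sentences balancing query relevance against redundancy
mmr_select <- function(sentences, query, k, lambda = 0.7, sim) {
  selected <- integer(0)
  candidates <- seq_along(sentences)
  while (length(selected) < k && length(candidates) > 0) {
    score <- sapply(candidates, function(i) {
      relevance <- sim(sentences[i], query)
      redundancy <- if (length(selected) == 0) 0
                    else max(sapply(selected, function(j) sim(sentences[i], sentences[j])))
      lambda * relevance - (1 - lambda) * redundancy
    })
    best <- candidates[which.max(score)]
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  sentences[selected]
}

# Jaccard word overlap as a stand-in for both Sim1 and Sim2
jaccard <- function(a, b) {
  ta <- unique(tolower(strsplit(a, "[^[:alnum:]]+")[[1]]))
  tb <- unique(tolower(strsplit(b, "[^[:alnum:]]+")[[1]]))
  length(intersect(ta, tb)) / length(union(ta, tb))
}

mmr_select(c("MMR balances relevance and redundancy.",
             "Redundant sentences are penalized by MMR.",
             "Graph methods rank sentences by centrality."),
           query = "How does MMR handle redundancy?", k = 2, sim = jaccard)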

The R implementation

Please take a look at the R code file ch_10_mmr.R from the code bundle for the preceding algorithms. The code can be tested with the following command:

> source("ch_10_mmr.R")

The question answering system

The question answering (QA) system is a hot topic related to IR, IE, NLP, data mining, and so on. A QA system mines a large set of texts to find a short phrase or sentence that answers a user's question with a certain precision. If the source is the Web or, even further, the whole information world, the challenge for the QA system increases dramatically.

There are basically three kinds of QA systems: slot filling, in which the formats of the question and the answer are similar; limited domain, in which the domains are restricted with dictionaries and ontologies; and open domain, in which there is no limitation on the domain.

One conceptual architecture among the various solutions is illustrated in the following image. The input of the QA system is a natural-language question; the output, that is, the answer to the question, is provided in natural language too. The system is composed of three major parts: the user interface, the question-processing part, and the answer-generation part.

Assuming the question is a coherent sentence, questions from the user interface are processed, and the final query is generated after question-term identification steps such as word segmentation, keyword extraction, and expansion. The ontology base serves to support query expansion.

The answer is then selected or generated from the FAQ base using the query produced in the earlier steps.

[Figure: a conceptual architecture of a question answering system]
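As a toy illustration of the retrieval step only (not the full architecture), the following sketch matches a question against a small invented FAQ base by cosine similarity over a document-term matrix built with the tm package:

library(tm)

faq <- c("Text mining extracts information from document collections.",
         "MMR selects sentences that are relevant but not redundant.",
         "Web usage mining analyzes server logs to model user behavior.")
question <- "What does web usage mining analyze?"

# One corpus holding the FAQ entries plus the question as the last document
corpus <- VCorpus(VectorSource(c(faq, question)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
m <- as.matrix(DocumentTermMatrix(corpus))

# Cosine similarity between the question (last row) and each FAQ entry
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
q <- m[nrow(m), ]
scores <- apply(m[-nrow(m), , drop = FALSE], 1, cosine, b = q)

faq[which.max(scores)]  # the best-matching FAQ entry serves as the answer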

One popular question answering system is Yahoo! Answers, where a huge number of questions are asked and answered every day.

Genre categorization of web pages

Genre categorization can be applied to large article corpora and to web pages. A genre can be defined in terms of purpose and physical form. It denotes any widely accepted category of texts defined by a common communicative purpose or other functional traits, and the set of categories is extensible. A web genre can also be defined based on facets, the complexity of the language, subjectivity, and the number of graphics. Genre categorization has many applications, such as improving search efficiency and satisfying users' information needs.

For the WWW or web pages, a genre can also be defined in terms of its usability to the miner and its feasibility with respect to efficiency.

There are some major challenges in web-page genre categorization. One is the instability of the Web itself. The second is the complex and unpredictable nature of web pages. The third is how to judge the genre of a specific web page: certain web pages might fit multiple genres, or no genre in the existing recognized genre libraries. There are more challenges, which are not listed here or will only appear in future applications.

Due to the fast pace of the Web's evolution, new genres are continuously introduced alongside the current genre classes, and the current genre classes are continuously updated and upgraded.

Possible solutions include, but are not limited to, Naïve Bayes, k-Nearest Neighbors, SVM, and decision-tree classifiers; a small sketch of one such classifier follows.
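As an illustration of the classification step, here is a minimal sketch using the e1071 package's naiveBayes function; the page features, their values, and the genre labels are all invented for the example:

library(e1071)  # provides naiveBayes()

# Invented training data: simple page features and a genre label
pages <- data.frame(
  has_price_terms  = factor(c("yes", "yes", "no", "no", "no")),
  avg_sentence_len = c(8, 10, 22, 25, 20),
  num_images       = c(12, 9, 1, 0, 2),
  genre            = factor(c("shop", "shop", "article", "article", "article"))
)

model <- naiveBayes(genre ~ ., data = pages)

# Classify a new page described by the same features
new_page <- data.frame(has_price_terms  = factor("yes", levels = c("no", "yes")),
                       avg_sentence_len = 9,
                       num_images       = 7)
predict(model, new_page)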

Categorizing newspaper articles and newswires into topics

Articles and newswires are a huge periodical source of knowledge of events at different periods of time. Text classification is the preprocessing step for storing all these documents in a specific corpus, and the categorization of text is the basis of text processing.

We will now introduce an N-gram-based text-classification algorithm. An N-gram is an N-character slice of a longer string; for example, the trigrams of the padded string _TEXT_ include _TE, TEX, EXT, and XT_. The key point of this algorithm is the computation of N-gram frequency profiles.

Before introducing the algorithm, a couple of concepts that it adopts need to be illustrated:

[Figure: concepts used in the N-gram-based categorization algorithm]

The N-gram-based text categorization

The summarized pseudocode for the N-gram-based text-categorization algorithm is as follows:

[Figure: the N-gram-based text-categorization algorithm]

The N-gram frequency-generation algorithm is as follows:

[Figure: the N-gram frequency-generation algorithm]
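The full implementation is in ch_10_ngram_classifier.R (see the next section). As a rough standalone sketch of the profile-based approach in the spirit of Cavnar and Trenkle's scheme, the following code builds ranked N-gram frequency profiles and compares them with an out-of-place distance; the padding convention, the profile size, and the toy texts are assumptions for the example:

# Ranked character N-gram profile of a text (most frequent first)
ngram_profile <- function(text, n = 3, top = 300) {
  s <- paste0("_", gsub("[^[:alnum:]]+", "_", tolower(text)), "_")
  grams <- substring(s, 1:(nchar(s) - n + 1), n:nchar(s))
  freq <- sort(table(grams), decreasing = TRUE)
  names(head(freq, top))
}

# Out-of-place distance between a document profile and a category profile
out_of_place <- function(doc, cat) {
  pos <- match(doc, cat)
  pos[is.na(pos)] <- length(cat) + 1   # absent N-grams get the maximum penalty
  sum(abs(pos - seq_along(doc)))
}

# Toy category profiles; real ones would be trained on many documents per topic
profiles <- list(
  sports  = ngram_profile("the team won the match and the fans cheered loudly"),
  finance = ngram_profile("the bank raised interest rates and markets reacted")
)
doc <- ngram_profile("rates and markets moved after the bank announcement")
names(which.min(sapply(profiles, out_of_place, doc = doc)))  # predicted topic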

The R implementation

Please take a look at the R code file ch_10_ngram_classifier.R from the code bundle for the preceding algorithms. The code can be tested with the following command:

> source("ch_10_ngram_classifier.R")

Web usage mining with web logs

Web usage mining denotes the discovery and analysis of patterns in web logs (such as system access logs) and transactions. The output is the relation between user interactions and resources on the Web, and user behavior can be identified from it. Web logs record the traces of web users' interactions with web servers, web proxy servers, and browsers.

The popular web usage mining process is illustrated in the following image; it includes three major steps: data collection and preprocessing, pattern discovery, and pattern analysis.

The preprocessing step contains data cleaning, session identification, and data conversion. Pattern discovery includes path analysis, association rules, sequential patterns, and clustering and classification rules.

[Figure: the web usage mining process]
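As a concrete example of the preprocessing step, the sketch below parses a few Common Log Format lines and identifies sessions with a 30-minute inactivity timeout; the log lines, the timeout value, and the session rule are assumptions made for illustration:

# Toy web-log lines in Common Log Format (host, timestamp, request, status, bytes)
log_lines <- c(
  '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
  '10.0.0.1 - - [10/Oct/2023:13:58:02 +0000] "GET /about.html HTTP/1.1" 200 512',
  '10.0.0.1 - - [10/Oct/2023:15:10:11 +0000] "GET /index.html HTTP/1.1" 200 2326',
  '10.0.0.2 - - [10/Oct/2023:13:57:00 +0000] "GET /prices.html HTTP/1.1" 200 1024'
)

# Extract host, timestamp, and requested URL with a regular expression
m <- regmatches(log_lines,
                regexec('^(\\S+) \\S+ \\S+ \\[([^]]+)\\] "\\S+ (\\S+)', log_lines))
logs <- data.frame(
  host = sapply(m, `[`, 2),
  time = as.POSIXct(strptime(sapply(m, `[`, 3),
                             "%d/%b/%Y:%H:%M:%S", tz = "UTC")),
  url  = sapply(m, `[`, 4)
)

# Session identification: a new session starts after 30 minutes of inactivity
logs <- logs[order(logs$host, logs$time), ]
new_sess <- ave(as.numeric(logs$time), logs$host,
                FUN = function(t) c(0, diff(t)) > 30 * 60)
logs$session <- paste(logs$host, ave(new_sess, logs$host, FUN = cumsum), sep = "-")
logs[, c("host", "url", "session")]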

The FCA-based association rule mining algorithm

The summarized pseudocode for the FCA (Formal Concept Analysis)-based association rule mining algorithm is as follows:

[Figure: the FCA-based association rule mining algorithm]
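The FCA pseudocode itself appears only in the figure, and the full implementation is in ch_10_fca.R (see the next section). As a neighboring, standard technique rather than the FCA algorithm itself, the sketch below mines association rules from invented page-visit sessions with the arules package's Apriori implementation; the sessions and the thresholds are assumptions for the example:

library(arules)  # provides apriori() and the transactions class

# Invented sessions: each element is the set of pages visited in one session
sessions <- list(c("/index", "/products", "/cart"),
                 c("/index", "/products"),
                 c("/index", "/about"),
                 c("/products", "/cart", "/checkout"))
trans <- as(sessions, "transactions")

# Mine rules such as {/products} => {/cart} above support/confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6, minlen = 2))
inspect(sort(rules, by = "confidence"))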

The R implementation

Please take a look at the R code file ch_10_fca.R from the code bundle for the preceding algorithms. The code can be tested with the following command:

> source("ch_10_fca.R")

Time for action

Here are some questions to check whether you have understood the concepts:

  1. What are the characteristics of text mining?
  2. What is the difference between text mining and data mining?
  3. What is the difference between web mining and data mining?

Summary

In this chapter, we looked at text mining, which finds new or previously unknown information by extracting it from a document dataset. We also looked at how text summarization presents a condensed result of the document set under study: it takes a source, extracts its content, and presents the key content in a condensed format that is sensitive to the end user's needs. We saw how genre classification discriminates between documents by form, style, and targeted system or audience. We also covered the question answering system, topic detection, and web mining.

In this book, many useful data-mining algorithms have been illustrated in the R language, which has been around for years, even decades. The most popular algorithms are included with detailed descriptions. You can now build on this knowledge structure of classical and living data-mining algorithms and the solutions built with R.