Book Image

R: Mining spatial, text, web, and social media data

By : Nathan H. Danneman, Richard Heimann, Pradeepta Mishra, Bater Makhabel
Book Image

R: Mining spatial, text, web, and social media data

By: Nathan H. Danneman, Richard Heimann, Pradeepta Mishra, Bater Makhabel

Overview of this book

Data mining is the first step to understanding data and making sense of heaps of data. Properly mined data forms the basis of all data analysis and computing performed on it. This learning path will take you from the very basics of data mining to advanced data mining techniques, and will end up with a specialized branch of data mining—social media mining. You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, association, and correlation while working with R programs. You will discover how to write code for various predication models, stream data, and time-series data. You will also be introduced to solutions written in R based on R Hadoop projects. Now that you are comfortable with data mining with R, you will move on to implementing your knowledge with the help of end-to-end data mining projects. You will learn how to apply different mining concepts to various statistical and data applications in a wide range of fields. At this stage, you will be able to complete complex data mining cases and handle any issues you might encounter during projects. After this, you will gain hands-on experience of generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially-generated data while providing a theoretical background to accurately interpret your findings. You will be shown R code and examples of data that can be used as a springboard as you get the chance to undertake your own analyses of business, social, or political data. This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products: ? Learning Data Mining with R by Bater Makhabel ? R Data Mining Blueprints by Pradeepta Mishra ? Social Media Mining with R by Nathan Danneman and Richard Heimann
Table of Contents (6 chapters)

Chapter 8. Mining Stream, Time-series, and Sequence Data

In this chapter, you will learn how to write mining codes for stream data, time-series data, and sequence data.

The characteristics of stream, time-series, and sequence data are unique, that is, large and endless. It is too large to get an exact result; this means an approximate result will be achieved. The classic data-mining algorithm should be extended, or a new algorithm needs to be designed for this type of the dataset.

In relation to the mining of stream, time-series, and sequence data, there are some topics we can't avoid. They are association, frequent pattern, classification and clustering algorithms, and so on. In the following sections, we will go through these major topics.

In this chapter, we will cover the following topics;

  • The credit card transaction flow and STREAM algorithm
  • Predicting future prices and time-series analysis
  • Stock market data and time-series clustering and classification
  • Web click streams and mining symbolic sequences
  • Mining sequence patterns in transactional databases

The credit card transaction flow and STREAM algorithm

As we mentioned in the previous chapters, one kind of data source always requires a variety of predefined algorithms or a brand new algorithm to deal with. Streaming data behaves a bit different from a traditional dataset.

The streaming dataset comes from various sources in modern life, such as credit record transaction stream, web feeds, phone-call records, sensor data from a satellite or radar, network traffic data, a security event's stream, and a long running list of various data streams.

The targets to stream data processing are, and not limited to, summarization of the stream to specific extents.

With the characteristics of streaming data, the typical architecture to stream a management system is illustrated in the following diagram:

The credit card transaction flow and STREAM algorithm

The STREAM algorithm is a classical algorithm used to cluster stream data. In the next section, the details are present and explained by R code.

The STREAM algorithm

The summarized pseudocode of the STREAM algorithm are as follows:

The STREAM algorithm

In the preceding algorithm, LOCALSEARCH is a revised k-median algorithm.

The STREAM algorithm
The STREAM algorithm
The STREAM algorithm

The single-pass-any-time clustering algorithm

Here is a clustering algorithm designed to deal with a high-frequency news stream:

The single-pass-any-time clustering algorithm

The R implementation

Please take a look at the R codes file ch_08_stream.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_08_stream.R")

The credit card transaction flow

Day by day, the growing e-commerce market is driving the growth of the usage of credit card, which, in turn, is bringing in a large number of transaction streams. Fraudulent usage of credit cards happens every day; we want the algorithm to detect this kind of transaction in a very short time compared to the big volume of transactions. Besides this, the requirement to find out the valuable customers by an analysis of the transaction streams is becoming more and more important. However, it is harder to get valid information in a very short time in response to the advertisement needs, such as recommend the necessary finance services or goods to these customers.

The credit card transaction flow is also the process to generate stream data, and the related stream-mining algorithm can be applied with high accuracy.

One application for the credit card transaction flow mining is the behavior analysis of the consumers. With each transaction record that is stored, various costs or purchases of a certain person can be tracked. With the mining of such transaction records, we can provide an analysis to the credit card holder to help them keep financial balance or other financial target. We can also provide this analysis to the card issuer that published the credit card to create a new business, such as funding or loaning, or to the retailers (the merchants) such as Walmart to help in the arrangement of appropriate goods.

Another application for the credit card transaction flow mining is fraud detection. It is the most obvious one among tons of applications. Using this solution, the card issuer or the bank can reduce the rate of successful fraud.

The dataset of the credit card transaction includes the owner of cards, place of consumption, date, cost, and so on.

Predicting future prices and time-series analysis

Auto Regressive Integrated Moving Average (ARIMA) is a classic algorithm to analyze time–series data. As the initial step, an ARIMA model will be chosen to model the time-series data. Assuming that Predicting future prices and time-series analysis is a time-series variable such as a price, the formula is defined as follows, and it includes the main features of the variable:

Predicting future prices and time-series analysis
Predicting future prices and time-series analysis
Predicting future prices and time-series analysis
Predicting future prices and time-series analysis

The ARIMA algorithm

The summarized steps for the ARIMA algorithm is as follows:

  1. A class of models is formulated assuming certain hypotheses.
  2. A model is identified for the observed data.
  3. The model parameters are estimated.
  4. If the hypotheses of the model are validated, proceed with this step; otherwise, go to step 1 to refine the model. The model is now ready for forecasting.

Predicting future prices

Predicting future prices is one subproblem of predicting the future; this is just a sign of the difficulty of this issue. Another similar issue under this domain is estimating the future market demands.

The stock market is a good example of predicting the future price; here, the prices changes along with time. Predicting future prices helps estimate the equity returns in the future and also helps in deciding the timing for financial investors, such as when to buy/sell.

The price graph shows an oscillation. Many factors can affect this value, even the psycology of humans.

The key problem in this prediction is the huge data volume. As a direct result, the algorithms to be used for this topic need to be efficient. The other key problem is that the price might change dramatically in a short time. Also, the complexity of the market is definitely a big problem.

The data instances for the future price prediction include quantitative attributes such as technical factors, and they also include macroeconomics, microeconomics, political events, and investor expectations. Moreover, they include the domain expert's knowledge.

All the solutions based on the aggregate disperse information from the preselected factors.

Price is obviously affected by time, and is a time-series variable too. To forecast the future value, the time-series-analysis algorithm will apply. ARIMA can be used to forecast the prices of the next day.

One classical solution is to determine the trends of the market at a very early stage. Another solution is fuzzy solutions, which is also a good choice because of the huge volume of dataset and wide range of factors affecting the prices.

The ARIMA algorithm

The summarized steps for the ARIMA algorithm is as follows:

  1. A class of models is formulated assuming certain hypotheses.
  2. A model is identified for the observed data.
  3. The model parameters are estimated.
  4. If the hypotheses of the model are validated, proceed with this step; otherwise, go to step 1 to refine the model. The model is now ready for forecasting.

Predicting future prices

Predicting future prices is one subproblem of predicting the future; this is just a sign of the difficulty of this issue. Another similar issue under this domain is estimating the future market demands.

The stock market is a good example of predicting the future price; here, the prices changes along with time. Predicting future prices helps estimate the equity returns in the future and also helps in deciding the timing for financial investors, such as when to buy/sell.

The price graph shows an oscillation. Many factors can affect this value, even the psycology of humans.

The key problem in this prediction is the huge data volume. As a direct result, the algorithms to be used for this topic need to be efficient. The other key problem is that the price might change dramatically in a short time. Also, the complexity of the market is definitely a big problem.

The data instances for the future price prediction include quantitative attributes such as technical factors, and they also include macroeconomics, microeconomics, political events, and investor expectations. Moreover, they include the domain expert's knowledge.

All the solutions based on the aggregate disperse information from the preselected factors.

Price is obviously affected by time, and is a time-series variable too. To forecast the future value, the time-series-analysis algorithm will apply. ARIMA can be used to forecast the prices of the next day.

One classical solution is to determine the trends of the market at a very early stage. Another solution is fuzzy solutions, which is also a good choice because of the huge volume of dataset and wide range of factors affecting the prices.

Predicting future prices

Predicting future prices is one subproblem of predicting the future; this is just a sign of the difficulty of this issue. Another similar issue under this domain is estimating the future market demands.

The stock market is a good example of predicting the future price; here, the prices changes along with time. Predicting future prices helps estimate the equity returns in the future and also helps in deciding the timing for financial investors, such as when to buy/sell.

The price graph shows an oscillation. Many factors can affect this value, even the psycology of humans.

The key problem in this prediction is the huge data volume. As a direct result, the algorithms to be used for this topic need to be efficient. The other key problem is that the price might change dramatically in a short time. Also, the complexity of the market is definitely a big problem.

The data instances for the future price prediction include quantitative attributes such as technical factors, and they also include macroeconomics, microeconomics, political events, and investor expectations. Moreover, they include the domain expert's knowledge.

All the solutions based on the aggregate disperse information from the preselected factors.

Price is obviously affected by time, and is a time-series variable too. To forecast the future value, the time-series-analysis algorithm will apply. ARIMA can be used to forecast the prices of the next day.

One classical solution is to determine the trends of the market at a very early stage. Another solution is fuzzy solutions, which is also a good choice because of the huge volume of dataset and wide range of factors affecting the prices.

Stock market data and time-series clustering and classification

Time-series clustering has been proven to provide effective information for further research. In contrast to the classic clustering, the time-series dataset comprises data changed with time.

Stock market data and time-series clustering and classification

The hError algorithm

This algorithm is denoted as "clustering seasonality patterns in the presence of errors"; the summarized algorithm is listed in the following figure. The major characteristic of this algorithm is to introduce a specific distance function and a dissimilarity function.

The hError algorithm

Time-series classification with the 1NN classifier

The summarized pseudocode for the 1NN classifier algorithm is as follows:

Time-series classification with the 1NN classifier

The R implementation

Please take a look at the R codes file ch_08_herror.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_08_herror.R")

Stock market data

Stock market is a complex dynamic system, and many factors can affect the market. For example, breaking financial news is also a key factor for this topic.

The characteristics of stock market data are large volume (near infinite), real time, high dimensions, and high complexity. The various data streams from stock markets go along with many events and correlations.

Stock market sentiment analysis is one topic related to this domain. The stock market is too complex with many factors that can affect the market. People's opinion or sentiment is one of the major factors.

The real-time information needs from the stock market require the fast, efficient online-mining algorithms. The time-series data accounts for various data including the stock market data that updates along with time.

To predict the stock market, the past data is really important. Using the past return on certain stock, the future price of that stock can be predicted based on the price-data stream.

The hError algorithm

This algorithm is denoted as "clustering seasonality patterns in the presence of errors"; the summarized algorithm is listed in the following figure. The major characteristic of this algorithm is to introduce a specific distance function and a dissimilarity function.

The hError algorithm

Time-series classification with the 1NN classifier

The summarized pseudocode for the 1NN classifier algorithm is as follows:

Time-series classification with the 1NN classifier

The R implementation

Please take a look at the R codes file ch_08_herror.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_08_herror.R")

Stock market data

Stock market is a complex dynamic system, and many factors can affect the market. For example, breaking financial news is also a key factor for this topic.

The characteristics of stock market data are large volume (near infinite), real time, high dimensions, and high complexity. The various data streams from stock markets go along with many events and correlations.

Stock market sentiment analysis is one topic related to this domain. The stock market is too complex with many factors that can affect the market. People's opinion or sentiment is one of the major factors.

The real-time information needs from the stock market require the fast, efficient online-mining algorithms. The time-series data accounts for various data including the stock market data that updates along with time.

To predict the stock market, the past data is really important. Using the past return on certain stock, the future price of that stock can be predicted based on the price-data stream.

Time-series classification with the 1NN classifier

The summarized pseudocode for the 1NN classifier algorithm is as follows:

Time-series classification with the 1NN classifier

The R implementation

Please take a look at the R codes file ch_08_herror.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_08_herror.R")

Stock market data

Stock market is a complex dynamic system, and many factors can affect the market. For example, breaking financial news is also a key factor for this topic.

The characteristics of stock market data are large volume (near infinite), real time, high dimensions, and high complexity. The various data streams from stock markets go along with many events and correlations.

Stock market sentiment analysis is one topic related to this domain. The stock market is too complex with many factors that can affect the market. People's opinion or sentiment is one of the major factors.

The real-time information needs from the stock market require the fast, efficient online-mining algorithms. The time-series data accounts for various data including the stock market data that updates along with time.

To predict the stock market, the past data is really important. Using the past return on certain stock, the future price of that stock can be predicted based on the price-data stream.

The R implementation

Please take a look at the R codes file ch_08_herror.R from the bundle of R codes for previously mentioned algorithms. The codes can be tested with the following command:

> source("ch_08_herror.R")

Stock market data

Stock market is a complex dynamic system, and many factors can affect the market. For example, breaking financial news is also a key factor for this topic.

The characteristics of stock market data are large volume (near infinite), real time, high dimensions, and high complexity. The various data streams from stock markets go along with many events and correlations.

Stock market sentiment analysis is one topic related to this domain. The stock market is too complex with many factors that can affect the market. People's opinion or sentiment is one of the major factors.

The real-time information needs from the stock market require the fast, efficient online-mining algorithms. The time-series data accounts for various data including the stock market data that updates along with time.

To predict the stock market, the past data is really important. Using the past return on certain stock, the future price of that stock can be predicted based on the price-data stream.

Stock market data

Stock market is a complex dynamic system, and many factors can affect the market. For example, breaking financial news is also a key factor for this topic.

The characteristics of stock market data are large volume (near infinite), real time, high dimensions, and high complexity. The various data streams from stock markets go along with many events and correlations.

Stock market sentiment analysis is one topic related to this domain. The stock market is too complex with many factors that can affect the market. People's opinion or sentiment is one of the major factors.

The real-time information needs from the stock market require the fast, efficient online-mining algorithms. The time-series data accounts for various data including the stock market data that updates along with time.

To predict the stock market, the past data is really important. Using the past return on certain stock, the future price of that stock can be predicted based on the price-data stream.

Web click streams and mining symbolic sequences

Web click streams data is large and continuously emerging, with hidden trends buried and to be discovered for various usages, such as recommendations. TECNO-STREAMS (Tracking Evolving Clusters in NOisy Streams) is a one-pass algorithm.

The TECNO-STREAMS algorithm

The whole algorithm is modeled on the following equations: the robust weight or activation function (1), influence zone (2), pure simulation (3), optimal scale update (4), incremental update of pure simulation and optimal update (5) and (6), the simulation and scale values (7), and finally, the D-W-B-cell update equations (8).

The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm

The similarity measures applied in the learning phase are defined here:

The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm

The similarity measures applied in the validation phase are defined here:

The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm

The summarized pseudocode for the TECNO-STREAMS algorithm is as follows:

The TECNO-STREAMS algorithm

The R implementation

Please take a look at the R codes file ch_08_tecno_stream.R from the bundle of R codes for previous algorithm. The codes can be tested with the following command:

> source("ch_08_tecno_stream.R")

Web click streams

The web click streams denote the user's behavior when visiting the site, especially for e-commerce sites and CRM (Customer Relation Management). The analysis of web click streams will improve the user experience of the customer and optimize the structure of the site to meet the customers' expectation and, finally, increase the income of the site.

In other aspects, web click streams mining can be used to detect DoS attacks, track the attackers, and prevent these on the Web in advance.

The dataset for web click stream is obviously the click records that get generated when the user visits various sites. The major characteristics of this dataset are that it is huge, and the size goes on increasing.

The TECNO-STREAMS algorithm

The whole algorithm is modeled on the following equations: the robust weight or activation function (1), influence zone (2), pure simulation (3), optimal scale update (4), incremental update of pure simulation and optimal update (5) and (6), the simulation and scale values (7), and finally, the D-W-B-cell update equations (8).

The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm

The similarity measures applied in the learning phase are defined here:

The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm

The similarity measures applied in the validation phase are defined here:

The TECNO-STREAMS algorithm
The TECNO-STREAMS algorithm

The summarized pseudocode for the TECNO-STREAMS algorithm is as follows:

The TECNO-STREAMS algorithm

The R implementation

Please take a look at the R codes file ch_08_tecno_stream.R from the bundle of R codes for previous algorithm. The codes can be tested with the following command:

> source("ch_08_tecno_stream.R")

Web click streams

The web click streams denote the user's behavior when visiting the site, especially for e-commerce sites and CRM (Customer Relation Management). The analysis of web click streams will improve the user experience of the customer and optimize the structure of the site to meet the customers' expectation and, finally, increase the income of the site.

In other aspects, web click streams mining can be used to detect DoS attacks, track the attackers, and prevent these on the Web in advance.

The dataset for web click stream is obviously the click records that get generated when the user visits various sites. The major characteristics of this dataset are that it is huge, and the size goes on increasing.

The R implementation

Please take a look at the R codes file ch_08_tecno_stream.R from the bundle of R codes for previous algorithm. The codes can be tested with the following command:

> source("ch_08_tecno_stream.R")

Web click streams

The web click streams denote the user's behavior when visiting the site, especially for e-commerce sites and CRM (Customer Relation Management). The analysis of web click streams will improve the user experience of the customer and optimize the structure of the site to meet the customers' expectation and, finally, increase the income of the site.

In other aspects, web click streams mining can be used to detect DoS attacks, track the attackers, and prevent these on the Web in advance.

The dataset for web click stream is obviously the click records that get generated when the user visits various sites. The major characteristics of this dataset are that it is huge, and the size goes on increasing.

Web click streams

The web click streams denote the user's behavior when visiting the site, especially for e-commerce sites and CRM (Customer Relation Management). The analysis of web click streams will improve the user experience of the customer and optimize the structure of the site to meet the customers' expectation and, finally, increase the income of the site.

In other aspects, web click streams mining can be used to detect DoS attacks, track the attackers, and prevent these on the Web in advance.

The dataset for web click stream is obviously the click records that get generated when the user visits various sites. The major characteristics of this dataset are that it is huge, and the size goes on increasing.

Mining sequence patterns in transactional databases

Mining sequence patterns can be thought of as association discovery over the temporal data or sequence dataset. Similarly, the classic pattern-mining algorithm should be extended or modified as per the sequence dataset's scenario.

The PrefixSpan algorithm

PrefixSpan is a frequent sequence-mining algorithm. The summarized pseudocode for the PrefixSpan algorithm is as follows:

The PrefixSpan algorithm

The R implementation

Please take a look at the R codes file ch_08_prefix_span.R from the bundle of R codes for the previous algorithm. The codes can be tested with the following command:

> source("ch_08_prefix_span.R")

The PrefixSpan algorithm

PrefixSpan is a frequent sequence-mining algorithm. The summarized pseudocode for the PrefixSpan algorithm is as follows:

The PrefixSpan algorithm

The R implementation

Please take a look at the R codes file ch_08_prefix_span.R from the bundle of R codes for the previous algorithm. The codes can be tested with the following command:

> source("ch_08_prefix_span.R")

The R implementation

Please take a look at the R codes file ch_08_prefix_span.R from the bundle of R codes for the previous algorithm. The codes can be tested with the following command:

> source("ch_08_prefix_span.R")

Time for action

Here are some questions for you to know whether you have understood the concepts:

  • What is a stream?
  • What is a time series?
  • What is a sequence?

Summary

In this chapter, we looked at stream mining, time-series analysis, mining symbolic sequences, and mining sequence patterns.

The next chapter will cover the major topics related to graph mining, algorithms, and some examples related to them.