Learning Data Mining with R

By: Bater Makhabel

Overview of this book

Dealing with the array of problems that you may encounter during complex statistical projects can be difficult. If you have only a basic knowledge of R, this book will provide you with the skills and knowledge to create and customize the most popular data mining algorithms and overcome these difficulties.

You will learn how to manipulate data with R using code snippets and be introduced to mining frequent patterns, associations, and correlations while working with R programs. Discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on RHadoop projects. You will finish this book feeling confident in your ability to know which data mining algorithm to apply in any situation.

Credits

About the Author

Acknowledgments

About the Reviewers

www.PacktPub.com

Preface

Warming Up

Big data

Data source

Data mining

Social network mining

Why R?

Data attributes and description

Data cleaning

Data integration

Data dimension reduction

Data transformation and discretization

Visualization of results

Time for action

Summary

Mining Frequent Patterns, Associations, and Correlations

An overview of associations and patterns

Market basket analysis

Hybrid association rules mining

Mining sequence dataset

The R implementation

High-performance algorithms

Time for action

Summary

Classification

Generic decision tree induction

High-value credit card customers classification using ID3

Web spam detection using C4.5

Web key resource page judgment using CART

Trojan traffic identification method and Bayes classification

Identify spam e-mail and Naïve Bayes classification

Rule-based classification of player types in computer games and rule-based classification

Time for action

Summary

Advanced Classification

Ensemble (EM) methods

Biological traits and the Bayesian belief network

Protein classification and the k-Nearest Neighbors algorithm

Document retrieval and Support Vector Machine

Classification using frequent patterns

Classification using the backpropagation algorithm

Time for action

Summary

Cluster Analysis

Search engines and the k-means algorithm

Automatic abstraction of document texts and the k-medoids algorithm

The CLARA algorithm

CLARANS

Unsupervised image categorization and affinity propagation clustering

News categorization and hierarchical clustering

Time for action

Summary

Advanced Cluster Analysis

Customer categorization analysis of e-commerce and DBSCAN

Clustering web pages and OPTICS

Visitor analysis in the browser cache and DENCLUE

Recommendation system and STING

Web sentiment analysis and CLIQUE

Opinion mining and WAVE clustering

User search intent and the EM algorithm

Customer purchase data analysis and clustering high-dimensional data

SNS and clustering graph and network data

Time for action

Summary

Outlier Detection

Credit card fraud detection and statistical methods

Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods

Intrusion detection and density-based methods

Intrusion detection and clustering-based methods

Monitoring the performance of the web server and classification-based methods

Detecting novelty in text, topic detection, and mining contextual outliers

Collective outliers on spatial data

Outlier detection in high-dimensional data

Time for action

Summary

Mining Stream, Time-series, and Sequence Data

The credit card transaction flow and STREAM algorithm

Predicting future prices and time-series analysis

Stock market data and time-series clustering and classification

Web click streams and mining symbolic sequences

Mining sequence patterns in transactional databases

Time for action

Summary

Graph Mining and Network Analysis

Graph mining

Mining frequent subgraph patterns

Social network mining

Time for action

Summary

Mining Text and Web Data

Text mining and TM packages

Text summarization

The question answering system

Genre categorization of web pages

Categorizing newspaper articles and newswires into topics

Web usage mining with web logs

Time for action

Summary

Algorithms and Data Structures

Index

Big data

Big data is a large amount of data that does not fit in the memory of a single machine. In other words, the size of the data itself becomes part of the problem when studying it. Besides volume, the two other major characteristics of big data are variety and velocity; together these are the famous three Vs of big data. Velocity means the data processing rate, that is, how fast the data is being processed. Variety denotes the range of data source types. Noise arises more frequently in big data sources and affects the mining results, which calls for efficient data preprocessing algorithms.
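
As a small illustration of data that is processed without being held in memory all at once, the following sketch computes a mean by reading a file in fixed-size chunks. The file, the `chunked_mean` helper, and the chunk size are all hypothetical and chosen only for this example; real big data workloads would rely on distributed tooling rather than a single loop.

```r
# Hypothetical data: write 10,000 numeric values to a temporary CSV file.
path <- tempfile(fileext = ".csv")
set.seed(42)
values <- round(runif(10000, min = 0, max = 100), 2)
writeLines(c("value", as.character(values)), path)

# Compute the mean while holding at most `chunk_size` lines in memory.
chunked_mean <- function(path, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1L)              # skip the header line
  total <- 0
  count <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break    # end of file reached
    x <- as.numeric(lines)
    total <- total + sum(x)           # running sum across chunks
    count <- count + length(x)        # running count across chunks
  }
  total / count
}

result <- chunked_mean(path)
```

Only a running sum and a running count survive between chunks, so memory use depends on the chunk size rather than on the size of the file.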

As a result, distributed filesystems are used as tools for implementing parallel algorithms on large amounts of data, and it is certain that we will get even more data with each passing second. Data analytics and visualization techniques are the primary factors in data mining tasks related to massive data. The characteristics of massive data call for new platforms for data mining techniques, one of which is RHadoop. We'll describe it in a later section.

Some data types that are important to big data are as follows:

Video from cameras, which carries more metadata for analysis and helps expedite crime investigations, enhance retail analysis, support military intelligence, and so on.
Data from embedded sensors, such as medical sensors that monitor for potential virus outbreaks.
Entertainment data, that is, information freely published through social media by anyone.
Consumer images aggregated from social media, for which tagging is important.

Here is a table illustrating the history of data size growth. It shows that the amount of information will more than double every two years, changing the way researchers and companies manage data and extract value from it through data mining techniques, and opening up new data mining studies.

Year	Data Sizes	Comments
N/A		1 MB (Megabyte) = 2^20 bytes. The human brain holds about 200 MB of information.
N/A		1 PB (Petabyte) = 2^50 bytes. It is similar to the size of 3 years' observation data for Earth by NASA and is equivalent to 70.8 times the books in America's Library of Congress.
1999	1 EB	1 EB (Exabyte) = 2^60 bytes. The world produced 1.5 EB of unique information.
2007	281 EB	The world produced about 281 exabytes of unique information.
2011	1.8 ZB	1 ZB (Zettabyte) = 2^70 bytes. This is all the data gathered by human beings in 2011.
Very soon		1 YB (Yottabyte) = 2^80 bytes.
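
The growth claim can be turned into a rough calculation. The sketch below assumes binary unit sizes (1 MB = 2^20 bytes, and so on) and assumes that data volume exactly doubles every two years starting from the 1.8 ZB figure for 2011; both assumptions, along with the `projected_zb` helper, are illustrative and not taken from the table's sources.

```r
# Unit sizes in bytes (assumed binary convention).
units <- c(MB = 2^20, PB = 2^50, EB = 2^60, ZB = 2^70, YB = 2^80)

# Starting point taken from the table: about 1.8 ZB in 2011.
start_year  <- 2011
start_bytes <- 1.8 * units[["ZB"]]

# Assumption: volume exactly doubles every two years.
doubling_period <- 2

# Projected volume (in ZB) for a given year under that assumption.
projected_zb <- function(year) {
  start_bytes * 2^((year - start_year) / doubling_period) / units[["ZB"]]
}

# Years needed to grow from 1.8 ZB to 1 YB under that assumption.
years_to_yb <- doubling_period * log2(units[["YB"]] / start_bytes)

projected_zb(2015)      # 7.2 (two doublings of 1.8 ZB)
round(years_to_yb, 1)   # 18.3
```

Under exact doubling, the yottabyte mark would be reached roughly 18 years after 2011. Since the table's claim is that information more than doubles every two years, this estimate is an upper bound on the time needed.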

Scalability and efficiency

Efficiency, scalability, performance, optimization, and the ability to perform in real time are important issues for almost any algorithm, and data mining is no exception. They are always necessary metrics or benchmark factors for data mining algorithms.

As the amount of data continues to grow, data mining algorithms must remain efficient and scalable in order to extract information from the massive datasets held in many data repositories or arriving in data streams.
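
One simple way to observe scalability is to time the same operation at growing input sizes. The following is a minimal sketch that uses sorting as a stand-in for a data mining step; the input sizes are arbitrary, and the timings depend entirely on the machine, so no expected values are shown.

```r
# Input sizes to benchmark (arbitrary choices for illustration).
sizes <- c(1e4, 1e5, 1e6)

set.seed(1)
elapsed <- vapply(sizes, function(n) {
  x <- runif(n)                         # synthetic input of size n
  system.time(sort(x))[["elapsed"]]     # wall-clock seconds for one run
}, numeric(1))

# Tabulate input size against measured time.
timings <- data.frame(n = sizes, seconds = elapsed)
print(timings)
```

Comparing how the measured time grows against how the input grows gives a first empirical impression of an algorithm's scalability.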

The shift in data storage from a single machine to wide distribution, the huge size of many datasets, and the computational complexity of data mining methods are all factors that drive the development of parallel and distributed data-intensive mining algorithms.
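
The idea behind such algorithms can be sketched on a single machine with R's bundled `parallel` package: split the data into partitions, compute partial results on separate workers, and then combine them. The data, the number of partitions, and the number of cores below are assumptions made for illustration; this is not how RHadoop itself is invoked.

```r
library(parallel)

# Synthetic data standing in for a large dataset.
set.seed(7)
x <- rnorm(1e5)

# Split the data into 4 partitions, as a distributed store might.
partitions <- split(x, rep_len(1:4, length(x)))

# mclapply forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Map step: each partition reports its partial sum and element count.
partials <- mclapply(
  partitions,
  function(p) c(sum = sum(p), n = length(p)),
  mc.cores = cores
)

# Reduce step: add the partial results and derive the global mean.
combined <- Reduce(`+`, partials)
parallel_mean <- combined[["sum"]] / combined[["n"]]
```

Each worker sees only its own partition, and only two numbers per partition travel back to be combined. This is what lets the same pattern extend from several cores to several machines.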
