Learning Data Mining with R

Learning Data Mining with R

By : Bater Makhabel

Buy this Book

Learning Data Mining with R

By: Bater Makhabel

Buy this Book

Overview of this book

<p>Being able to deal with the array of problems that you may encounter during complex statistical projects can be difficult. If you have only a basic knowledge of R, this book will provide you with the skills and knowledge to successfully create and customize the most popular data mining algorithms to overcome these difficulties.</p> <p>You will learn how to manipulate data with R using code snippets and be introduced to mining frequent patterns, association, and correlations while working with R programs. Discover how to write code for various predication models, stream data, and time-series data. You will also be introduced to solutions written in R based on RHadoop projects. You will finish this book feeling confident in your ability to know which data mining algorithm to apply in any situation.</p>

Learning Data Mining with R

Credits

About the Author

Acknowledgments

About the Reviewers

www.PacktPub.com

Preface

Free Chapter

Warming Up

Big data

Data source

Data mining

Social network mining

Why R?

Data attributes and description

Data cleaning

Data integration

Data dimension reduction

Data transformation and discretization

Visualization of results

Time for action

Summary

Mining Frequent Patterns, Associations, and Correlations

An overview of associations and patterns

Market basket analysis

Hybrid association rules mining

Mining sequence dataset

The R implementation

High-performance algorithms

Time for action

Summary

Classification

Generic decision tree induction

High-value credit card customers classification using ID3

Web spam detection using C4.5

Web key resource page judgment using CART

Trojan traffic identification method and Bayes classification

Identify spam e-mail and Naïve Bayes classification

Rule-based classification of player types in computer games and rule-based classification

Time for action

Summary

Advanced Classification

Ensemble (EM) methods

Biological traits and the Bayesian belief network

Protein classification and the k-Nearest Neighbors algorithm

Document retrieval and Support Vector Machine

Classification using frequent patterns

Classification using the backpropagation algorithm

Time for action

Summary

Cluster Analysis

Search engines and the k-means algorithm

Automatic abstraction of document texts and the k-medoids algorithm

The CLARA algorithm

CLARANS

Unsupervised image categorization and affinity propagation clustering

News categorization and hierarchical clustering

Time for action

Summary

Advanced Cluster Analysis

Customer categorization analysis of e-commerce and DBSCAN

Clustering web pages and OPTICS

Visitor analysis in the browser cache and DENCLUE

Recommendation system and STING

Web sentiment analysis and CLIQUE

Opinion mining and WAVE clustering

User search intent and the EM algorithm

Customer purchase data analysis and clustering high-dimensional data

SNS and clustering graph and network data

Time for action

Summary

Outlier Detection

Credit card fraud detection and statistical methods

Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods

Intrusion detection and density-based methods

Intrusion detection and clustering-based methods

Monitoring the performance of the web server and classification-based methods

Detecting novelty in text, topic detection, and mining contextual outliers

Collective outliers on spatial data

Outlier detection in high-dimensional data

Time for action

Summary

Mining Stream, Time-series, and Sequence Data

The credit card transaction flow and STREAM algorithm

Predicting future prices and time-series analysis

Stock market data and time-series clustering and classification

Web click streams and mining symbolic sequences

Mining sequence patterns in transactional databases

Time for action

Summary

Graph Mining and Network Analysis

Graph mining

Mining frequent subgraph patterns

Social network mining

Time for action

Summary

Mining Text and Web Data

Text mining and TM packages

Text summarization

The question answering system

Genre categorization of web pages

Categorizing newspaper articles and newswires into topics

Web usage mining with web logs

Time for action

Summary

Algorithms and Data Structures

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Preface

The necessity to handle many complex statistical analysis projects is hitting statisticians and analysts across the globe. Since there is an increasing interest in data analysis, R offers a free and open source environment that is perfect for both learning and deploying predictive modeling solutions in the real world. With its constantly growing community and plethora of packages, R offers functionality to deal with a truly vast array of problems.

It's been decades since the R programming language was born, and it has become eminent and well known not only within the community of scientists but also in the wider community of developers. It has grown into a powerful tool to help developers produce efficient and consistent source code for data-related tasks. The R development team and independent contributors have created good documentation, so getting started with R programming isn't that hard.

To go further, you can use packages from the official R website. If you want to continually improve your level of expertise, you might read through a set of books that have been published in last couple of years. You should always bear in mind that creating high-level, secure, and internationally compliant code is more complex than the first application created in the beginning.

This book is designed to help you deal with an array of problems that you may encounter during complex statistical projects, which can be difficult. Topics in this book will include learning how to manipulate data with R using code snippets, mining frequent patterns, association, and correlations while working with R programs. This book will also provide for those with only a basic knowledge of R the skills and knowledge to successfully create and customize the most popular data mining algorithms. This will help overcome difficulties encountered and will ensure the most effective use of the R programming language on data mining algorithm development through its rich set of publicly available packages.

Each chapter of this book is intended to stand on its own, so feel free to jump to any chapter where you feel you need to get more in-depth knowledge about a particular topic. If you feel you missed something major, go back and read the earlier chapters. They are constructed in a way to grow your knowledge piece by piece.

Discover how to write code for various predication models, stream data, and time-series data. You will also be introduced to solutions based on the MapReduce algorithm. You will finish this book feeling confident in the ability that you know which data mining algorithm to apply in which situation.

I enjoy working with the R programming language for versatile data mining tasks developments and researches, and I am really happy to share my enthusiasm and expertise with you to help you make use of the language more effectively and comfortably use data mining algorithm developments and applications.

What this book covers

Chapter 1, Warming Up, gives you the overview of data mining, the relation of data mining to machine learning, and statistics. It illustrates basic data mining terms such as data definition and preprocessing.

Chapter 2, Mining Frequent Patterns, Associations, and Correlations, contains advanced and interesting algorithms required to learn mining frequent patterns, association rules, and correlation rules when working with R programs.

Chapter 3, Classification, helps you learn the classic classification algorithms written in the R language, covering various classification algorithms for different types of datasets.

Chapter 4, Advanced Classification, teaches you more classification algorithms, such as the Bayesian Belief Network, SVM, and k-Nearest Neighbors algorithm.

Chapter 5, Cluster Analysis, helps you learn how to implement the popular and classic algorithms for clustering, such as k-means, CLARA, and spectral algorithms.

Chapter 6, Advanced Cluster Analysis, shows the implementation of advanced algorithms for clustering that are related to hot topics in current industries, including EM, CLIQUE, DBSCAN, and so on.

Chapter 7, Outlier Detection, demonstrates the classic and popular algorithms used to detect outliers in real-world cases.

Chapter 8, Mining Stream, Time-series, and Sequence Data, explains these three hot topics with the most popular, classic, and top-ranking algorithms.

Chapter 9, Graph Mining and Network Analysis, shows you the overview of graphs and social mining algorithms, along with other interesting topics.

Chapter 10, Mining Text and Web Data, helps you learn the popular algorithms applied in domains with interesting applications.

Appendix, Algorithms and Data Structures, contains a list of algorithms and data structures to help you on your data mining journey.

What you need for this book

Any modern PC with Windows, Linux, or Mac OS should be sufficient to run the code samples given in this book. All of the software used in the book is open source and freely available on the Web, at http://www.r-project.org/.

Who this book is for

This book is intended for budding data scientists, quantitative analysts, and software engineers with only basic exposure to R and statistics. This book assumes familiarity with only the very basics of R, such as the main data types, simple functions, and how to move data around. No prior experience with data mining packages is necessary. However, you should have basic understanding of data mining concepts and processes.

Even if you are brand new to data mining, you will be able to master both the basic and the advanced implementations of data mining algorithms. You will learn how to select and apply the appropriate algorithms from various data mining algorithms to some specific datasets out of most of the datasets available for the real world.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and explanations of their meanings.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Clicking on the Next button moves you to the next screen."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can also find the code files for this book at https://github.com/batermj/learning-data-mining-with-r.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Learning Data Mining with R

By : Bater Makhabel

Learning Data Mining with R

By: Bater Makhabel

Overview of this book

Related Content you might be interested in

Current Title:

Learning Data Mining with R

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Note

Tip

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions