R Data Mining

Overview of this book

R is widely used to apply data mining techniques across many industries, including finance, medicine, and scientific research. This book will empower you to produce and present impressive analyses from data by selecting and implementing the appropriate data mining techniques in R. You will gain these skills while immersed in a one-of-a-kind data mining crime case, in which you are asked to help resolve a real fraud case affecting a commercial company by means of both basic and advanced data mining techniques. As you move along the plot of the story, you will learn and practice, on real data, the various R packages commonly employed for this kind of task. You will also get the chance to apply some of the most popular and effective data mining models and algorithms, from basic multiple linear regression to the more advanced support vector machines. Unlike other data mining learning resources, this book will expose you to the theory behind these models, their relevant assumptions, and when they can be applied to the data you are facing. By the end of the book, you will hold a new and powerful toolbox of instruments, knowing exactly when and how to employ each of them to solve your data mining problems and get the most out of your data. Finally, to help you get the most from the concepts described and from the learning process, the book comes packed with a reproducible bundle of commented R scripts and a practical set of data mining model cheat sheets.

R's points of strength


You know that R is really popular, but why? R is not the only data analysis language out there, and neither is it the oldest one; so why is it so popular?

Looking at the root causes of R's popularity, we definitely have to mention these three:

  • Open source inside
  • Plugin ready
  • Data visualization friendly

Open source inside

One of the main reasons the adoption of R is spreading is its open source nature. R source code is available for everyone to download, modify, and share back again (only in an open source way). Technically, R is released under the GNU General Public License, meaning that you can take it and use it for whatever purpose, but any derivative work you distribute has to be released under the GNU General Public License as well.

These attributes fit well for almost every target user of a statistical analysis language:

  • Academic user: Knowledge sharing is a must for an academic environment, and having the ability to share work without the worry of copyright and license questions makes R very practical for academic research purposes
  • Business user: Companies are always worried about budget constraints; having professional statistical analysis software at their disposal for free sounds like a dream come true
  • Private user: This user enjoys both of the benefits already mentioned, because they will find it great to have a free instrument with which to learn and share their own statistical analyses

Plugin ready

You could imagine the R language as an expandable board game. You know, games like 7 Wonders or Carcassonne, with a base set of characters and places and further optional places and characters, increasing the choices at your disposal and maximizing the fun. The R language can be compared to this kind of game.

There is a base version of R, containing a group of default packages that are delivered along with the standard version of the software (you can skip to the Installing R and writing R code section for more on how to obtain and install it). The functionalities available through the base version are mainly related to filesystem manipulation, statistical analysis, and data visualization.
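As a rough illustration of those three areas, the following sketch relies only on packages shipped with a standard R installation. The temporary file and the choice of the built-in mtcars dataset are assumptions made for this example, not something prescribed by the text.

```r
# Packages attached by default in a fresh R session
default_pkgs <- getOption("defaultPackages")
print(default_pkgs)

# Filesystem manipulation: write a data frame to a temporary CSV and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)
cars <- read.csv(tmp)

# Statistical analysis: simple linear regression of fuel economy on weight
fit <- lm(mpg ~ wt, data = cars)
print(coef(fit))

# Data visualization: scatter plot with the fitted regression line
plot(cars$wt, cars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)

# Clean up the temporary file
unlink(tmp)
```

None of these calls requires installing anything beyond the base distribution.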

While this base version is regularly maintained and updated by the R core team, virtually any R user can add new functionalities to those available within the base version by developing and sharing custom packages.

This is basically how the package development and sharing flow works:

  1. The R user develops a new package, for example a package introducing a new machine learning algorithm presented in a freshly published academic paper.
  2. The user submits the package to the CRAN repository or a similar repository. The Comprehensive R Archive Network (CRAN) is the official repository for R-related documents and packages. 
  3. Every R user can gain access to the additional features introduced with any given package by installing and loading it into their R environment. If the package has been submitted to CRAN, installing and loading it takes just the following two lines of R code (similar commands are available for alternative repositories such as Bioconductor):
install.packages("ggplot2")
library(ggplot2)

As you can see, this is a really convenient and effective way to expand R functionalities, and you will soon see how wide the range of functionalities added through additional packages developed by R users is.
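In scripts that are run repeatedly, you may prefer not to reinstall a package on every run. The following is a minimal sketch of one way to do that; ensure_package is a hypothetical helper name introduced here for illustration, built on the base functions requireNamespace() and install.packages().

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Downloads from the configured repository (CRAN by default)
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so nothing is downloaded here
ok <- ensure_package("stats")
print(ok)

# For a CRAN package, the call would look like this:
# ensure_package("ggplot2")
# library(ggplot2)
```

The check comes first so that a machine with the package already installed never touches the network.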

At the time of writing, more than 9,000 packages are available on CRAN, and this number is sure to increase further, making more and more additional features available to the R community.

Data visualization friendly

As a discipline, data visualization encompasses all of the principles and techniques that can be employed to effectively display the information and messages contained within a set of data.

Since we are living in an information-heavy age, the ability to effectively and concisely communicate articulated and complex messages through data visualization is a core asset for any professional. This is exactly why R is experiencing a great response in academic and professional fields: the data visualization capabilities of R place it at the cutting edge of these fields.

R has been noticed for its amazing data visualization features right from its beginning; when some of its peers still drew x axes by stringing together + signs, R was already able to produce astonishing 3D plots. Nevertheless, a major improvement of R as a data visualization tool came when Auckland's Hadley Wickham developed the well-known ggplot2 package based on The Grammar of Graphics, introducing into the R world an organic framework for data visualization tasks.

This package alone introduced the R community to a highly flexible way of producing almost every kind of data visualization. It was also designed as an expandable tool, so that new data visualization techniques can be incorporated as soon as they emerge. Finally, ggplot2 lets you customize your plot extensively, adding every kind of graphical or textual annotation to it.
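To give a feel for this layered approach, here is a small sketch that assumes ggplot2 is already installed. The dataset (the built-in mtcars), the label text, and the annotation position are arbitrary choices made for this example.

```r
library(ggplot2)

# Start from data and aesthetic mappings, then add layers one by one
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                  # layer 1: raw observations
  geom_smooth(method = "lm", se = FALSE) +        # layer 2: fitted linear trend
  annotate("text", x = 4.5, y = 30,
           label = "Heavier cars, lower mpg") +   # layer 3: textual annotation
  labs(title = "Fuel economy versus weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")

# Render the plot on the active graphics device
print(p)
```

Each + adds an independent layer or setting, which is what makes it easy to extend a plot without rewriting it.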

Nowadays, R is being used by the biggest tech companies, such as Facebook and Google, and by widely circulated publications such as The Economist and The New York Times to visualize their data and convey their information to their stakeholders and readers.

To sum all this up: should you invest your precious time learning R? If you are a professional or a student who could gain advantages from knowing effective and cutting-edge techniques to manipulate, model, and present data, I can only give you a positive opinion: yes. You should definitely learn R, and consider it a long-term investment, since the points of strength we have seen place it in a great position to further expand its influence in the coming years in every industry and academic field.