R Data Mining

Overview of this book

R is widely used to apply data mining techniques across many industries, including finance, medicine, and scientific research. This book will help you produce and present impressive analyses from data by selecting and implementing the appropriate data mining techniques in R. You will gain these skills while immersed in a one-of-a-kind data mining crime case, in which you are asked to help resolve a real fraud case affecting a commercial company by means of both basic and advanced data mining techniques. As you follow the plot of the story, you will learn and practice, on real data, the R packages commonly employed for this kind of task. You will also get the chance to apply some of the most popular and effective data mining models and algorithms, from basic multiple linear regression to the more advanced support vector machines. Unlike other data mining learning resources, this book explains the theory behind these models, their relevant assumptions, and when they can be applied to the data you are facing. By the end of the book, you will hold a new and powerful toolbox of instruments and know exactly when and how to employ each of them to solve your data mining problems and get the most out of your data. Finally, to maximize your exposure to the concepts described and to support the learning process, the book comes with a reproducible bundle of commented R scripts and a practical set of data mining model cheat sheets.

R's weaknesses and how to overcome them

When you talk about R to an experienced technologist, they will probably raise two main objections to the language:

  • Its steep learning curve
  • Its difficulty in handling large datasets

You will soon discover that those are indeed the two main weaknesses of the language. Without pretending that R is a perfect language, we are going to tackle those weaknesses here and show effective ways to overcome them. We can consider the first objection temporary, at least on an individual basis: once a user gets through the valley of despair, they never return to it and the weakness is forgotten. You do not know about the valley of despair? Let me show you a plot, and then we can discuss it:

It is common wisdom that everyone who starts to learn something new and sufficiently complex goes through three different phases:

  • The honeymoon, where they fall in love with the new subject and feel confident that they can easily master it
  • The valley of despair, where everything starts looking impossible and disappointing
  • The rest of the story, where they start to take a more realistic view of the new topic, their mastery of it grows, and so does their confidence

Moving on to the second weakness, R's difficulty in handling large datasets is a more structural aspect of the language, and therefore requires structural changes to the language and strategic cooperation between it and other tools. In the next two sections, we will go through both of these weaknesses.

Learning R effectively and minimizing the effort

First of all, why is R perceived as a language that is difficult to learn? We don't have a universally accepted answer to this question. Nevertheless, we can try some reasoning on it. R is the main choice when talking about statistical data analysis and was indeed born as a language by statisticians for statisticians, and specifically for statistics students. This produced two specific features of the language:

  • No great care for the coding experience
  • A previously unseen range of statistical techniques applicable with the language, with an unprecedented level of interaction

Here, we can find reasons for the perceived steep learning curve: R wasn't conceived as a coder-friendly language, as, for instance, Julia and Swift were. Rather, it was an instrument born within the academic field for academic purposes, as we mentioned before. R's creators probably never expected their language to be employed for website development, as is the case today (you can refer to Chapter 13, Sharing your stories with your stakeholders through R markdown; take a look at the Shiny apps on this).

The second point is the feeling of disorientation that affects people, including statisticians, who come to R from other statistical analysis languages. Applying a statistical model to your data through R is an amazingly interactive process: you get your data into a model, get results, and perform diagnostics on it. Then, you iterate once again or perform cross-validation, all with a really high level of flexibility. This is not exactly what a SAS or SPSS user is used to. Within these two languages, you just take your data, send it to a function, and wait for a comprehensive and seemingly endless set of results.
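To make this interactive loop concrete, here is a minimal sketch using only base R and the built-in mtcars dataset. The choice of variables and the object names (fit1, fit2, comparison) are purely illustrative assumptions and are not taken from the case study in this book:

```r
# Step 1: get your data into a model.
fit1 <- lm(mpg ~ wt, data = mtcars)

# Step 2: get results, asking only for what you need.
print(coef(fit1))
print(summary(fit1)$r.squared)

# Step 3: perform diagnostics, here a quick look at the residuals.
print(summary(residuals(fit1)))

# Step 4: iterate by adding a predictor and comparing the nested models.
fit2 <- update(fit1, . ~ . + hp)
comparison <- anova(fit1, fit2)
print(comparison)
```

Each step returns an ordinary R object that you can inspect, store, or pass to the next step, which is where the flexibility described above comes from.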

Is this the end of the story? Do we need to passively accept this history-rooted steep learning curve? Of course we don't, and the R community is actually actively involved in the task of leveling this curve, following two main paths:

  • Improving the R coding experience
  • Developing high-quality learning materials

The tidyverse

Because it is so widespread throughout the R community, it is almost impossible nowadays to talk about R without mentioning the so-called tidyverse. This original name stands for a framework of concepts and functions, developed mainly by Hadley Wickham, that brings R closer to a modern programming experience. Introducing you to the magical world of the tidyverse is out of the scope of this book, but I would like to briefly explain how the framework is composed. The tidyverse is usually taken to include at least the following four packages:

  • readr: For data import
  • dplyr: For data manipulation 
  • tidyr: For data cleaning 
  • ggplot2: For data visualization
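To give a flavor of how these four packages fit together, here is a minimal sketch. It assumes that readr, dplyr, tidyr (version 1.0.0 or later, for pivot_longer()), and ggplot2 are installed. The temporary CSV file and the object names (cars, by_cyl, long, p) are made up for illustration only:

```r
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# Write the built-in mtcars data to a temporary CSV so there is a file to import.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# readr: data import.
cars <- read_csv(csv_path)

# dplyr: data manipulation (filter rows, then summarise by group).
by_cyl <- cars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_wt = mean(wt))

# tidyr: reshape to a long layout, one row per (cyl, measure) pair.
long <- by_cyl %>%
  pivot_longer(cols = c(mean_mpg, mean_wt),
               names_to = "measure", values_to = "value")

# ggplot2: data visualization; print(p) draws the chart.
p <- ggplot(long, aes(x = factor(cyl), y = value)) +
  geom_col() +
  facet_wrap(~ measure, scales = "free_y") +
  labs(x = "Number of cylinders", y = "Mean value")
```

Note how each package hands a data frame to the next one, which is a large part of what makes the framework feel coherent.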

Due to its great success, an ever-increasing amount of learning material has been created on this topic, and this leads us to the next section.

Leveraging the R community to learn R

One of the most exciting aspects of the R world is the vital community surrounding it. In the beginning, the community was mainly composed of statisticians and academics who encountered this powerful tool in the course of their studies. Nowadays, while statisticians and academics are still in the game, the R community also includes a great variety of professionals from different fields, from finance to chemistry and genetics. It is commonly acknowledged that this community is one of the distinguishing features of the R language. It is also a great asset for every newcomer to the language, since it is composed of people who are generally friendly and unpretentious, and open to helping you with your first steps. I guess this is, generally speaking, good news, but you may be wondering: how do I actually leverage this amazing community you are introducing me to? First of all, let us find it by looking at the places, both virtual and physical, where you can experience the community. We will then look at practical ways to leverage community-driven content to learn R.

Where to find the R community

There are different places, both physical and virtual, where it is possible to communicate with the R community. The following is a tentative list to get you up and running:

Virtual places:

  • R-bloggers
  • Twitter hashtag #rstats
  • Google+ community
  • Stack Overflow R tagged questions
  • R-help mailing list

Physical places:

  • The annual R conference
  • The RStudio developer conference
  • The R meetup

Engaging with the community to learn R

Now that we know where to find the community, let's take a closer look at how to take advantage of it. We can distinguish three alternative and non-exclusive ways:

  • Employing community-driven learning material
  • Asking for help from the community
  • Staying ahead of language developments

Employing community-driven learning material: There are two main kinds of R learning materials developed by the community:

  • Papers, manuals, and books
  • Online interactive courses

Papers, manuals, and books: This is certainly the more traditional kind, but you shouldn't neglect it, since these learning materials tend to give you a more organic and systematic understanding of the topics they treat. You can find a lot of free material online in the form of papers, manuals, and books.

Let me point out the most useful ones:

  • Advanced R
  • R for Data Science
  • An Introduction to Statistical Learning
  • OpenIntro Statistics
  • The R Journal

Online interactive courses: This is probably the most common learning material nowadays. You can find different platforms delivering good content on the R language, the most famous of which are probably DataCamp, Udemy, and Packt itself. What all of them share is a practical and interactive approach that lets you learn the topic directly, applying it through exercises rather than passively looking at someone explaining theoretical stuff.

Asking for help from the community: As soon as you start writing your first lines of R code, and perhaps even before, you will come up with questions related to your work. The best thing you can do when this happens is to turn to the community for answers. You will probably not be the first person to have that question, so you should first of all look online for previous answers to it.

Where should you look for answers? You can look everywhere, but most of the time you will find the answer you are looking for in one of the following places (listed in decreasing order of the probability of finding the answer there):

  • Stack Overflow
  • R-help mailing list
  • R package documentation
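The last of these sources can be queried without leaving the R console. The following is a minimal sketch using only functions from base R and the utils package. The search terms and the object names (read_funs, lm_help, found) are arbitrary examples:

```r
# Find functions on the search path whose names match a pattern.
read_funs <- apropos("^read\\.")
print(read_funs)

# Show the arguments of a function at a glance.
print(args(lm))

# Locate the help page for a topic; print(lm_help), or ?lm in an
# interactive session, opens the page itself.
lm_help <- help("lm", package = "stats")

# Full-text search of installed documentation, restricted here to the
# stats package to keep it quick; print(found) lists the matching pages.
found <- help.search("linear models", package = "stats")
```

When you do not know a function's exact name, apropos() and help.search() are usually the quickest starting points.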

I wouldn't suggest looking for answers on Twitter, Google+, and similar networks, since they were not conceived to handle this kind of process, and you would expose yourself to the risk of reading answers that are out of date or simply incorrect, because these networks have no review system.

If you are asking a genuinely new question that nobody has asked before, first of all, congratulations! In that happy circumstance, you can ask your question in the same places where you previously looked for answers.

Staying ahead of language developments: The R language landscape is constantly changing, thanks to the contributions of many enthusiastic users who take it a step further every day. How can you stay ahead of those changes? This is where social networks come in handy. Following the #rstats hashtag on Twitter, Google+ groups, and similar places will give you the pulse of the language. You will also find the R-bloggers aggregator really useful: it delivers a daily newsletter made up of the R-related blog posts published the previous day. Finally, annual R conferences and similar occasions are a great opportunity to get in touch with the most renowned R experts and to gain useful insights and inspiration about the future of the language.

Handling large datasets with R

The second of the weaknesses mentioned earlier relates to the handling of large datasets. Where does this weakness come from? It is rooted in the core of the language: R is in-memory software. This means that every object created and managed within an R script is stored in your computer's RAM, so the total size of your data cannot exceed the total size of your RAM (and that is assuming no other software is consuming RAM, which is unrealistic). Answers to this problem are out of the scope of this book. Nevertheless, we can briefly summarize them into three main strategies:

  • Optimizing your code, profiling it with packages such as profvis, and applying programming best practices.
  • Relying on external data storage and wrangling tools, such as Spark, MongoDB, and Hadoop. We will reason a bit more on this in later chapters.
  • Changing R memory handling behavior, employing packages such as ff, filehash, R.huge, or bigmemory, that try to avoid RAM overloading.
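To make the in-memory constraint and the first strategy more tangible, here is a minimal sketch in base R. The helper names (estimate_gb, slow_squares, fast_squares) are hypothetical and exist only for this illustration, and the size estimate ignores object overhead and non-numeric columns:

```r
# One million double-precision numbers take about 8 bytes each in RAM.
x_double <- numeric(1e6)
print(object.size(x_double), units = "MB")

# The same number of integers takes about half of that.
x_int <- integer(1e6)
print(object.size(x_int), units = "MB")

# Rough estimate, in GiB, of a purely numeric table before loading it:
# rows * columns * 8 bytes.
estimate_gb <- function(n_rows, n_cols) n_rows * n_cols * 8 / 1024^3
print(estimate_gb(1e8, 10))  # about 7.45

# A common best practice: prefer vectorized code to growing a vector in a loop.
slow_squares <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)  # reallocates at every iteration
  out
}
fast_squares <- function(n) seq_len(n)^2

print(system.time(slow_squares(2e4)))
print(system.time(fast_squares(2e4)))
```

An estimate like this tells you early on whether a dataset can fit in RAM at all or whether one of the other two strategies is needed.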

The main point I would like to stress here is that even this weakness can be overcome. You should bear this in mind when you encounter it for the first time on your R mastery journey.

One final note: as the price of computing power keeps falling, the issue of handling large datasets will become less and less of a concern.