R Data Mining

Overview of this book

R is widely used to apply data mining techniques across many different industries, including finance, medicine, and scientific research. This book will help you produce and present impressive analyses from data by selecting and implementing the appropriate data mining techniques in R. You will gain these skills while immersed in a one-of-a-kind data mining crime case, where you will be asked to help solve a real fraud case affecting a commercial company, by means of both basic and advanced data mining techniques. While following the plot of the story, you will learn and practice, on real data, the various R packages commonly employed for this kind of task. You will also get the chance to apply some of the most popular and effective data mining models and algorithms, from basic multiple linear regression to the more advanced support vector machines. Unlike other data mining learning resources, this book will expose you to the theory behind these models, their relevant assumptions, and when they can be applied to the data you are facing. By the end of the book, you will hold a new and powerful toolbox of instruments, knowing when and how to employ each of them to solve your data mining problems and get the most out of your data. Finally, to help you get the most from the concepts described, the book comes with a reproducible bundle of commented R scripts and a practical set of data mining model cheat sheets.

R's weaknesses and how to overcome them

When you talk about R to an experienced technologist, you will probably hear two main objections to the language:

Its steep learning curve
Its difficulty in handling large datasets

You will soon discover that those are actually the two main weaknesses of the language. Without pretending that R is a perfect language, we are going to tackle those weaknesses here, showing effective ways to overcome them. We can actually consider the first of the mentioned objections temporary, at least on an individual basis: once you get through the valley of despair, you never go back to it and the weakness is forgotten. You do not know about the valley of despair? Let me show you a plot, and then we can discuss it:

It is common wisdom that everyone who starts to learn something new and sufficiently complex goes through three different phases:

The honeymoon, where you fall in love with the new subject and feel confident that you can easily master it
The valley of despair, where everything starts looking impossible and disappointing
The rest of the story, where you start to have a more realistic view of the new topic, your mastery of it starts increasing, and so does your level of confidence

Moving on to the second weakness, we have to say that R's difficulty in handling large datasets is a more structural aspect of the language, and therefore requires some structural changes to the language and strategic cooperation between it and other tools. In the next two sections, we will go through both of the aforementioned weaknesses.

Learning R effectively and minimizing the effort

First of all, why is R perceived as a language that is difficult to learn? We don't have a universally accepted answer to this question. Nevertheless, we can try some reasoning on it. R is the main choice when talking about statistical data analysis and was indeed born as a language by statisticians for statisticians, and specifically for statistics students. This produced two specific features of the language:

No great care for the coding experience
A previously unseen range of statistical techniques applicable with the language, with an unprecedented level of interaction

Here, we can find reasons for the perceived steep learning curve: R wasn't conceived as a coder-friendly language, as, for instance, Julia and Swift were. Rather, it was an instrument born within the academic field for academic purposes, as we mentioned before. R's creators probably never expected their language to be employed for website development, as is the case today (you can refer to Chapter 13, Sharing your stories with your stakeholders through R markdown; take a look at the Shiny apps on this).

The second point is the feeling of disorientation that affects people, including statisticians, coming to R from other statistical analysis languages. Applying a statistical model to your data through R is an amazingly interactive process, where you get your data into a model, get results, and perform diagnostics on it. Then, you iterate once again or perform cross-validation techniques, all with a really high level of flexibility. This is not exactly what a SAS or SPSS user is used to. Within these two languages, you just take your data, send it to a function, and wait for a comprehensive and seemingly endless set of results.
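The interactive loop described above can be sketched in a few lines of base R. This is only a minimal illustration: the built-in mtcars dataset and the choice of predictors (wt, then hp) are assumptions made for the example, not part of the case study in this book.

```r
# Step 1: get your data into a model (built-in mtcars data, chosen for illustration)
fit1 <- lm(mpg ~ wt, data = mtcars)

# Step 2: get results; you ask only for what you need
r2_first <- summary(fit1)$r.squared

# Step 3: perform diagnostics, here a numeric summary of the residuals
# (plot(fit1) would show the standard diagnostic plots in an interactive session)
res <- residuals(fit1)
res_summary <- summary(res)

# Step 4: iterate, adding a predictor and comparing the two nested models
fit2 <- update(fit1, . ~ . + hp)
r2_second <- summary(fit2)$r.squared
comparison <- anova(fit1, fit2)
```

Each step returns an object you can inspect before deciding what to do next, which is exactly what differs from receiving one large, fixed report.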

Is this the end of the story? Do we need to passively accept this history-rooted steep learning curve? Of course we don't, and the R community is actually actively involved in the task of leveling this curve, following two main paths:

Improving the R coding experience
Developing high-quality learning materials

The tidyverse

Because it is so widespread throughout the R community, it is almost impossible nowadays to talk about R without mentioning the so-called tidyverse. This original name stands for a framework of concepts and functions developed mainly by Hadley Wickham to bring R closer to a modern programming experience. Introducing you to the magical world of the tidyverse is beyond the scope of this book, but I would like to briefly explain how the framework is composed. The tidyverse is usually considered to include at least the following four packages:

readr: For data import
dplyr: For data manipulation
tidyr: For data cleaning
ggplot2: For data visualization
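To give a feel for how the four packages fit together, here is a minimal sketch. It assumes the packages are installed in reasonably recent versions; the tiny sales table, its column names, and the object names are made up for the example.

```r
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "customer,jan,feb,mar",
  "A,100,120,90",
  "B,80,NA,110"
), csv_path)

# readr: import the file as a tibble
sales_wide <- read_csv(csv_path, show_col_types = FALSE)

# tidyr: reshape to one row per customer and month
sales_long <- pivot_longer(
  sales_wide,
  cols = c(jan, feb, mar),
  names_to = "month",
  values_to = "amount"
)

# dplyr: drop missing amounts and total the rest by customer
sales_summary <- sales_long %>%
  filter(!is.na(amount)) %>%
  group_by(customer) %>%
  summarise(total = sum(amount), .groups = "drop")

# ggplot2: build a bar chart of the totals (print(sales_plot) to display it)
sales_plot <- ggplot(sales_summary, aes(x = customer, y = total)) +
  geom_col()
```

The same verbs and the same pipe-based style recur across the packages, which is a large part of why the framework is felt to smooth the coding experience.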

Due to its great success, an ever-increasing amount of learning material has been created on this topic, and this leads us to the next section.

Leveraging the R community to learn R

One of the most exciting aspects of the R world is the lively community surrounding it. In the beginning, the community was mainly composed of statisticians and academics who encountered this powerful tool in the course of their studies. Nowadays, while statisticians and academics are still in the game, the R community also includes a great variety of professionals from different fields, from finance to chemistry and genetics. The community is commonly acknowledged to be one of the distinctive features of the R language. It is also a great asset for every newcomer to the language, since it is composed of people who are generally friendly, rather than snobbish, and open to helping you with your first steps. I guess this is, generally speaking, good news, but you may be wondering: how do I actually leverage this amazing community you are introducing me to? First of all, let us find it, looking at places, both virtual and physical, where you can experience the community. We will then look at practical ways to leverage community-driven content to learn R.

Where to find the R community

There are different places, both physical and virtual, where it is possible to communicate with the R community. The following is a tentative list to get you up and running:

Virtual places:

R-bloggers
Twitter hashtag #rstats
Google+ community
Stack Overflow R tagged questions
R-help mailing list

Physical places:

The annual R conference
The RStudio developer conference
The R meetup

Engaging with the community to learn R

Now that we know where to find the community, let's take a closer look at how to take advantage of it. We can distinguish three alternative and non-exclusive ways:

Employing community-driven learning material
Asking for help from the community
Staying ahead of language developments

Employing community-driven learning material: There are two main kinds of R learning materials developed by the community:

Papers, manuals, and books
Online interactive courses

Papers, manuals, and books: This is certainly the more traditional kind, but you shouldn't neglect it, since these learning materials can give you a more organic and systematic understanding of the topics they cover. You can find a lot of free material online in the form of papers, manuals, and books.

Let me point out the most useful ones:

Advanced R
R for Data Science
Introduction to Statistical Learning
OpenIntro Statistics
The R Journal

Online interactive courses: This is probably the most common learning material nowadays. You can find different platforms delivering good content on the R language, the most famous of which are probably DataCamp, Udemy, and Packt itself. What all of them share is a practical and interactive approach that lets you learn the topic directly, applying it through exercises rather than passively looking at someone explaining theoretical stuff.

Asking for help from the community: As soon as you start writing your first lines of R code, and perhaps even before that, you will come up with questions related to your work. The best thing you can do when this happens is to turn to the community. You will probably not be the first one to ask that question, so you should first look online for previous answers to it.

Where should you look for answers? You can look everywhere, but most of the time you will find the answer you are looking for on one of the following (listed by the probability of finding the answer there):

Stack Overflow
R-help mailing list
R packages documentation
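Package documentation, the last item in the list above, can also be consulted without leaving R. The following is a minimal sketch using base R only; read.csv is just a convenient example of a function to look up.

```r
# Find functions whose name starts with "read.csv" among the attached packages
matches <- apropos("^read\\.csv")

# List the arguments a function accepts
read_csv_args <- names(formals(read.csv))

# Locate the help page; in an interactive session, help("read.csv")
# or ?read.csv opens it directly
help_page <- help("read.csv")

# In an interactive session, example("read.csv") would run the examples
# from the help page, and vignette() would list the available long-form guides
```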

I wouldn't suggest looking for answers on Twitter, G+, and similar networks, since they were not conceived to handle this kind of process, and you would expose yourself to the risk of reading answers that are out of date, or simply incorrect, because no review system is in place.

If it is the case that you are asking an innovative question never previously asked by anyone, first of all, congratulations! That said, in that happy circumstance, you can ask your question in the same places that you previously looked for answers.
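When you do post a question, it usually helps to include a small piece of data that others can paste into their own session. A minimal sketch of one common way to do this with base R's dput follows; the slice of the built-in mtcars dataset is only a stand-in for your own data.

```r
# A small slice of data that reproduces the problem (mtcars used as a stand-in)
small_df <- head(mtcars[, c("mpg", "wt")], 3)

# dput() writes R code that recreates the object; capture it as text
# so that it can be pasted into the question
dput_text <- paste(capture.output(dput(small_df)), collapse = "\n")

# Anyone reading the question can rebuild the object by evaluating that text
rebuilt <- eval(parse(text = dput_text))
same_data <- isTRUE(all.equal(small_df, rebuilt))
```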

Staying ahead of language developments: The R language landscape is constantly changing, thanks to the contributions of many enthusiastic users who take it a step further every day. How can you stay ahead of those changes? This is where social networks come in handy. Following the #rstats hashtag on Twitter, Google+ groups, and similar places will give you the pulse of the language. Moreover, you will find the R-bloggers aggregator really useful: it delivers a daily newsletter made up of the R-related blog posts published the previous day. Finally, annual R conferences and similar occasions are a great opportunity to get in touch with the most renowned R experts and to gain from them useful insights and inspiring talks about the future of the language.

Handling large datasets with R

The second of the weaknesses mentioned earlier relates to the handling of large datasets. Where does this weakness come from? It is actually related to the core of the language: R is in-memory software. This means that every object created and managed within an R script is stored in your computer's RAM, and therefore that the total size of your data cannot be greater than the total size of your RAM (and that is assuming that no other software is consuming your RAM, which is unrealistic). Answers to this problem are beyond the scope of this book. Nevertheless, we can briefly summarize them into three main strategies:

Optimizing your code, profiling it with packages such as profvis, and applying programming best practices.
Relying on external data storage and wrangling tools, such as Spark, MongoDB, and Hadoop. We will reason a bit more on this in later chapters.
Changing R memory handling behavior, employing packages such as ff, filehash, R.huge, or bigmemory, that try to avoid RAM overloading.
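The in-memory constraint behind these strategies can be observed directly from the console. The following base R sketch measures the footprint of an object and makes a rough estimate for a larger table; the row and column counts are made-up figures, and the estimate ignores any overhead beyond the 8 bytes taken by each double-precision value.

```r
# One million double-precision numbers
x <- numeric(1e6)

# Memory used by the object, in bytes and in a human-readable unit
size_bytes <- as.numeric(object.size(x))
size_label <- format(object.size(x), units = "MB")

# Back-of-the-envelope estimate for a purely numeric table
# (hypothetical dimensions: 50 million rows, 10 columns)
est_rows <- 50e6
est_cols <- 10
est_gb <- est_rows * est_cols * 8 / 1024^3
```

Comparing such an estimate with the RAM available on your machine is a quick way to tell whether one of the three strategies is needed before you even try to load the data.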

The main point I would like to stress here is that even this weakness can be overcome. You should bear this in mind when you encounter it for the first time on your R mastery journey.

One final note: as the price of computing power keeps falling, the issue of handling large datasets will become less and less significant.
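As a small illustration that the limit can be worked around, the following sketch processes a CSV file in chunks, so that only a few hundred rows are held in memory at any time. It uses base R only and is not how ff, bigmemory, or the other packages named above work internally; the file, its two columns, and the chunk size are made up for the example.

```r
# Create a made-up CSV file standing in for a file too large to load at once
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:1000, amount = rep(c(10, 20), 500)),
  csv_path,
  row.names = FALSE
)

# Open a connection and skip the header line
con <- file(csv_path, open = "r")
header <- readLines(con, n = 1)

running_total <- 0
rows_seen <- 0
chunk_size <- 250  # rows held in memory at any one time

repeat {
  # Read the next block of raw lines; an empty result means end of file
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break

  # Parse only this block and update the running aggregates
  chunk <- read.csv(text = lines, header = FALSE,
                    col.names = c("id", "amount"))
  running_total <- running_total + sum(chunk$amount)
  rows_seen <- rows_seen + nrow(chunk)
}
close(con)
```

The trade-off is that only computations that can be updated block by block, such as sums and counts, fit this pattern directly.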
