Learning Predictive Analytics with R

By: Eric Mayor

Overview of this book

This book is packed with easy-to-follow guidelines that explain the workings of the many key data mining tools of R, which are used to discover knowledge from your data. You will learn how to perform key predictive analytics tasks using R, such as training and testing predictive models for classification and regression tasks, scoring new datasets, and so on. All chapters will guide you in acquiring these skills in a practical way. Most chapters also include a theoretical introduction that will sharpen your understanding of the subject matter and invite you to go further. The book familiarizes you with the most common data mining tools of R, such as k-means, hierarchical clustering, linear regression, association rules, principal component analysis, multilevel modeling, k-NN, Naïve Bayes, decision trees, and text mining. It also provides a description of visualization techniques using the basic visualization tools of R, as well as lattice for visualizing patterns in data organized in groups. This book is invaluable for anyone fascinated by the data mining opportunities offered by GNU R and its packages.

Preface

The amount of data in the world is increasing exponentially as time passes. It is estimated that the total amount of data produced in 2020 will be 20 zettabytes (Kotov, 2014), that is, 20 billion terabytes. Organizations spend a lot of effort and money on collecting and storing data, and still, most of it is not analyzed at all, or not analyzed properly. One reason to analyze data is to predict the future, that is, to produce actionable knowledge. The main purpose of this book is to show you how to do that with reasonably simple algorithms. The book is composed of chapters describing the algorithms and their use, and of appendices providing exercises, solutions to the exercises, and references.

Prediction

What is meant by prediction? The answer, of course, depends on the field and the algorithms used, but the following explanation holds most of the time: given attested, reliable relationships between indicators (predictors) and an outcome, the presence (or level) of the indicators in similar cases is a reliable clue to the presence (or level) of the outcome in the future. Here are some examples of such relationships, starting with the most obvious:

  • Taller people weigh more

  • Richer individuals spend more

  • More intelligent individuals earn more

  • Customers in segment X buy more of product Y

  • Customers who bought product P will also buy product Q

  • Products P and Q are bought together

  • Some credit card transactions predict fraud (Chan et al., 1999)

  • Google search queries predict influenza infections (Ginsberg et al., 2009)

  • Tweet content predicts election poll outcomes (O'Connor and Balasubramanyan, 2010)

In the following sections, we provide minimal definitions of the distinction between supervised and unsupervised learning, and between classification and regression problems.

Supervised and unsupervised learning

Two broad families of algorithms will be discussed in this book:

  • Unsupervised learning algorithms

  • Supervised learning algorithms

Unsupervised learning

In unsupervised learning, the algorithm seeks to find the structure that organizes unlabeled data. For instance, based on the similarities or distances between observations, an unsupervised cluster analysis will determine groups and which observations fit best into each of the groups. An application of this is, for instance, the grouping of documents by topic.
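
As a minimal sketch (our example, not the book's), the following lines run such a cluster analysis in R: k-means is applied to the flower measurements of the built-in iris dataset, ignoring the known species labels; the choice of three clusters is an assumption made for illustration.

# Unsupervised learning sketch: k-means on the iris measurements,
# without using the species labels
set.seed(1)
clusters = kmeans(iris[, 1:4], centers = 3)
# Compare the discovered clusters with the (unused) species labels
table(clusters$cluster, iris$Species)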

Supervised learning

In supervised learning, we know the class or the level of a target attribute for some observations. When performing a prediction, we use known relationships in labeled data (data for which we know what the class or level of the target attribute is) to predict the class or the level of the attribute in new cases (for which we do not know the value).
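
As a minimal sketch of this workflow (again our example, with an arbitrary split of the built-in iris dataset into labeled and new cases), the following lines train a k-nearest neighbors classifier on labeled observations and use it to predict the class of observations whose label we pretend not to know.

# Supervised learning sketch: learn from labeled cases, predict new ones
library(class)                       # provides knn()
set.seed(1)
labeled_rows = sample(1:150, 100)    # cases with a known class
train = iris[labeled_rows, ]
new_cases = iris[-labeled_rows, ]    # pretend the class is unknown here
predicted = knn(train[, 1:4], new_cases[, 1:4], cl = train$Species, k = 3)
table(predicted, new_cases$Species)  # compare predictions with the true classes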

Classification and regression problems

There are basically two types of problems that predictive modeling deals with:

  • Classification problems

  • Regression problems

Classification

In some cases, we want to predict which group an observation is part of. Here, we are dealing with a quality of the observation. This is a classification problem. Examples include:

  • The prediction of the species of plants based on morphological measurements

  • The prediction of whether individuals will develop a disease or not, based on their health habits

  • The prediction of whether an e-mail is spam or not
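
As a minimal sketch of a classification problem (our own example, not one from the book), the following lines use logistic regression on the built-in mtcars dataset to predict a quality of each observation, namely whether a car has a manual or an automatic transmission.

# Classification sketch: predict a category (transmission type, am = 1 for manual)
fit = glm(am ~ wt + hp, data = mtcars, family = binomial)
# Turn predicted probabilities into predicted classes
predicted_class = ifelse(predict(fit, type = "response") > 0.5, 1, 0)
table(predicted_class, mtcars$am)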

Regression

In other cases, we want to predict an observation's level on an attribute. Here, we are dealing with a quantity, and this is a regression problem. Examples include:

  • The prediction of how much individuals will cost to health care based on their health habits

  • The prediction of the weight of animals based on their diets

  • The prediction of the number of defective devices based on manufacturing specifications
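
As a minimal sketch of a regression problem (again our own example), the following lines predict a quantity, weight, from height using the built-in women dataset, echoing the first relationship listed earlier in this preface.

# Regression sketch: predict a quantity (weight in pounds) from height (in inches)
fit = lm(weight ~ height, data = women)
summary(fit)
# Predicted weight for a hypothetical new case with a height of 70 inches
predict(fit, newdata = data.frame(height = 70))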

The role of field knowledge in data modeling

Of course, analyzing data without knowledge of the field is not a serious way to proceed. Doing so is acceptable to show how some algorithms work, how to make use of them, and for practice. However, for real-life applications, be sure that you know the topic well, or else consult experts for help. The Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) underlines the importance of field knowledge. The steps of the process are depicted as follows:

The Cross Industry Standard Process for Data Mining

As stressed in the preceding diagram, field knowledge (here called Business Understanding) informs and is informed by data understanding. The understanding of the data then informs how the data has to be prepared. The next step is data modeling, which can also lead to further data preparation. Data models have to be evaluated, and this evaluation can be informed by field knowledge (this is also stressed in the diagram), which is itself updated through the data mining process. Finally, if the evaluation is satisfactory, the models are deployed for prediction. This book will focus on the data modeling and evaluation stages.

Caveats

Of course, predictions are not always accurate, and some have written about the caveats of data science. What do you think about the relationship between the attributes titled Predictor and Outcome on the following plot? It seems like there is a relationship between the two. For the statistically inclined, I tested its significance: r = 0.4195, p = .0024. The value p is the probability of obtaining a relationship of this strength or stronger if there is actually no relationship between the attributes. We could conclude that the relationship between these variables in the population they come from is quite reliable, right?

The relationship between the attributes titled Predictor and Outcome

Believe it or not, the population these observations come from is that of randomly generated numbers. We generated a data frame of 50 columns of 50 randomly generated numbers each. We then examined all the correlations (manually) and generated a scatterplot of the two attributes with the largest correlation we found. The code is provided here, in case you want to check it yourself: line 1 sets the seed so that you find the same results as we did, line 2 creates the data frame, line 3 fills it with random numbers, column by column, line 4 generates the scatterplot, line 5 fits the regression line, and line 6 tests the significance of the correlation:

1  set.seed(1)
2  DF = data.frame(matrix(nrow=50,ncol=50))
3  for (i in 1:50) DF[,i] = runif(50)
4  plot(DF[[2]],DF[[16]], xlab = "Predictor", ylab = "Outcome")
5  abline(lm(DF[[16]]~DF[[2]]))
6  cor.test(DF[[2]], DF[[16]])

How could this relationship happen given that the odds were 2.4 in 1,000? Well, think of it; we correlated all 50 attributes with each other, which amounts to 1,225 distinct correlation tests (the 2,450 off-diagonal cells of the correlation matrix count each pair twice, and we do not consider the correlation of each attribute with itself). Such a spurious correlation was to be expected. The usual threshold below which we consider a relationship significant is p = 0.05, as we will discuss in Chapter 8, Probability Distributions, Covariance, and Correlation. This means that we expect to be wrong once in 20 times. You would be right to suspect that there are other significant correlations in the generated data frame (there should be around 61 of them in total). This is the reason why we should always correct for the number of tests. In our example, as we performed 1,225 tests, our threshold for significance should be about 0.0000408 (0.05 / 1,225). This is called the Bonferroni correction.
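
As a short sketch of how this correction can be applied in practice (our addition, reusing the DF data frame created by the preceding code block), the following lines compute the p-values of all distinct pairwise correlation tests and adjust them with p.adjust():

# Bonferroni sketch: p-values for all 1,225 distinct pairs of attributes in DF
pairs = combn(50, 2)
pvals = apply(pairs, 2, function(idx) cor.test(DF[[idx[1]]], DF[[idx[2]]])$p.value)
sum(pvals < 0.05)                                   # spurious "significant" correlations
sum(p.adjust(pvals, method = "bonferroni") < 0.05)  # how many survive the correction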

Spurious correlations are always a possibility in data analysis and this should be kept in mind at all times. A related concept is that of overfitting. Overfitting happens, for instance, when a weak classifier bases its prediction on the noise in data. We will discuss overfitting in the book, particularly when discussing cross-validation in Chapter 14, Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML. All the chapters are listed in the following section.

We hope you enjoy reading the book and hope you learn a lot from us!

What this book covers

Chapter 1, Setting GNU R for Predictive Analytics, deals with setting up R, how to load and install packages, and other basic operations. Only beginners should read this. If you are not a beginner, you will be bored! (Beginners should find the chapter entertaining.)

Chapter 2, Visualizing and Manipulating Data Using R, deals with basic visualization functions in R and data manipulation. This chapter also aims to bring beginners up to speed for the rest of the book.

Chapter 3, Data Visualization with Lattice, deals with more advanced visualization functions. The concept of multipanel conditioning plots is presented. These allow you to examine the relationship between attributes as a function of group membership (for example, women versus men). A good working knowledge of R programming is necessary from this point.

Chapter 4, Cluster Analysis, presents the concept of clustering and the different types of clustering algorithms. It shows how to program and use a basic clustering algorithm (k-means) in R. Special attention is given to the description of distance measures and how to select the number of clusters for the analyses.

Chapter 5, Agglomerative Clustering Using hclust(), deals with hierarchical clustering. It shows how to use agglomerative clustering in R and the options to configure the analysis.

Chapter 6, Dimensionality Reduction with Principal Component Analysis, discusses the uses of PCA, notably dimension reduction. How to build a simple PCA algorithm, how to use PCA, and example applications are explored in the chapter.

Chapter 7, Exploring Association Rules with Apriori, focuses on the functioning of the apriori algorithm, how to perform the analyses, and how to interpret the outputs. Among other applications, association rules can be used to discover which products are frequently bought together (market basket analysis).

Chapter 8, Probability Distributions, Covariance, and Correlation, discusses basic statistics and how they can be useful for prediction. The concepts given in the title are discussed without too much technicality, but formulas are proposed for the mathematically inclined.

Chapter 9, Linear Regression, builds upon the knowledge acquired in the previous chapter to show how to build a regression algorithm, including how to compute the coefficients and p values. The assumptions of linear regression (ordinary least squares) are briefly discussed. The chapter then focuses on the use (and misuse) of regression.

Chapter 10, Classification with k-Nearest Neighbors and Naïve Bayes, deals with classification problems using two of the most popular algorithms. We build our own k-NN algorithm, with which we analyze the famous iris dataset. We also demonstrate how Naïve Bayes works. The chapter also deals with the use of both algorithms.

Chapter 11, Classification Trees, explores classification using no fewer than five classification tree algorithms: C4.5, C5, CART (classification part), random forests, and conditional inference trees. Entropy, information gain, pruning, bagging, and other important concepts are discussed.

Chapter 12, Multilevel Analyses, deals with the use of nested data. We will briefly discuss the functioning of multilevel regression (with mixed models), and will then focus on the important aspects of the analysis, notably how to create and compare the models and how to understand the outputs.

Chapter 13, Text Analytics with R, focuses on the use of some algorithms that we discussed in other chapters, as well as new ones, with the aim of analyzing text. We will start by showing you how to perform text preprocessing and explain important concepts, and then jump right into the analysis. We will highlight the importance of testing different algorithms on the same corpus.

Chapter 14, Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML, deals with two important aspects: ascertaining the validity of the models, and exporting the models for production. Training and testing datasets are used in most chapters. These are minimal requirements, and cross-validation as well as bootstrapping are significant improvements.

Appendix A, Exercises and Solutions, provides the exercises and the solutions for the chapters in the book.

Appendix B, Further Reading and References, provides the references for the chapters in the book.

What you need for this book

All you need for this book is a working installation of R > 3.0 (on any operating system) and an active internet connection.

Who this book is for

If you are a statistician, chief information officer, data scientist, ML engineer, ML practitioner, quantitative analyst, or student of machine learning, this is the book for you. You should have basic knowledge of the use of R. Readers without previous experience of programming in R will also be able to use the tools in this book.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Now, open the R script file called helloworld.R."

A block of code is set as follows:

print("Hello world")

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "The File menu contains functions related to file handling."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail , and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/9352OS_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Questions

If you have a problem with any aspect of this book, you can contact us at , and we will do our best to address the problem.