Mastering Predictive Analytics with R

By Rui Miguel Forte

Preface

Predictive analytics, and data science more generally, currently enjoy a huge surge in interest, as predictive technologies such as spam filtering, word completion, and recommendation engines have pervaded everyday life. Not only are we increasingly familiar with these technologies, but they have also earned our confidence. Advances in processing power, together with software such as R and its plethora of specialized packages, mean that users can now be trained to work with these tools without needing advanced degrees in statistics or access to hardware reserved for corporations or university laboratories. This confluence of mature techniques and readily available supporting software and hardware has many practitioners of the field excited that they can design something that will make an appreciable impact on their own domains and businesses, and rightly so.

At the same time, many newcomers to the field quickly discover that there are many pitfalls to overcome. Virtually no academic degree adequately prepares a student or professional to become a successful predictive modeler. The field draws upon many disciplines, such as computer science, mathematics, and statistics. Nowadays, people tend to approach the field with a strong background in only one of these areas, and are often specialized even within that area. Having taught several classes on the material in this book to graduate students and practicing professionals alike, I discovered that the two biggest fears students repeatedly express are the fear of programming and the fear of mathematics; interestingly, these are almost always mutually exclusive. Predictive analytics is very much a practical subject, but one with a very rich theoretical basis, knowledge of which is essential to the practitioner. Consequently, achieving mastery in predictive analytics requires a range of different skills: writing good software to implement a new technique or to preprocess data, understanding the assumptions of a model, knowing how it can be trained efficiently, how to diagnose problems, and how to tune its parameters to get better results.

It is natural at this point to take a step back and ask what predictive analytics actually covers as a field. The truth is that the boundaries between this field and other related fields, such as machine learning, data mining, business analytics, and data science, are somewhat blurred. The definition we will use in this book is very broad: for our purposes, predictive analytics is a field that uses data to build models that predict a future outcome of interest. There is certainly a big overlap with the field of machine learning, which studies programs and algorithms that learn from data more generally. The same is true of data mining, whose goal is to extract knowledge and patterns from data. Data science is rapidly becoming an umbrella term that covers all of these fields, as well as topics such as information visualization to present the findings of data analysis, business concepts surrounding the deployment of models in the real world, and data management. This book draws heavily from machine learning, but we will not cover the theoretical pursuit of the feasibility of learning, nor will we study unsupervised learning, which sets out to look for patterns and clusters in data without a particular predictive target in mind. At the same time, we will explore topics such as time series, which are not commonly discussed in a machine learning text.

R is an excellent platform to learn about predictive analytics and also to work on real-world problems. It is an open source project with an ever-burgeoning community of users; together with Python, it is one of the two languages most commonly used by data scientists around the world at the time of this writing. It has a wealth of packages specializing in different modeling techniques and application domains, many of which are directly accessible from within R itself via a connection to the Comprehensive R Archive Network (CRAN). There are also ample online resources for the language, from tutorials to online courses. In particular, we'd like to mention the excellent Cross Validated forum (http://stats.stackexchange.com/) as well as the website R-bloggers (http://www.r-bloggers.com/), which hosts a fantastic collection of articles on using R from different blogs. For readers who are a little rusty, we provide a free online tutorial chapter that evolved from a set of lecture notes given to students at the Athens University of Economics and Business.

The primary mission of this book is to bridge the gap between low-level introductory books and tutorials that emphasize intuition and practice over theory, and high-level academic texts that focus on mathematics, detail, and rigor. Another equally important goal is to instill some good practices in you, such as learning how to properly test and evaluate a model. We also emphasize important concepts, such as the bias-variance trade-off and overfitting, which are pervasive in predictive modeling and come up time and again in various guises and across different models.

From a programming standpoint, even though we assume that you are familiar with the R programming language, every code sample has been carefully explained and discussed to allow readers to develop their confidence and follow along. That being said, it is not possible to overstress the importance of actually running the code alongside the book, or at least before moving on to a new chapter. To make the process as smooth as possible, we have provided code files for every chapter in the book containing all the code samples in the text. In addition, in a number of places, we have written our own, albeit very simple, implementations of certain techniques. Two examples that come to mind are the pocket perceptron algorithm in Chapter 4, Neural Networks, and AdaBoost in Chapter 7, Ensemble Methods. In part, this is done in an effort to encourage users to learn how to write their own functions instead of always relying on existing implementations, as these may not always be available.

Reproducibility is a critical skill in the analysis of data and is not limited to educational settings. For this reason, we have exclusively used freely available data sets and have endeavored to apply specific seeds wherever random number generation has been needed. Finally, we have tried wherever possible to use data sets of a relatively small size, in order to ensure that you can run the code while reading the book without having to wait too long or needing access to better hardware than you might have. We will remind you that in the real world, patience is an incredibly useful virtue, as most data sets of interest will be larger than the ones we will study.
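
To make the idea of seeding concrete, here is a minimal sketch of the practice (the seed value and the sampling calls are our own illustration, not taken from any chapter):

> # Fix the state of R's random number generator
> set.seed(42)
> sample_a <- rnorm(5)
> # Setting the same seed again makes the next draw identical
> set.seed(42)
> sample_b <- rnorm(5)
> identical(sample_a, sample_b)
[1] TRUE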

While each chapter ends in two or more practical modeling examples, every chapter begins with some of the theory and background necessary to understand a new model or technique. While we have not shied away from using mathematics to explain important details, we have been very mindful to introduce just enough to ensure that you understand the fundamental ideas involved. This is in line with the book's philosophy of bridging the gap to academic textbooks that go into more detail. Readers with a high school background in mathematics should trust that they will be able to follow all of the material in this book with the aid of the explanations given. The key skills needed are basic calculus, such as simple differentiation, and key ideas in probability, such as mean, variance, and correlation, as well as important distributions such as the binomial and normal distributions. While we don't provide any tutorials on these, in the early chapters we do try to take things particularly slowly. To address the needs of readers who are more comfortable with mathematics, we often provide additional technical details in the form of tips and give references that act as natural follow-ups to the discussion.
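
For readers who want to check their footing with these concepts, the following minimal sketch (our own illustration, not an excerpt from any chapter) shows how R's built-in functions compute them:

> # Mean, variance, and correlation of two small numeric vectors
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 5, 4, 6)
> mean(x)
[1] 3
> var(x)
[1] 2.5
> cor(x, y)
[1] 0.8528029
> # Random draws from the binomial and normal distributions
> # (results will vary unless you set a seed first)
> rbinom(3, size = 10, prob = 0.5)
> rnorm(3, mean = 0, sd = 1)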

Sometimes, we have had to give an intuitive explanation of a concept in order to conserve space and avoid creating a chapter with an undue emphasis on pure theory. Wherever this is done, such as with the backpropagation algorithm in Chapter 4, Neural Networks, we have ensured that we explain enough to give the reader a firm enough hold on the basics to tackle a more detailed treatment. At the same time, we have given carefully selected references, many of which are articles, papers, or online texts that are both readable and freely available. Of course, we refer to seminal textbooks wherever necessary.

The book has no exercises, but we hope that you will engage your curiosity to its maximum potential. Curiosity is a huge boon to the predictive modeler. Many of the websites from which we obtain the data that we analyze host a number of other data sets that we do not investigate. We also occasionally show how we can generate artificial data to demonstrate the proof of concept behind a particular technique. Many of the R functions used to build and train models have other tuning parameters that we don't have time to investigate. The packages that we employ often contain functions related to those that we study, just as there are usually alternatives to the proposed packages themselves. All of these are excellent avenues for further investigation and experimentation. Mastering predictive analytics comes just as much from careful study as from personal inquiry and practice.
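
As a taste of what generating artificial data can look like, here is a minimal, hypothetical sketch (not one of the book's own examples) that simulates data from a known linear relationship and checks that a fitted model recovers it:

> set.seed(1)
> x <- runif(100, min = 0, max = 10)   # an artificial predictor
> y <- 2 + 3 * x + rnorm(100)          # a known relationship plus noise
> coef(lm(y ~ x))                      # estimates should be close to 2 and 3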

A common request from students of the field is for additional worked examples that simulate the actual process an experienced modeler follows on a data set. In reality, a faithful simulation would take as many hours as the original analysis did, because most of the time spent in predictive modeling goes into studying the data, trying new features and preprocessing steps, and experimenting with different models on the result. In short, as we will see in Chapter 1, Gearing Up for Predictive Modeling, exploration and trial and error are key components of an effective analysis. It would have been entirely impractical to compose a book that shows every wrong turn or unsuccessful alternative attempted on every data set. Instead, we fervently recommend that you treat every data analysis in this book as a starting point to improve upon, and continue this process on your own. A good idea is to apply techniques from other chapters to a particular data set in order to see what else might work. This could be anything from simply applying a different transformation to an input feature to using a completely different model from another chapter.

As a final note, we should mention that creating polished and presentable graphics to showcase the findings of a data analysis is a very important skill, especially in the workplace. While R's base plotting capabilities cover the basics, they often lack a polished feel. For this reason, we have used the ggplot2 package throughout, except where a specific plot is generated by a function that is part of our analysis. Although we do not provide a tutorial for ggplot2, all the code used to generate the plots in this book is included in the supporting code files, and we hope that you will benefit from this as well. A useful online reference for the ggplot2 package is the section on graphs in the Cookbook for R website (http://www.cookbook-r.com/Graphs).
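
To give a flavor of the package, here is a minimal ggplot2 sketch using R's built-in iris data set (an illustration of ours; this particular plot does not appear in the book):

> library(ggplot2)
> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
+     geom_point() +                      # scatter plot of the raw measurements
+     labs(x = "Sepal length", y = "Petal length")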

What this book covers

Chapter 1, Gearing Up for Predictive Modeling, begins our journey by establishing a common language for statistical models and a number of important distinctions we make when categorizing them. The highlight of the chapter is an exploration of the predictive modeling process, through which we showcase our first model, the k-Nearest Neighbors (kNN) model.

Chapter 2, Linear Regression, introduces the simplest and most well-known approach to predicting a numerical quantity. The chapter focuses on understanding the assumptions of linear regression and a range of diagnostic tools that are available to assess the quality of a trained model. In addition, the chapter touches upon the important concept of regularization, which addresses overfitting, a common ailment of predictive models.

Chapter 3, Logistic Regression, extends the idea of a linear model from the previous chapter by introducing the concept of a generalized linear model. While there are many examples of such models, this chapter focuses on logistic regression as a very popular method for classification problems. We also explore extensions of this model for the multiclass setting and discover that this method works best for binary classification.

Chapter 4, Neural Networks, presents a biologically inspired model that is capable of handling both regression and classification tasks. There are many different kinds of neural networks, so this chapter devotes itself to the multilayer perceptron network. Neural networks are complex models, and this chapter focuses substantially on understanding the range of different configuration and optimization parameters that play a part in the training process.

Chapter 5, Support Vector Machines, builds on the theme of nonlinear models by studying support vector machines. Here, we discover a different way of thinking about classification problems by trying to fit our training data geometrically using maximum margin separation. The chapter also introduces cross-validation as an essential technique to evaluate and tune models.

Chapter 6, Tree-based Methods, covers decision trees, yet another family of models that have been successfully applied to regression and classification problems alike. There are several flavors of decision trees, and this chapter presents a number of different training algorithms, such as CART and C5.0. We also learn that tree-based methods offer unique benefits, such as built-in feature selection, support for missing data and categorical variables, as well as a highly interpretable output.

Chapter 7, Ensemble Methods, takes a detour from the usual motif of showcasing a new type of model, and instead tries to answer the question of how to effectively combine different models together. We present the two widely known techniques of bagging and boosting and introduce the random forest as a special case of bagging with trees.

Chapter 8, Probabilistic Graphical Models, tackles an active area of machine learning research, that of probabilistic graphical models. These models encode conditional independence relations between variables via a graph structure, and have been successfully applied to problems in a diverse range of fields, from computer vision to medical diagnosis. The chapter studies two main representatives, the Naïve Bayes model and the hidden Markov model. The latter, in particular, has been successfully used in sequence prediction problems, such as predicting gene sequences and labeling sentences with part-of-speech tags.

Chapter 9, Time Series Analysis, studies the problem of modeling a particular process over time. A typical application is forecasting the future price of crude oil given historical data on the price of crude oil over a period of time. While there are many different ways to model time series, this chapter focuses on ARIMA models while discussing a few alternatives.

Chapter 10, Topic Modeling, is unique in this book in that it presents topic modeling, an approach that has its roots in clustering and unsupervised learning. Nonetheless, we study how this important method can be used in a predictive modeling scenario. The chapter emphasizes the most commonly known approach to topic modeling, Latent Dirichlet Allocation (LDA).

Chapter 11, Recommendation Systems, wraps up the book by discussing recommendation systems that analyze the preferences of a set of users interacting with a set of items, in order to make recommendations. A famous example of this is Netflix, which uses a database of ratings made by its users on movie rentals to make movie recommendations. The chapter casts a spotlight on collaborative filtering, a purely data-driven approach to making recommendations.

Introduction to R gives an introduction to and overview of the R language, provided as a way for readers to get up to speed so that they can follow the code samples in this book. It is available as an online chapter at https://www.packtpub.com/sites/default/files/downloads/Mastering_Predictive_Analytics_with_R_Chapter.

What you need for this book

The only strong requirement for running the code in this book is an installation of R. This is freely available from http://www.r-project.org/ and runs on all the major operating systems. The code in this book has been tested with R version 3.1.3.

All the chapters introduce at least one new R package that does not come with the base installation of R. We do not explicitly show the installation of R packages in the text, but if a package is not currently installed on your system or if it requires updating, you can install it with the install.packages() function. For example, the following command installs the tm package:

> install.packages("tm")
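
If you would rather install a package only when it is missing, one common idiom (our suggestion rather than a pattern the book itself relies on) is:

> # Install tm only if it is not already available on this system
> if (!requireNamespace("tm", quietly = TRUE)) {
+     install.packages("tm")
+ }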

All the packages we use are available on CRAN. An Internet connection is needed to download and install them as well as to obtain the open source data sets that we use in our real-world examples. Finally, even though not absolutely mandatory, we recommend that you get into the habit of using an Integrated Development Environment (IDE) to work with R. An excellent offering is RStudio (http://www.rstudio.com/), which is open source.

Who this book is for

This book is intended for budding and seasoned practitioners of predictive modeling alike. Most of its material has been used in lectures to graduate students and working professionals, as well as in R schools, so it has also been designed with the student in mind. Readers should be familiar with R, but even those who have never worked with the language should be able to pick up the necessary background by reading the online tutorial chapter. Readers unfamiliar with R should have had at least some exposure to other programming languages, such as Python; those with a background in MATLAB will find the transition particularly easy. As mentioned earlier, the mathematical requirements for the book are very modest, assuming only certain elements of high school mathematics, such as the concepts of mean and variance and basic differentiation.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Finally, we'll use the sort() function of R with the index.return parameter set to TRUE."

A block of code is set as follows:

> iris_cor <- cor(iris_numeric)
> findCorrelation(iris_cor)
[1] 3
> findCorrelation(iris_cor, cutoff = 0.99)
integer(0)
> findCorrelation(iris_cor, cutoff = 0.80)
[1] 3 4

New terms and important words are shown in bold.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books — maybe a mistake in the text or the code — we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.