Book Image

Mastering Predictive Analytics with R - Second Edition

By : James D. Miller, Rui Miguel Forte
Book Image

Mastering Predictive Analytics with R - Second Edition

By: James D. Miller, Rui Miguel Forte

Overview of this book

R offers a free and open source environment that is perfect for both learning and deploying predictive modeling solutions. With its constantly growing community and plethora of packages, R offers the functionality to deal with a truly vast array of problems. The book begins with a dedicated chapter on the language of models and the predictive modeling process. You will understand the learning curve and the process of tidying data. Each subsequent chapter tackles a particular type of model, such as neural networks, and focuses on the three important questions of how the model works, how to use R to train it, and how to measure and assess its performance using real-world datasets. How do you train models that can handle really large datasets? This book will also show you just that. Finally, you will tackle the really important topic of deep learning by implementing applications on word embedding and recurrent neural networks. By the end of this book, you will have explored and tested the most popular modeling techniques in use on real- world datasets and mastered a diverse range of techniques in predictive analytics using R.
Table of Contents (22 chapters)
Mastering Predictive Analytics with R Second Edition
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
8
Dimensionality Reduction
Index

Preface

Predictive analytics incorporates a variety of statistical techniques from predictive modeling, machine learning, and data mining that aim to analyze current and historical facts to produce results referred to as predictions about the future or otherwise unknown events.

R is an open source programming language that is widely used among statisticians and data miners for predictive modeling and data mining. With its constantly growing community and plethora of packages, R offers the functionality to deal with a truly vast array of problems.

This book builds upon its first edition, meaning to be both a guide and a reference to the reader wanting to move beyond the basics of predictive modeling. The book begins with a dedicated chapter on the language of models as well as the predictive modeling process. Each subsequent chapter tackles a particular type of model, such as neural networks, and focuses on the three important questions of how the model works, how to use R to train it, and how to measure and assess its performance using real-world datasets.

This second edition provides up-to-date in-depth information on topics such as Performance Metrics and Learning Curves, Polynomial Regression, Poisson and Negative Binomial Regression, back-propagation, Radial Basis Function Networks, and more. A chapter has also been added that focuses on working with very large datasets. By the end of this book, you will have explored and tested the most popular modeling techniques in use on real-world datasets and mastered a diverse range of techniques in predictive analytics.

What this book covers

Chapter 1, Gearing Up for Predictive Modeling, helps you set up and get ready to start looking at individual models and case studies, then describes the process of predictive modeling in a series of steps, and introduces several fundamental distinctions.

Chapter 2, Tidying Data and Measuring Performance, covers performance metrics, learning curves, and a process for tidying data.

Chapter 3, Linear Regression, explains the classic starting point for predictive modeling; it starts from the simplest single variable model and moves on to multiple regression, over-fitting, regularization, and describes regularized extensions of linear regression.

Chapter 4, Generalized Linear Models, follows on from linear regression, and in this chapter, introduces logistic regression as a form of binary classification, extends this to multinomial logistic regression, and uses these as a platform to present the concepts of sensitivity and specificity.

Chapter 5, Neural Networks, explains that the model of logistic regression can be seen as a single layer perceptron. This chapter discusses neural networks as an extension of this idea, along with their origins and explores their power.

Chapter 6, Support Vector Machines, covers a method of transforming data into a different space using a kernel function and as an attempt to find a decision line that maximizes the margin between the classes.

Chapter 7, Tree-Based Methods, presents various tree-based methods that are popularly used, such as decision trees and the famous C5.0 algorithm. Regression trees are also covered, as well as random forests, making the link with the previous chapter on bagging. Cross validation methods for evaluating predictors are presented in the context of these tree-based methods.

Chapter 8, Dimensionality Reduction, covers PCA, ICA, Factor analysis, and Non-negative Matrix factorization.

Chapter 9, Ensemble Methods, discusses methods for combining either many predictors, or multiple trained versions of the same predictor. This chapter introduces the important notions of bagging and boosting and how to use the AdaBoost algorithm to improve performance on one of the previously analyzed datasets using a single classifier.

Chapter 10, Probabilistic Graphical Models, introduces the Naive Bayes classifier as the simplest graphical model following a discussion of conditional probability and Bayes' rule. The Naive Bayes classifier is showcased in the context of sentiment analysis. Hidden Markov Models are also introduced and demonstrated through the task of next word prediction.

Chapter 11, Topic Modeling, provides step-by-step instructions for making predictions on topic models. It will also demonstrate methods of dimensionality reduction to summarize and simplify the data.

Chapter 12, Recommendation Systems, explores different approaches to building recommender systems in R, using nearest neighbor approaches, clustering, and algorithms such as collaborative filtering.

Chapter 13, Scaling Up, explains working with very large datasets, including some worked examples of how to train some models we've seen so far with very large datasets.

Chapter 14, Deep Learning, tackles the really important topic of deep learning using examples such as word embedding and recurrent neural networks (RNNs).

What you need for this book

In order to work with and to run the code examples found in this book, the following should be noted:

  • R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and MacOS. To download R, there are a variety of locations available, including https://www.rstudio.com/products/rstudio/download.

  • R includes extensive accommodations for accessing documentation and searching for help. A good source of information is at http://www.r-project.org/help.html.

  • The capabilities of R are extended through user-created packages. Various packages are referred to and used throughout this book and the features of and access to each will be detailed as they are introduced. For example, the wordcloud package is introduced in Chapter 11, Topic Modeling to plot a cloud of words shared across documents. This is found at https://cran.r-project.org/web/packages/wordcloud/index.html.

Who this book is for

It would be helpful if the reader has had some experience with predictive analytics and the R programming language; however, this book will also be of value to readers who are new to these topics but are keen to get started as quickly as possible.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The \Drupal\Core\Url class provides static methods to generate an instance of itself, such as ::fromRoute()."

A block of code is set as follows:

  /**
   * {@inheritdoc}
   */
  public function alterRoutes(RouteCollection $collection) {
if ($route = $collection->get('mymodule.mypage)) {
      $route->setPath('/my-page');
    }
  }

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

  /**
   * {@inheritdoc}
   */
  public function alterRoutes(RouteCollection $collection) {
if ($route = $collection->get('mymodule.mypage)) {
      $route->setPath('/my-page');
    }
  }

Any command-line input or output is written as follows:

$ php core/scripts/run-tests.sh PHPUnit

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail , and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.

  2. Hover the mouse pointer on the SUPPORT tab at the top.

  3. Click on Code Downloads & Errata.

  4. Enter the name of the book in the Search box.

  5. Select the book for which you're looking to download the code files.

  6. Choose from the drop-down menu where you purchased this book from.

  7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows

  • Zipeg / iZip / UnRarX for Mac

  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Predictive-Analytics-with-R-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringPredictiveAnalyticswithRSecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at , and we will do our best to address the problem.