Mastering Machine Learning with R - Second Edition

Overview of this book

This book will teach you advanced techniques in machine learning with the latest code in R 3.3.2. You will delve into statistical learning theory and supervised learning, design efficient algorithms, learn how to create recommendation engines, and use multi-class classification and deep learning. You will explore, in depth, topics such as data mining, classification, clustering, regression, predictive modeling, anomaly detection, and boosted trees with XGBOOST. More than just knowing the outcome, you'll understand how these concepts work and what they do. Topics such as neural networks and deep learning are introduced with a gentle learning curve. By the end of this book, you will be able to perform machine learning with R in the cloud using AWS in various scenarios with different datasets.

Preface

"A man deserves a second chance, but keep an eye on him" -John Wayne

It is not so often in life that you get a second chance. I remember that only days after we stopped editing the first edition, I kept asking myself, "Why didn't I...?", or "What the heck was I thinking saying it like that?", and on and on. In fact, the first project I started working on after it was published had nothing to do with any of the methods in the first edition. I made a mental note that if given the chance, it would go into a second edition.

When I started with the first edition, my goal was to create something different, maybe even create a work that was a pleasure to read, given the constraints of the topic. After all the feedback I received, I think I hit the mark. However, there is always room for improvement, and if you try to be everything to all people, you become nothing to everybody. I'm reminded of one of my favorite Frederick the Great quotes, "He who defends everything, defends nothing". So, I've tried to provide enough of the skills and tools, but not all of them, to get a reader up and running with R and machine learning as quickly and painlessly as possible. I think I've added some interesting new techniques that build on what was in the first edition. There will probably always be detractors who complain that it does not offer enough math or does not do this, that, or the other thing, but my answer is that books like that already exist! Why duplicate what was already done, and done very well, for that matter? Again, I have sought to provide something different, something that would keep the reader's attention and allow them to succeed in this competitive field.

Before I provide a list of the changes/improvements incorporated into the second edition, chapter by chapter, let me explain some universal changes. First of all, I have surrendered in my effort to fight the usage of the assignment operator <- versus just using =. As I shared more and more code with others, I realized I was out on my own using = and not <-. The first thing I did when under contract for the second edition was go line by line through the code and change it. The more important part, perhaps, was to clean and standardize the code. This is also important when you have to share code with coworkers and, dare I say, regulators. Recent versions of RStudio make this standardization easier. What sort of standards? Well, the first thing is to properly space the code. For instance, in the past I would not hesitate to write c(1,2,3,4,5,6). Not anymore! Now, I write c(1, 2, 3, 4, 5, 6), with a space after each comma, which makes the code easier to read. If you want other ideas, please have a look at Google's R style guide, https://google.github.io/styleguide/Rguide.xml/.

I also received a number of e-mails saying that the data I scraped off the Web was no longer available. The National Hockey League decided to launch a completely new version of their statistics, so I had to start from scratch. Problems such as that led me to put the data on GitHub.
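The coding conventions described above can be made concrete with a short sketch. The object names (x_old, x_new, x_mean) are made up for illustration and do not come from the book's code bundle.

```r
# Older habit: "=" for assignment and no spaces after commas
x_old = c(1,2,3,4,5,6)

# Standardized style: "<-" for assignment and a space after every comma
x_new <- c(1, 2, 3, 4, 5, 6)

# Both lines create the same object; only the readability differs
same_object <- identical(x_old, x_new)
print(same_object)

# "=" is still used to name arguments inside a function call
x_mean <- mean(x = x_new, na.rm = TRUE)
print(x_mean)
```

Note that the change is purely stylistic: R treats both assignments the same way at the top level, so existing results are unaffected.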

All in all, I put forth a rather large effort to put the best possible tool in your hands to get you going. On another note, in February 2017, these comments from entrepreneur Mark Cuban drew a lot of attention on the Web:

"Artificial Intelligence, deep learning, machine learning--whatever you’re doing if you don’t understand it--learn it. Because otherwise you’re going to be a dinosaur within 3 years."
"I personally think there's going to be a greater demand in 10 years for liberal arts majors than there were for programming majors and maybe even engineering, because when the data is all being spit out for you, options are being spit out for you, you need a different perspective in order to have a different view of the data. And so is having someone who is more of a freer thinker."

Besides the fact that these comments created a bit of a stir in the blogosphere, they also seem to be, at first glance, mutually exclusive. But think about what he is saying here. I think he gets to the core of why I felt compelled to write this book. Here is what I believe: machine learning needs to be embraced and utilized, to some extent, by the masses: the tired, the poor, the hungry, the proletariat, and the bourgeoisie. The growing availability of computational power and information will make machine learning something for virtually everyone. However, the flip side, and what in my mind has been and will continue to be a problem, is the communication of results. What are you going to do when you describe true positive rate and false positive rate and receive blank stares? How do you quickly tell a story that enlightens your audience? If you think it can't happen, please drop me a note; I'd be more than happy to share my story.
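For readers who want the two terms pinned down before trying to explain them to anyone else, here is a minimal sketch. The label vectors are hypothetical and were made up for this illustration; they are not data from the book.

```r
# Hypothetical labels: 1 = positive class, 0 = negative class
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Cells of the confusion matrix
tp <- sum(predicted == 1 & actual == 1)  # true positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
fp <- sum(predicted == 1 & actual == 0)  # false positives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

# True positive rate: share of actual positives the model caught
tpr <- tp / (tp + fn)  # 3 / 4 = 0.75

# False positive rate: share of actual negatives wrongly flagged
fpr <- fp / (fp + tn)  # 1 / 6

# The same numbers phrased for a non-technical audience
message(sprintf("Caught %d of %d real cases.", tp, tp + fn))
message(sprintf("Raised %d false alarm(s) across %d non-cases.", fp, fp + tn))
```

Stating the counts ("caught 3 of 4 real cases") rather than the rates is one possible way to tell the story quickly; which framing works best will depend on the audience.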

We must have people who can lead these efforts and influence their organization. If a degree in history or music appreciation helps in that endeavor, then so be it. I study history every day, and it has helped me tremendously. Cuban's comments have reinforced my belief that in many ways, the first chapter is the most important in this book. If you are not asking your business partners "what they plan to do differently", you'd better start tomorrow. There are far too many people working far too hard to complete an analysis that is completely irrelevant to the organization and its decisions.

What this book covers

Here is a list of changes from the first edition by chapter:

Chapter 1, A Process for Success, has the flowchart redone to correct a typo and to add additional methodologies.

Chapter 2, Linear Regression – the Blocking and Tackling of Machine Learning, has the code improved, and better charts have been provided; other than that, it remains relatively close to the original.

Chapter 3, Logistic Regression and Discriminant Analysis, has the code improved and streamlined. One of my favorite techniques, multivariate adaptive regression splines, has been added; it performs well, handles non-linearity, and is easy to explain. It is my base model, with others becoming "challengers" to try and outperform it.

Chapter 4, Advanced Feature Selection in Linear Models, has techniques not only for regression but also for a classification problem included.

Chapter 5, More Classification Techniques – K-Nearest Neighbors and Support Vector Machines, has the code streamlined and simplified.

Chapter 6, Classification and Regression Trees, has the addition of the very popular techniques provided by the XGBOOST package. Additionally, I added the technique of using random forest as a feature selection tool.

Chapter 7, Neural Networks and Deep Learning, has been updated with additional information on deep learning methods and has improved code for the H2O package, including hyper-parameter search.

Chapter 8, Cluster Analysis, has the methodology of doing unsupervised learning with random forests added.

Chapter 9, Principal Components Analysis, uses a different dataset, and an out-of-sample prediction has been added.

Chapter 10, Market Basket Analysis, Recommendation Engines, and Sequential Analysis, has the addition of sequential analysis, which, I'm discovering, is more and more important, especially in marketing.

Chapter 11, Creating Ensembles and Multiclass Classification, has completely new content, using several great packages.

Chapter 12, Time Series and Causality, has a couple of additional years of climate data added, along with a demonstration of different methods of causality testing.

Chapter 13, Text Mining, has additional data and improved code.

Chapter 14, R on the Cloud, is another chapter of new content, allowing you to get R on the cloud, simply and quickly.

Appendix A, R Fundamentals, has additional data manipulation methods.

Appendix B, Sources, has a list of sources and references.

What you need for this book

As R is free and open source software, you will only need to download and install it from https://www.r-project.org/. Although it is not mandatory, it is highly recommended that you also download the RStudio IDE from https://www.rstudio.com/products/RStudio/.
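Once R is installed, a quick check from the R console can confirm which version is running. The sketch below compares it against R 3.3.2, the version named in the book's overview. Treating 3.3.2 as a minimum is an assumption made for this example, not a stated requirement, and the variable names are illustrative.

```r
# Version named in the book overview; used here as an assumed minimum
min_version <- "3.3.2"

# getRversion() returns the running R version in a comparable form
installed_version <- getRversion()
version_ok <- installed_version >= min_version

message("Installed R version: ", as.character(installed_version))
if (!version_ok) {
  message("Older than ", min_version, "; some examples may behave differently.")
}
```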

Who this book is for

This book is for data science professionals, data analysts, and anyone else with a working knowledge of machine learning with R who now wants to take their skills to the next level and become an expert in the field.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The data frame is available in the R MASS package under the biopsy name."

Any command-line input or output is written as follows:

 > bestglm(Xy = biopsy.cv, IC="CV", 
   CVArgs=list(Method="HTF", K=10, 
   REP=1), family=binomial)

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register on our website using your e-mail address and password.
Hover the mouse pointer over the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book.
Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Machine-Learning-with-R-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringMachineLearningwithRSecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report it to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.