

Overview


If you haven't been exposed to large, messy datasets, then be patient, for it's only a matter of time. If you've encountered such data, has it been in a domain where you have little subject matter expertise? If not, then once again I proffer that it's only a matter of time. Some of the common problems that fall under the term messy data include the following (a short R sketch for spotting them follows the list):

  • Missing or invalid values
  • Novel levels in a categorical feature that only show up once the algorithm is in production
  • High cardinality in categorical features such as zip codes
  • High dimensionality
  • Duplicate observations
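
To make these problems concrete, here's a minimal R sketch for spotting them in a generic data frame; the name df is just a placeholder and not a dataset used in this chapter:

  # Quick checks for the problems listed above on a placeholder data frame, df
  library(dplyr)

  colSums(is.na(df))       # missing (NA) values per feature

  sum(duplicated(df))      # count of duplicate observations

  # cardinality of each character feature; high counts flag features such as zip codes
  df %>% summarise(across(where(is.character), n_distinct))

  dim(df)                  # rows versus columns as a quick read on dimensionality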

So this raises the question: what are we to do? Well, first we need to look at the critical tasks that need to be performed during this phase of the process. The following tasks serve as the foundation for building a learning algorithm. They're from the paper by SPSS, CRISP-DM 1.0, a step-by-step data-mining guide available at https://the-modeling-agency.com/crisp-dm.pdf:

  • Data understanding:
    1. Collect
    2. Describe
    3. Explore
    4. Verify
  • Data preparation:
    1. Select
    2. Clean
    3. Construct
    4. Integrate
    5. Format

Certainly this is an excellent enumeration of the process, but what do we really need to do? I propose that, in practical terms that we can all relate to, the following must be done once the data is joined and loaded into your machine, cloud, or whatever you use (a rough R sketch of these tasks follows the list):

  • Understand the data structure
  • Dedupe observations
  • Eliminate zero variance features and low variance features as desired
  • Handle missing values
  • Create dummy features (one-hot encoding)
  • Examine and deal with highly correlated features and those with perfect linear relationships
  • Scale as necessary
  • Create other features as desired
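
To give a rough sense of how these tasks might translate into code, here's a sketch that leans on the caret and dplyr packages; the data frame name df, the median imputation, and the 0.9 correlation cutoff are illustrative assumptions rather than the approach this chapter prescribes:

  # A rough sketch of the checklist above; df and the specific thresholds are
  # illustrative assumptions, not the chapter's prescribed workflow
  library(caret)
  library(dplyr)

  str(df)                                      # understand the data structure

  df <- distinct(df)                           # dedupe observations

  nzv <- nearZeroVar(df)                       # zero and near-zero variance features
  if (length(nzv) > 0) df <- df[, -nzv]

  # simple median imputation for numeric features (categorical NAs need their own treatment)
  num_cols <- sapply(df, is.numeric)
  df[num_cols] <- lapply(df[num_cols], function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
  })

  # one-hot encode the categorical features
  dummies <- dummyVars(~ ., data = df, fullRank = TRUE)
  df <- as.data.frame(predict(dummies, newdata = df))

  # drop one of each pair of highly correlated features
  high_cor <- findCorrelation(cor(df), cutoff = 0.9)
  if (length(high_cor) > 0) df <- df[, -high_cor]

  # center and scale as necessary
  prep <- preProcess(df, method = c("center", "scale"))
  df <- predict(prep, df)

Note that the order matters here: one-hot encoding comes before the correlation check so that every column is numeric by the time cor() is called.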

Many feel that this is a daunting task. I don't and, in fact, I quite enjoy it. If done correctly and with sound judgment, it should reduce the amount of time spent at this first stage of a project and facilitate training your learning algorithm. None of the previous steps is challenging, but it can take quite a bit of time to write the code to perform each task.

Well, that's the benefit of this chapter. The example to follow will walk you through the tasks and the R code that accomplishes them. The code is flexible enough that you should be able to apply it to your own projects. Additionally, it will help you gain an understanding of the data to the point where you can intelligently discuss it with Subject Matter Experts (SMEs) if, in fact, they're available.

In the practical exercise that follows, we'll work with a small dataset. However, it suffers from all of the problems described earlier. Don't let the small size fool you, as we'll take what we learn here and use it for the more massive datasets to come in subsequent chapters.

As background, the data we'll use is something I put together painstakingly by hand. It's the Order of Battle for the opposing armies at the Battle of Gettysburg, fought during the American Civil War, July 1st-3rd, 1863, and the casualties reported by the end of the day on July 3rd. I purposely chose this data because I'm reasonably sure you know very little about it. Don't worry, I'm the SME on the battle here and will walk you through it every step of the way. The one thing we won't cover in this chapter is dealing with large volumes of textual features, which we'll discuss later in this book. Enough said already; let's get started!

Note

The source used in the creation of the dataset is The Gettysburg Campaign in Numbers and Losses: Synopses, Orders of Battle, Strengths, Casualties, and Maps, June 9-July 14, 1863, by J. David Petruzzi and Steven A. Stanley.