What do I mean when I say let's treat the data? I learned the term from the authors of the vtreat package, Nina Zumel and John Mount. You can read their excellent paper on the subject at this link: https://arxiv.org/pdf/1611.09477.pdf.
The definition they provide is: a processor or conditioner that prepares real-world data for predictive modeling in a statistically sound manner. In treating your data, you'll rid yourself of many of the data preparation headaches discussed earlier. Our current dataset will provide an excellent introduction to the benefits of this method and how you can tailor it to your needs. I like to think of treating your data as a smarter version of one-hot encoding.
The package offers three different functions to treat data, but I only use one of them: designTreatmentsZ(), which treats the features without regard to an outcome or response. The designTreatmentsC() and designTreatmentsN() functions build dataframes based on categorical and numeric outcomes respectively, and they provide a method to prune features in a univariate fashion. Since I'll cover other ways of conducting feature selection later, that's why I use this specific function. I encourage you to experiment on your own.
The function we use in the following will produce an object that you can apply to training, validation, testing, and even production data. In later chapters, we'll focus on training and testing, but here, for simplicity, let's treat the entire dataset without considering any splits. There are a number of arguments in the function you can change, but the defaults are usually sufficient. We'll specify the input data, the feature names to include, and minFraction, which the package defines as the optional minimum frequency a categorical level must have to be converted into an indicator column. I've chosen 5% as the minimum frequency. In real-world data, I've seen this number altered many times to find the right level of occurrence:
my_treatment <- vtreat::designTreatmentsZ(
  dframe = gettysburg_fltrd,
  varlist = colnames(gettysburg_fltrd),
  minFraction = 0.05
)
We now have an object with a stored treatment plan. We simply apply the prepare() function to a dataframe or tibble, and it'll give us a treated dataframe:
gettysburg_treated <- vtreat::prepare(my_treatment, gettysburg_fltrd)
dim(gettysburg_treated)
The output of the preceding code is as follows:
[1] 587 54
We now have 54 features. Let's take a look at their names:
colnames(gettysburg_treated)
The abbreviated output of the preceding code is as follows:
[1] "type_catP"                "state_catP"               "regiment_or_battery_catP"
[4] "brigade_catP"             "division_catP"            "corps_catP"
As you explore the names, you'll notice that we have features ending in catP, clean, and isBAD, and others with _lev_x_ in them. Let's cover each in detail. As for the catP features, the function creates a feature that's the frequency of the categorical level in that observation. What does that mean? Let's see a table for type_catP:
table(gettysburg_treated$type_catP)
The output of the preceding code is as follows:
0.080068143100 0.21976149914 0.70017035775
            47           129           411
This tells us that 47 rows are of category level x (in this case, Cavalry), which is 8% of the total observations. Likewise, 22% are Artillery and 70% are Infantry. This can be helpful in further exploring your data and in adjusting the minimum frequency for your category levels. I've heard it discussed that these values could help in the creation of a distance or similarity matrix.
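To make the catP idea concrete, here's a minimal sketch of the same calculation done by hand on a toy vector (the data is made up, and vtreat's internals may differ):

```r
# Toy sketch of a catP-style column: each row gets the relative
# frequency of its own categorical level
x <- c("Infantry", "Infantry", "Cavalry", "Artillery", "Infantry")
freq <- table(x) / length(x)   # proportion of rows in each level
x_catP <- as.numeric(freq[x])  # map each row to its level's frequency
x_catP
# Infantry rows get 0.6; the Cavalry and Artillery rows get 0.2
```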
The next suffix is clean. These are our numeric features with missing values imputed using the feature mean, and with outliers winsorized or collared if you specified that argument in the prepare() function. We didn't, so only the missing values were imputed.
Note
Here's an interesting blog post on the merits of winsorizing from SAS: https://blogs.sas.com/content/iml/2017/02/08/winsorization-good-bad-and-ugly.html.
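If you're curious what winsorizing actually does, here's a rough sketch using a hand-rolled helper of my own (not vtreat's implementation): values beyond chosen quantiles get clamped to those quantile bounds:

```r
# Hand-rolled winsorizer (illustrative only): clamp anything outside
# the 5th and 95th percentiles to those bounds
winsorize <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs, na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}
v <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 100)  # one extreme outlier
max(winsorize(v))  # the 100 is pulled down to the 95th percentile
```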
Speaking of missing values, this brings us to isBAD. This is the missing-value indicator we talked about earlier, the one I coded manually: 1 if the value is missing and 0 if it isn't.
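Coded by hand on toy numbers, the indicator (together with the mean imputation the clean features received) amounts to this:

```r
# Manual version of the isNA indicator plus mean imputation
killed <- c(12, NA, 7, NA, 3)
killed_isNA <- ifelse(is.na(killed), 1, 0)
killed_imputed <- ifelse(is.na(killed),
                         mean(killed, na.rm = TRUE), killed)
killed_isNA     # 0 1 0 1 0
killed_imputed  # NAs replaced by the mean of the observed values
```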
Finally, lev_x is the dummy feature coding for a specific categorical level. If you go through the levels that were one-hot encoded for state, you'll find features for Georgia, New York, North Carolina, Pennsylvania, US (this is US Regular Army units), and Virginia.
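A hand-rolled version of those indicator columns might look like the following on toy data (I'm mimicking the minFraction rule of dropping rare levels; the naming is my own approximation):

```r
# Build _lev_x_-style dummy columns, keeping only levels that appear
# in at least 20% of rows (mimicking minFraction)
state <- c("Virginia", "New_York", "Virginia", "Georgia", "Virginia")
keep <- names(which(table(state) / length(state) >= 0.2))
dummies <- sapply(keep, function(lv) as.integer(state == lv))
colnames(dummies) <- paste0("state_lev_x_", keep)
dummies
```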
My preference is to remove the catP features, strip _clean from the feature names, and change isBAD to isNA. This is a simple task with these lines of code:
gettysburg_treated <- gettysburg_treated %>%
  dplyr::select(-dplyr::contains('_catP'))
colnames(gettysburg_treated) <- sub('_clean', "", colnames(gettysburg_treated))
colnames(gettysburg_treated) <- sub('_isBAD', "_isNA", colnames(gettysburg_treated))
Are we ready to start building learning algorithms? Well, not quite yet. In the next section, we'll deal with highly correlated and linearly related features.
For this task, we return to our old friend the caret package. We'll start by creating a correlation matrix using the Spearman rank method, then apply the findCorrelation() function to flag all correlations above 0.9:
df_corr <- cor(gettysburg_treated, method = "spearman")
high_corr <- caret::findCorrelation(df_corr, cutoff = 0.9)
Note
Why Spearman versus Pearson correlation? Spearman is free from any distribution assumptions and is robust enough for any task at hand: http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/.
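One way to see the difference: Spearman correlation is just Pearson correlation computed on the ranks of the data, so a monotone but nonlinear relationship still scores perfectly. A quick illustration on made-up data:

```r
# Spearman equals Pearson on ranks, so it captures monotone
# relationships that plain Pearson understates
set.seed(123)
a <- rexp(100)                  # skewed predictor
b <- a^3                        # monotone, nonlinear transform
cor(a, b, method = "spearman")  # exactly 1: the ranks are identical
cor(rank(a), rank(b))           # the same number, by definition
cor(a, b, method = "pearson")   # noticeably less than 1
```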
The high_corr object is a list of integers that correspond to feature column numbers. Let's dig deeper into this:
high_corr
The output of the preceding code is as follows:
[1] 9 4 22 43 3 5
The column indices refer to the following feature names:
colnames(gettysburg_treated)[c(9, 4, 22, 43, 3, 5)]
The output of the preceding code is as follows:
[1] "total_casualties"       "wounded"                "type_lev_x_Artillery"
[4] "army_lev_x_Confederate" "killed_isNA"            "wounded_isNA"
These are the features that are highly correlated with some other feature. For instance, army_lev_x_Confederate is perfectly and negatively correlated with army_lev_x_Union. After all, there were only two armies here, and Colonel Fremantle of the British Coldstream Guards was merely an observer. To delete these features, just filter your dataframe by the list we created:
gettysburg_noHighCorr <- gettysburg_treated[, -high_corr]
There you go, they're now gone. But wait! That seems a little too clinical, and maybe we should apply our judgment or the judgment of an SME to the problem? As before, let's create a tibble for further exploration:
df_corr <- data.frame(df_corr)
df_corr$feature1 <- row.names(df_corr)
gettysburg_corr <- tidyr::gather(data = df_corr, key = "feature2",
                                 value = "correlation", -feature1)
gettysburg_corr <- gettysburg_corr %>%
  dplyr::filter(feature1 != feature2)
What just happened? First of all, the correlation matrix was turned into a dataframe. Then, the row names became the values for the first feature. Using tidyr, the code created the second feature, placed the appropriate correlation value with each observation, and finally filtered out the trivial self-correlations. This screenshot shows the results. You can see that the Confederate and Union armies have a perfect negative correlation:
You can see that it would be safe to dedupe on correlation as we did earlier. I like to save this to a spreadsheet and work with SMEs to understand what features we can drop or combine and so on.
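My usual workflow for that SME conversation is just a CSV export, sorted so the strongest pairs sit on top. Here's a sketch on toy rows with made-up correlation values and a file name of my choosing; in practice you'd write out the gettysburg_corr tibble built above:

```r
# Export correlation pairs for SME review, strongest first
pairs <- data.frame(
  feature1    = c("army_lev_x_Confederate", "killed", "wounded"),
  feature2    = c("army_lev_x_Union", "total_casualties", "total_casualties"),
  correlation = c(-1.00, 0.85, 0.92)  # illustrative values only
)
pairs <- pairs[order(-abs(pairs$correlation)), ]
write.csv(pairs, "correlation_pairs_for_sme.csv", row.names = FALSE)
```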
After handling the correlations, I recommend exploring linear combinations and removing them as needed. Dealing with these combinations follows a similar methodology to high correlations:
linear_combos <- caret::findLinearCombos(gettysburg_noHighCorr)
linear_combos
The output of the preceding code is as follows:
$`linearCombos`
$`linearCombos`[[1]]
 [1] 16  7  8  9 10 11 12 13 14 15

$remove
[1] 16
The output tells us that feature column 16
is linearly related to those others, and we can solve the problem by removing it. What are these feature names? Let's have a look:
colnames(gettysburg_noHighCorr)[c(16, 7, 8, 9, 10, 11, 12, 13, 14, 15)]
The output of the preceding code is as follows:
[1] "total_guns"       "X3inch_rifles"    "X10lb_parrots"    "X12lb_howitzers"  "X12lb_napoleons"
[6] "X6lb_howitzers"   "X24lb_howitzers"  "X20lb_parrots"    "X12lb_whitworths" "X14lb_rifles"
Removing the "total_guns" feature will solve the problem. This makes total sense, since it's simply the number of guns in an artillery battery. Most batteries, especially in the Union, had only one type of gun. Even with multiple linear combinations, it's an easy task to get rid of the necessary features with this bit of code:
linear_remove <- colnames(gettysburg_noHighCorr[16])
df <- gettysburg_noHighCorr[, !(colnames(gettysburg_noHighCorr) %in% linear_remove)]
dim(df)
The output of the preceding code is as follows:
[1] 587 39
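Before we wrap up, a quick sanity check on why dropping total_guns was safe. The dependency is easy to reproduce on toy numbers (the column names and counts here are illustrative): a total column that's the row sum of its parts is, by construction, a linear combination of them:

```r
# total_guns is exactly the row sum of the gun-type columns, so it
# adds no information the other columns don't already carry
guns <- data.frame(
  napoleons = c(4, 0, 6, 2),
  parrots   = c(2, 4, 0, 0),
  howitzers = c(0, 2, 0, 4)
)
guns$total_guns <- rowSums(guns)
all(guns$total_guns == guns$napoleons + guns$parrots + guns$howitzers)  # TRUE
```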
There you have it: a nice clean dataframe of 587 observations and 39 features. Depending on the modeling, you may have to scale this data or perform other transformations, but data in this format makes all of that easier. Regardless of your prior knowledge of, or interest in, one of the most important battles in history, and the bloodiest on American soil, you've developed a workable understanding of the Order of Battle and the casualties at the regimental or battery level. Start treating your data, not next week or next month, but right now!
Note
If you desire, you can learn more about the battle here: https://www.battlefields.org/learn/civil-war/battles/gettysburg.