Advanced Machine Learning with R

By : Cory Lesmeister, Dr. Sunil Kumar Chinnamgari

Overview of this book

R is one of the most popular languages when it comes to exploring the mathematical side of machine learning and easily performing computational statistics. This Learning Path shows you how to leverage the R ecosystem to build efficient machine learning applications that carry out intelligent tasks within your organization. You’ll work through realistic projects such as building powerful machine learning models with ensembles to predict employee attrition. Next, you’ll explore different clustering techniques to segment customers using wholesale data and even apply TensorFlow and Keras-R for performing advanced computations. Each chapter will help you implement advanced machine learning algorithms using real-world examples. You’ll also be introduced to reinforcement learning along with its use cases and models. Finally, this Learning Path will provide you with a glimpse into how some of these black box models can be diagnosed and understood. By the end of this Learning Path, you’ll be equipped with the skills you need to deploy machine learning techniques in your own projects.

Handling duplicate observations

The easiest way to get started is to use the base R duplicated() function to create a vector of logical values that match the data observations. These values will consist of either TRUE or FALSE where TRUE indicates a duplicate. Then, we'll create a table of those values and their counts and identify which of the rows are dupes:

dupes <- duplicated(gettysburg)

table(dupes)
dupes
FALSE TRUE
  587    3

which(dupes)
[1] 588 589
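
To see how these pieces fit together on data you can check by eye, here's a minimal sketch using a small, made-up dataframe. The toy_units object and its values are hypothetical and are not taken from the gettysburg data:

```r
# Hypothetical toy data: rows 4 and 5 repeat rows 1 and 2 exactly
toy_units <- data.frame(
  army = c("Union", "Confederate", "Union", "Union", "Confederate"),
  type = c("Infantry", "Cavalry", "Artillery", "Infantry", "Cavalry"),
  men  = c(300L, 150L, 90L, 300L, 150L),
  stringsAsFactors = FALSE
)

# TRUE marks a row that repeats an earlier row; the first occurrence stays FALSE
toy_dupes <- duplicated(toy_units)

# Counts of unique (FALSE) versus duplicate (TRUE) rows
table(toy_dupes)

# Row positions of the duplicates
which(toy_dupes)

# The duplicate rows themselves
toy_units[toy_dupes, ]
```

Keep in mind that duplicated() flags only the later copies, so the first appearance of each row is never counted as a dupe.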

Note

If you want to see the actual rows and even put them into a tibble dataframe, the janitor package has the get_dupes() function. The code for that would be simply: df_dupes <- janitor::get_dupes(gettysburg).

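To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use, specifying .keep_all = TRUE to make sure we return all of the features in the new tibble. Note that .keep_all defaults to FALSE, although it only changes the result when you name specific columns in distinct(); with no columns named, as here, entire rows are compared and every feature is returned either way: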

gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)
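
The effect of .keep_all is easier to see on a small example. The following is a sketch only, assuming dplyr is installed; toy_units and its values are hypothetical:

```r
# Hypothetical toy data: row 2 is an exact copy of row 1,
# and row 4 shares army and type with row 1 but differs in men
toy_units <- data.frame(
  army = c("Union", "Union", "Confederate", "Union"),
  type = c("Infantry", "Infantry", "Cavalry", "Infantry"),
  men  = c(300L, 300L, 150L, 280L),
  stringsAsFactors = FALSE
)

# No columns named: whole rows are compared and every column is returned
whole_rows <- dplyr::distinct(toy_units)

# Columns named, default .keep_all = FALSE: only the named columns come back
keys_only <- dplyr::distinct(toy_units, army, type)

# Columns named, .keep_all = TRUE: the first row for each army/type pair,
# with all of the other columns kept
first_rows <- dplyr::distinct(toy_units, army, type, .keep_all = TRUE)

dim(whole_rows)
dim(keys_only)
dim(first_rows)
```

When you name columns, .keep_all = TRUE silently keeps the first matching row, so check that the rows you discard really are redundant.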

Notice that, in the Global Environment, the tibble now has dimensions of 587 observations and 26 variables/features.

With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.

Descriptive statistics

Traditionally, we could use the base R summary() function to identify some basic statistics. More recently, I've come to prefer the sjmisc package and its descr() function. It produces more readable output, and you can assign that output to a dataframe. What works well is to create that dataframe, save it as a .csv file, and explore it at your leisure. It automatically selects numeric features only. It also fits well with tidyverse, so you can incorporate dplyr functions such as group_by() and filter(). Here's an example in our case where we examine the descriptive stats for the infantry of the Confederate Army. The output will consist of the following:

var: feature name
type: data type of the feature (integer, for example)
n: number of observations
NA.prc: percent of missing values
mean
sd: standard deviation
se: standard error
md: median
trimmed: trimmed mean
range
skew

# Keep only the Confederate infantry units, then compute descriptive statistics
descr_stats <- gettysburg %>%
  dplyr::filter(army == "Confederate" & type == "Infantry") %>%
  sjmisc::descr()

readr::write_csv(descr_stats, 'descr_stats.csv')
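
If you'd like to see what sits behind those columns, here's a rough base R sketch that computes similar statistics by hand. It isn't how sjmisc does it internally: the toy_units data is hypothetical, the 10% trim and the moment-based skewness formula are illustrative assumptions, and min and max stand in for the range column:

```r
# Hypothetical toy data with one missing value
toy_units <- data.frame(
  total_casualties = c(120, 85, NA, 210, 40),
  killed           = c(20, 10, 5, 45, 3)
)

# Simple moment-based skewness; an illustrative choice that may differ
# from the estimator sjmisc uses
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# One row of descriptive statistics for a single numeric vector
describe_one <- function(x) {
  obs <- x[!is.na(x)]
  data.frame(
    n       = length(obs),                  # non-missing observations
    NA.prc  = 100 * mean(is.na(x)),         # percent missing
    mean    = mean(obs),
    sd      = sd(obs),
    se      = sd(obs) / sqrt(length(obs)),  # standard error of the mean
    md      = median(obs),
    trimmed = mean(obs, trim = 0.1),        # assumed 10% trim
    min     = min(obs),
    max     = max(obs),
    skew    = skewness(obs)
  )
}

# Apply to every column and stack the results, one row per feature
toy_stats <- do.call(rbind, lapply(toy_units, describe_one))
toy_stats <- cbind(var = rownames(toy_stats), toy_stats)
rownames(toy_stats) <- NULL
toy_stats
```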

The following is abbreviated output from the preceding code saved to a spreadsheet:

In this one table, we can discern some rather interesting tidbits. Of particular interest is the percentage of missing values per feature. If you modify the previous code to examine the Union Army, you'll find that there are no missing values. The usurpers from the South had missing values for one of two reasons: either shoddy staff work in compiling the numbers on July 3rd, or the records were lost over the years. Note that, for the number of men captured, if you remove the missing value, all other values are zero, so we could just replace the missing value with zero. The Rebels did not report troops as captured, but rather as missing, in contrast with the Union.
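
The reasoning about the captured feature can be checked in code before anything is replaced. This is a minimal sketch on a hypothetical vector, toy_captured, not the actual gettysburg column:

```r
# Hypothetical stand-in for a feature where every recorded value is zero
toy_captured <- c(0, 0, NA, 0, 0)

# Confirm that every observed (non-missing) value is zero before imputing
all_observed_zero <- all(toy_captured[!is.na(toy_captured)] == 0)

# Replace the missing entries with zero only if that check passes
if (all_observed_zero) {
  toy_captured[is.na(toy_captured)] <- 0
}

toy_captured
```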

Once you feel comfortable with the descriptive statistics, move on to exploring the categorical features in the next section.

Exploring categorical variables

There are many ways to get to know your categorical variables. We can easily use the base R table() function on a feature. If you just want to see how many distinct levels are in a feature, then dplyr works well. In this example, we examine type, which has three unique levels:

dplyr::count(gettysburg, dplyr::n_distinct(type))

The output of the preceding code is as follows:

# A tibble: 1 x 2
  `dplyr::n_distinct(type)`     n
                      <int> <int>
                          3   587
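
For comparison, here's how the base R route mentioned above might look. It's a sketch on a hypothetical character vector, toy_type, rather than the real feature:

```r
# Hypothetical stand-in for a categorical feature
toy_type <- c("Infantry", "Cavalry", "Artillery",
              "Infantry", "Infantry", "Cavalry")

# Observations per level
type_counts <- table(toy_type)
type_counts

# Number of distinct levels
n_levels <- length(unique(toy_type))
n_levels
```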

Let's now look at a way to explore all of the categorical features utilizing tidyverse principles. Doing it this way always allows you to save the tibble and examine the results in depth as needed. Here is a way of putting all categorical features into a separate tibble:

gettysburg_cat <-
  gettysburg[, sapply(gettysburg, class) == 'character']

Using dplyr, you can now summarize all of the features and the number of distinct levels in each:

# funs() is deprecated as of dplyr 0.8.0, so pass the function directly
gettysburg_cat %>%
  dplyr::summarise_all(dplyr::n_distinct)

The output of the preceding code is as follows:

# A tibble: 1 x 9
   type state regiment_or_battery brigade division corps  army july1_Commander Cdr_casualty
  <int> <int>               <int>   <int>    <int> <int> <int>           <int>        <int>
      3    30                 275     124       38    14     2             586            6

Notice that there are 586 distinct values for july1_Commander across 587 observations. This means that two of the unit commanders have the same rank and last name. We can also surmise that this feature will be of no value to any further analysis, but we'll deal with that issue in a couple of sections ahead.
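
If you're curious which value is shared, one way to track it down is sketched here in base R. The toy_cat dataframe and the commander names in it are made up for illustration:

```r
# Hypothetical toy data: two units list the same commander name
toy_cat <- data.frame(
  army      = c("Union", "Union", "Confederate", "Confederate"),
  commander = c("Col. Smith", "Col. Jones", "Col. Smith", "Maj. Brown"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column
distinct_counts <- sapply(toy_cat, function(x) length(unique(x)))
distinct_counts

# Values that occur more than once in the commander column
name_counts  <- table(toy_cat$commander)
shared_names <- names(name_counts[name_counts > 1])
shared_names

# The rows that carry a shared name
toy_cat[toy_cat$commander %in% shared_names, ]
```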

Suppose we're interested in the number of observations for each of the levels for the Cdr_casualty feature. Yes, we could use table(), but how about producing the output as a tibble as discussed before? Give this code a try:

gettysburg_cat %>%
  dplyr::group_by(Cdr_casualty) %>%
  dplyr::summarize(num_rows = dplyr::n())

The output of the preceding code is as follows:

# A tibble: 6 x 2
 Cdr_casualty                    num_rows
    <chr>                           <int>
 1 captured                            6
 2 killed                             29
 3 mortally wounded                   24
 4 no                                405
 5 wounded                           104
 6 wounded-captured                   19
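
As an aside, dplyr also offers count() as a shorthand for the group-and-summarize pattern. Here's a sketch on hypothetical data, assuming dplyr is installed; toy_cat and its values are made up:

```r
# Hypothetical toy data
toy_cat <- data.frame(
  Cdr_casualty = c("no", "wounded", "no", "killed", "no", "wounded"),
  stringsAsFactors = FALSE
)

# count() groups by the named column and tallies rows in one step;
# the name argument sets the name of the count column
casualty_counts <- dplyr::count(toy_cat, Cdr_casualty, name = "num_rows")
casualty_counts
```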

Speaking of tables, let's look at a tibble-friendly way of producing one using two features. This code takes the idea of comparing commander casualties by army:

gettysburg_cat %>%
  janitor::tabyl(army, Cdr_casualty)

The output of the preceding code is as follows:

        army captured killed mortally wounded  no wounded wounded-captured
 Confederate        2     15               13 165      44               17
       Union        4     14               11 240      60                2
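
If janitor isn't available, a similar two-way table can be sketched in base R. The toy_cat data here is hypothetical, and the row proportions at the end are an optional extra:

```r
# Hypothetical toy data
toy_cat <- data.frame(
  army         = c("Union", "Union", "Confederate", "Confederate", "Union"),
  Cdr_casualty = c("no", "wounded", "no", "killed", "no"),
  stringsAsFactors = FALSE
)

# Two-way table of counts: armies in rows, casualty levels in columns
cross_tab <- table(toy_cat$army, toy_cat$Cdr_casualty)
cross_tab

# The same counts as a wide dataframe that can be saved or joined
cross_df <- as.data.frame.matrix(cross_tab)
cross_df

# Share of each casualty level within an army (each row sums to 1)
row_props <- prop.table(cross_tab, margin = 1)
row_props
```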

Explore the data on your own and, once you're comfortable with the categorical variables, let's tackle the issue of missing values.
