Applied Supervised Learning with R

By: Karthik Ramasubramanian, Jojo Moolayil

Overview of this book

R provides excellent visualization features that are essential for exploring data before using it in automated learning. Applied Supervised Learning with R helps you cover the complete process of employing R to develop applications using supervised machine learning algorithms for your business needs. The book starts by helping you develop your analytical thinking to create a problem statement using business inputs and domain research. You will then learn different evaluation metrics that compare various algorithms, and later progress to using these metrics to select the best algorithm for your problem. After finalizing the algorithm you want to use, you will study the hyperparameter optimization technique to fine-tune your set of optimal parameters. The book demonstrates how you can add different regularization terms to avoid overfitting your model. By the end of this book, you will have gained the advanced skills you need for modeling a supervised machine learning algorithm that precisely fulfills your business needs.

Holdout Approach/Validation


This is the simplest approach to validating model performance, though not the most recommended one. We have used it in the previous chapters of this book to test our models. Here, we randomly divide the available dataset into training and testing datasets. The most common split ratios between the training and testing datasets are 70:30 and 80:20.
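As a minimal sketch of such a split, the following uses base R only. The built-in `iris` data frame, the seed value, and the variable names are assumptions made for illustration; they are not the dataset or code used in the book's exercise.

```r
# Minimal 70:30 holdout split in base R.
# Assumption: iris stands in for whatever dataset is being modelled.
set.seed(2019)                                  # fix the RNG so the split is reproducible

data <- iris
n <- nrow(data)

# Draw 70% of the row indices at random, without replacement
train_idx <- sample(n, size = round(0.7 * n))

train <- data[train_idx, ]                      # rows used to fit the model
test  <- data[-train_idx, ]                     # remaining rows, held out for evaluation

nrow(train)
nrow(test)
```

Changing `0.7` to `0.8` gives the 80:20 variant. Because the indices are sampled without replacement and the test set is taken by negative indexing, no row appears in both sets.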

The major drawback of this approach is that model performance is evaluated on a single fraction of the data, which might not be representative of how the model performs in general. The evaluation depends entirely on how the random split falls, that is, on which data points end up in the training and testing datasets. Different splits can therefore lead to significantly different results, so the performance estimate has high variance.
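One way to see this sensitivity is to repeat the holdout split with different random seeds and compare the resulting test errors. The sketch below does so under stated assumptions: the built-in `mtcars` data and a simple linear model (`lm`) are stand-ins chosen to keep the example dependency-free, and `holdout_rmse` is a hypothetical helper, not a function from the book or from any package.

```r
# Repeat a 70:30 holdout split with different seeds and record the test RMSE.
# Assumptions: mtcars and lm() are illustrative stand-ins, not the book's
# dataset or model; holdout_rmse() is a hypothetical helper.
holdout_rmse <- function(seed, data = mtcars, train_frac = 0.7) {
  set.seed(seed)                                   # a different seed gives a different split
  n <- nrow(data)
  train_idx <- sample(n, size = round(train_frac * n))

  fit  <- lm(mpg ~ wt + hp, data = data[train_idx, ])  # fit on the training rows only
  test <- data[-train_idx, ]                           # held-out rows
  pred <- predict(fit, newdata = test)

  sqrt(mean((test$mpg - pred)^2))                  # root mean squared error on the test set
}

# One RMSE estimate per split
rmse_by_split <- sapply(1:20, holdout_rmse)

summary(rmse_by_split)   # range of estimates across splits
sd(rmse_by_split)        # spread caused purely by the choice of split
```

The model and the data are identical in every run; only the split changes. Any spread in `rmse_by_split` is therefore the variance of the holdout estimate itself, which tends to be larger for small datasets such as this one.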

Figure 7.3: Holdout validation

The following exercise divides the dataset into 70% training and 30% testing, and builds a random forest model on the training dataset...