Applied Supervised Learning with R

By: Karthik Ramasubramanian, Jojo Moolayil

Overview of this book

R provides excellent visualization features that are essential for exploring data before using it in machine learning. Applied Supervised Learning with R covers the complete process of using R to develop applications with supervised machine learning algorithms for your business needs. The book starts by helping you develop your analytical thinking to create a problem statement using business inputs and domain research. You will then learn different evaluation metrics for comparing algorithms, and later progress to using these metrics to select the best algorithm for your problem. After finalizing the algorithm you want to use, you will study hyperparameter optimization techniques to fine-tune the model's parameters. The book demonstrates how you can add different regularization terms to avoid overfitting your model. By the end of this book, you will have gained the advanced skills you need for modeling a supervised machine learning algorithm that precisely fulfills your business needs.

Feature Selection


While feature engineering rectifies data quality issues, feature selection determines the set of features that improves the performance of the model. Feature selection techniques identify the features that contribute the most to the predictive ability of the model. Features of low importance can inhibit the model's ability to learn from the more informative independent variables.

Feature selection offers benefits such as:

  • Reducing overfitting

  • Improving accuracy

  • Reducing the time to train the model

Univariate Feature Selection

A statistical test such as the chi-squared test is a popular way to select features that have a strong relationship with the dependent (target) variable. It works mainly on categorical features in a classification problem, so a numerical variable must first be converted into a categorical one through discretization (binning).
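
As a minimal sketch of this idea (not the book's own example), the following R code discretizes two numeric features into quartile bins with `cut()` and runs base R's `chisq.test()` against a categorical target. The data is synthetic, and the names `related`, `noise`, and `chi_sq_p_value` are hypothetical, introduced only for illustration.

```r
# Minimal sketch: chi-squared test between a discretized numeric feature
# and a categorical target. All data and names here are hypothetical.
set.seed(42)

n <- 500
# Synthetic two-class target variable
target <- factor(sample(c("no", "yes"), n, replace = TRUE))

# 'related' shifts with the target; 'noise' is generated independently of it
related <- rnorm(n, mean = ifelse(target == "yes", 1.5, 0))
noise <- rnorm(n)

# Discretize a numeric feature into equal-frequency bins, then test it
# against the target using a contingency table of bin vs. class
chi_sq_p_value <- function(feature, target, bins = 4) {
  breaks <- quantile(feature, probs = seq(0, 1, length.out = bins + 1))
  binned <- cut(feature, breaks = breaks, include.lowest = TRUE)
  chisq.test(table(binned, target))$p.value
}

p_values <- c(
  related = chi_sq_p_value(related, target),
  noise = chi_sq_p_value(noise, target)
)

# Smaller p-values suggest a stronger association with the target
print(sort(p_values))
```

A feature with a very small p-value would be a candidate to keep. The number of bins and the p-value cutoff are judgment calls that depend on the dataset.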

In its most general form, the chi-squared statistic can be computed as follows:
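
The standard Pearson form of the statistic, which we assume is the one intended here, sums over the cells of the contingency table between the feature and the target:

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```

Here $O_i$ is the observed count in cell $i$, $E_i$ is the count expected if the feature and the target were independent, and $k$ is the number of cells. The symbols $O_i$, $E_i$, and $k$ are notation chosen for this sketch.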

This tests whether or...