Machine Learning Quick Reference

By: Rahul Kumar

Overview of this book

Machine learning makes it possible to learn about the unknowns and gain hidden insights into your datasets by mastering many tools and techniques. This book guides you to do just that in a very compact manner. After giving a quick overview of what machine learning is all about, Machine Learning Quick Reference jumps right into its core algorithms and demonstrates how they can be applied to real-world scenarios. From model evaluation to optimizing their performance, this book will introduce you to the best practices in machine learning. Furthermore, you will also look at the more advanced aspects such as training neural networks and work with different kinds of data, such as text, time-series, and sequential data. Advanced methods and techniques such as causal inference, deep Gaussian processes, and more are also covered. By the end of this book, you will be able to train fast, accurate machine learning models at your fingertips, which you can easily use as a point of reference.

Training data – development data – test data

This is one of the most important steps in building a model, and it often prompts debate over whether we really need all three sets (train, dev, and test) and, if so, how the data should be divided among them. Let's understand these concepts.

Once we have sufficient data to start modeling, the first thing we need to do is partition the data into three segments: the training set, the development set, and the test set.
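The partitioning step can be sketched in a few lines of NumPy. This is only an illustration, not the book's own code: the function name `three_way_split`, the 60:20:20 default, and the fixed seed are all assumptions made for the example.

```python
import numpy as np

def three_way_split(n_samples, train_frac=0.6, dev_frac=0.2, seed=0):
    """Return shuffled index arrays for the training, development, and test sets.

    The fractions are illustrative defaults (assumed here), not a fixed rule.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)      # shuffle before splitting
    n_train = round(train_frac * n_samples)
    n_dev = round(dev_frac * n_samples)
    train_idx = indices[:n_train]
    dev_idx = indices[n_train:n_train + n_dev]
    test_idx = indices[n_train + n_dev:]      # whatever remains goes to the test set
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Shuffling first matters when the records are stored in some meaningful order; the three index arrays are disjoint and together cover every record exactly once.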

Let's examine the goal of having these three sets:

Training set: The training set is used to train the model. When we apply any algorithm, we are fitting the model's parameters to the training set. In the case of a neural network, this is where the weights are learned.

Let's say in one scenario that we are trying to fit polynomials of various degrees:

- f(x) = a + bx → 1st-degree polynomial
- f(x) = a + bx + cx² → 2nd-degree polynomial
- f(x) = a + bx + cx² + dx³ → 3rd-degree polynomial

After fitting the models, we calculate the training error for each of them.

We cannot assess how good a model is based on the training error alone. If we do that, it will lead us to a biased choice: a model that fits the training data closely but might not perform well on unseen data. To counter that, we need to head into the development set.
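To make the polynomial scenario concrete, here is a minimal sketch that fits the three polynomials with NumPy and records the training error of each. The synthetic data (a noisy quadratic) and the choice of mean squared error are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic training data: a noisy quadratic (an assumption for this example).
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 30)
y_train = 1.0 + 2.0 * x_train - 1.5 * x_train**2 + rng.normal(0, 0.1, x_train.size)

train_errors = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    preds = np.polyval(coeffs, x_train)
    train_errors[degree] = float(np.mean((preds - y_train) ** 2))  # training MSE

for degree, err in train_errors.items():
    print(f"degree {degree}: training MSE = {err:.4f}")
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the training error cannot increase with the degree. That is exactly why it is a poor basis for choosing between the models.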

Development set: This is also called the holdout set or validation set. The goal of this set is to tune the models that we have fitted on the training set, and it is also part of assessing how well each model is performing. Based on that performance, we take steps to tune the model: controlling the learning rate, reducing overfitting, and selecting the best model of the lot all take place on the development set. Here, again, the development set error is calculated for each model, and we tune further after seeing which model gives the least error. The model with the least error at this stage may still need tuning to reduce overfitting. Once we are convinced about the best model, it is chosen and we head toward the test set.
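As a rough sketch of model selection on the development set, the example below fits polynomials of degree 1 to 3 on the training portion and picks the one with the lowest development error. The synthetic data, the 150/50 split, and the helper name `mse` are assumptions for illustration only.

```python
import numpy as np

# Synthetic data: a noisy quadratic (an assumption for this example).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.1, x.size)

# The points are already in random order, so a simple slice is enough here.
x_train, y_train = x[:150], y[:150]
x_dev, y_dev = x[150:], y[150:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

dev_errors = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on training data only
    dev_errors[degree] = mse(coeffs, x_dev, y_dev)  # evaluate on held-out data

best_degree = min(dev_errors, key=dev_errors.get)
print("development errors:", dev_errors)
print("selected degree:", best_degree)
```

Unlike the training error, the development error is measured on data the fit never saw, so a more flexible model is not guaranteed to score better.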

Test set: The test set is primarily used to assess the selected model. At this stage, the accuracy of the model is calculated, and if it does not deviate much from the training accuracy and development accuracy, we send this model for deployment.
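The final check can be written as a small helper. Note that the text does not say how much deviation is acceptable; the `tolerance` value of 0.05, the function name `ready_for_deployment`, and the accuracy figures below are all hypothetical.

```python
def ready_for_deployment(train_acc, dev_acc, test_acc, tolerance=0.05):
    """Return True if test accuracy is within `tolerance` of both other accuracies.

    `tolerance` is a hypothetical threshold; choose one that suits the problem.
    """
    return (abs(test_acc - train_acc) <= tolerance
            and abs(test_acc - dev_acc) <= tolerance)

# Made-up accuracies, for illustration only.
print(ready_for_deployment(0.92, 0.90, 0.89))  # small gaps between the three sets
print(ready_for_deployment(0.99, 0.85, 0.80))  # large gap: likely overfitting
```

A large gap between training and test accuracy, as in the second call, usually means the model has overfitted and should go back for more tuning rather than to deployment.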

Size of the training, development, and test set

Typically, machine learning practitioners choose the size of the three sets in the ratio of 60:20:20 or 70:15:15. However, there is no hard and fast rule that states that the development and test sets should be of equal size. The following diagram shows the different sizes of the training, development, and test sets:

Another example of the three different sets is as follows:

But what about scenarios where we have big data to deal with? For example, if we have 10,000,000 records or observations, how would we partition the data? In such a scenario, ML practitioners take most of the data for the training set (as much as 98-99%), and the rest is divided between the development and test sets. This is done so that the training set can cover as many different kinds of scenarios as possible. Even if we keep only 1% of the data for development and the same for the test set, we will end up with 100,000 records each, and that is a good number.
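The arithmetic behind these ratios is easy to check with a small helper. The function name `split_sizes` is hypothetical; it uses integer percentages so that the counts come out exact.

```python
def split_sizes(n_records, train_pct, dev_pct, test_pct):
    """Return (train, dev, test) record counts for integer percentages."""
    if train_pct + dev_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    n_dev = n_records * dev_pct // 100
    n_test = n_records * test_pct // 100
    n_train = n_records - n_dev - n_test   # remainder goes to training
    return n_train, n_dev, n_test

print(split_sizes(10_000, 60, 20, 20))      # (6000, 2000, 2000)
print(split_sizes(10_000_000, 98, 1, 1))    # (9800000, 100000, 100000)
```

With 10,000,000 records, a 98:1:1 split still leaves 100,000 records each for development and testing, which matches the figure quoted above.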
