Statistics for Machine Learning

By : Pratap Dangeti

Overview of this book

Complex statistics in machine learning worry a lot of developers. Knowing statistics helps you build strong machine learning models that are optimized for a given problem statement. This book will teach you all it takes to perform the complex statistical computations that are required for machine learning. You will gain information on the statistics behind supervised learning, unsupervised learning, reinforcement learning, and more. You will see real-world examples that discuss the statistical side of machine learning and familiarize yourself with it. You will come across programs for performing tasks such as modeling, parameter fitting, regression, classification, density collection, working with vectors, matrices, and more. By the end of the book, you will have mastered the statistics required for machine learning and will be able to apply your new skills to any sort of industry problem.

Machine learning model overview

Machine learning models are classified mainly into supervised, unsupervised, and reinforcement learning methods. Each technique is discussed in detail in later chapters; here is a very basic summary of them:

Supervised learning: This is where an instructor provides feedback to a student on whether they have performed well in an examination or not. Here, a target variable is present and models are tuned to predict it. Many machine learning methods fall into this category:
- Classification problems
  - Logistic regression
  - Lasso and ridge regression
  - Decision trees (classification trees)
  - Bagging classifier
  - Random forest classifier
  - Boosting classifier (AdaBoost, gradient boost, and XGBoost)
  - SVM classifier
  - Recommendation engine
- Regression problems
  - Linear regression (lasso and ridge regression)
  - Decision trees (regression trees)
  - Bagging regressor
  - Random forest regressor
  - Boosting regressor (AdaBoost, gradient boost, and XGBoost)
  - SVM regressor
Unsupervised learning: In the teacher-student analogy, this is the case in which no instructor is present to provide feedback, and the student needs to prepare on his/her own. Unsupervised learning does not have as many methods as supervised learning:
- Principal component analysis (PCA)
- K-means clustering
Reinforcement learning: This is the scenario in which an agent needs to take multiple decisions before reaching the target, and the environment provides only a reward, either +1 or -1, rather than indicating how well or how badly the agent performed along the path:
- Markov decision process
- Monte Carlo methods
- Temporal difference learning
Logistic regression: This is used for problems in which outcomes are discrete classes rather than continuous values. For example, whether a customer will arrive or not, whether he will purchase the product or not, and so on. In statistical methodology, it uses the maximum likelihood method to estimate the parameters of individual variables. In machine learning methodology, log loss is minimized with respect to the β coefficients (also known as weights). The two views lead to the same estimates, because log loss is the negative log-likelihood. Logistic regression has a high bias and a low variance error.
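
The log-loss view can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the text: a single input variable, a made-up toy dataset, and plain gradient descent with a hand-picked learning rate. The names `log_loss` and `fit_logistic` are hypothetical helpers, not library functions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(xs, ys, b0, b1):
    """Average negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Minimize log loss with respect to the coefficients by gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the average log loss with respect to b0 and b1.
        g0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)) / n
        g1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(xs, ys)) / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Toy data (assumed): larger x makes class 1 more likely, with some overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print(log_loss(xs, ys, 0.0, 0.0), log_loss(xs, ys, b0, b1))
```

With all coefficients at zero the model predicts 0.5 for every observation, so the starting loss is ln 2; the fitted coefficients should give a lower value.
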
Linear regression: This is used for the prediction of continuous variables such as customer income and so on. In statistical methodology, it fits the best possible line by minimizing the error (least squares). In machine learning methodology, squared loss is minimized with respect to the β coefficients. Linear regression also has a high bias and a low variance error.
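
As a small illustration of squared-loss minimization, the sketch below fits a line to one input variable using the standard closed-form least squares solution. The data values and the helper names `fit_line` and `squared_loss` are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def squared_loss(xs, ys, b0, b1):
    """Sum of squared residuals for the given coefficients."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Toy data (assumed): roughly y = 2x with a little noise.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(xs, ys)
print(b0, b1, squared_loss(xs, ys, b0, b1))
```

Any other choice of intercept or slope gives a larger squared loss on the same data, which is what "best possible line" means here.
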
Lasso and ridge regression: These use regularization to control overfitting issues by applying a penalty on coefficients. In ridge regression, a penalty is applied on the sum of squares of the coefficients, whereas in lasso, a penalty is applied on the sum of the absolute values of the coefficients. The penalty can be tuned in order to change the dynamics of the model fit. Ridge regression shrinks the magnitude of coefficients, whereas lasso can shrink some of them exactly to zero, which eliminates those variables from the model.
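
The difference between shrinking and eliminating can be seen in a simplified setting. The sketch below assumes a single coefficient whose unpenalized least squares estimate is `z` (equivalently, an orthonormal design), in which case both penalized problems have closed-form solutions. The function names and the sample values are illustrative assumptions.

```python
def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (z - b)**2 + 0.5 * lam * b**2."""
    return z / (1.0 + lam)

def lasso_shrink(z, lam):
    """Minimizer of 0.5 * (z - b)**2 + lam * abs(b) (soft thresholding)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Assumed unpenalized estimates for three variables and one penalty strength.
ols = [3.0, -0.4, 0.1]
lam = 0.5
ridge = [ridge_shrink(z, lam) for z in ols]
lasso = [lasso_shrink(z, lam) for z in ols]
print(ridge)
print(lasso)
```

Ridge scales every coefficient towards zero but leaves all of them non-zero, while lasso sets the two small coefficients exactly to zero.
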
Decision trees: Recursive binary splitting is applied to split the classes at each level to classify observations to their purest class. The classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class. Decision trees tend to overfit because of the high variance in the way they fit the data; pruning is applied after growing the tree completely in order to reduce the overfitting problem. Decision trees have a low bias and a high variance error.
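
The classification error rate and a single binary split can be sketched as follows. This is a minimal illustration with one numeric variable and made-up labels; practical tree implementations often use other impurity measures such as Gini or entropy, and `best_split` is a hypothetical helper that performs only one level of splitting.

```python
from collections import Counter

def misclassified(labels):
    """Number of observations not in the most common class of a region."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def classification_error(labels):
    """Fraction of observations not in the most common class."""
    return misclassified(labels) / len(labels)

def best_split(xs, labels):
    """Try midpoints between distinct x values; return (threshold, error)."""
    values = sorted(set(xs))
    best = None
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        left = [l for x, l in zip(xs, labels) if x <= t]
        right = [l for x, l in zip(xs, labels) if x > t]
        err = (misclassified(left) + misclassified(right)) / len(labels)
        if best is None or err < best[1]:
            best = (t, err)
    return best

# Toy data (assumed).
xs = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "a"]
print(classification_error(labels))
print(best_split(xs, labels))
```

A full tree would apply `best_split` recursively to each resulting region until the regions are pure or a stopping rule is met.
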
Bagging: This is an ensemble technique applied on decision trees in order to minimize the variance error and at the same time not increase the error component due to bias. In bagging, various samples are drawn, each with a subsample of observations and all variables (columns). Individual decision trees are then fit independently on each sample, and the results are later ensembled by taking the majority vote (in regression cases, the mean of the outcomes is calculated).
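
A minimal sketch of the sample-fit-vote loop is shown below. It makes several assumptions beyond the text: samples are drawn with replacement (bootstrap samples), the base learner is a one-variable threshold rule standing in for a full decision tree, and the data and the helper names are made up for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) observations with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Pick the threshold with the fewest training errors.

    rows holds (x, label) pairs with labels 0/1; the stump predicts 1
    when x is above the threshold.
    """
    xs = sorted({x for x, _ in rows})
    candidates = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return min(candidates, key=lambda t: sum((x > t) != bool(y) for x, y in rows))

def bagged_predict(thresholds, x):
    """Majority vote over the individual stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data (assumed): label 1 becomes more likely as x grows.
rows = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1)]
rng = random.Random(0)
thresholds = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
print(bagged_predict(thresholds, 1), bagged_predict(thresholds, 8))
```

Each stump sees a slightly different sample and so chooses a slightly different threshold; the vote averages out that variation.
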
Random forest: This is similar to bagging except for one difference. In bagging, all the variables/columns are selected for each sample, whereas in random forest only a random subset of the columns is selected. The reason for selecting a few variables rather than all of them is that, when every tree sees every variable, the most significant variables always come first in the top layer of splitting. This makes all the trees look more or less similar and defeats the purpose of an ensemble, which works better on diversified and independent individual models than on correlated individual models. Random forest has both low bias and low variance errors.
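
The column-selection step is the only part that differs from bagging, and it can be sketched on its own. The subset size used here (the square root of the number of columns) is a common default for classification and is an assumption of this example, as is the helper name `pick_features`.

```python
import math
import random

def pick_features(n_features, rng, k=None):
    """Choose a random subset of column indices for one tree."""
    if k is None:
        # Assumed default: square root of the number of columns.
        k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(42)
subsets = [pick_features(9, rng) for _ in range(5)]
print(subsets)
```

Because each tree is restricted to different columns, a single dominant variable cannot appear at the top of every tree, which keeps the trees less correlated.
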
Boosting: This is a sequential algorithm that is applied on weak classifiers such as a decision stump (a one-level decision tree or a tree with one root node and two terminal nodes) to create a strong classifier by ensembling the results. The algorithm starts with equal weights assigned to all the observations, followed by subsequent iterations in which more focus is given to misclassified observations by increasing their weight and decreasing the weight of properly classified observations. In the end, all the individual classifiers are combined to create a strong classifier. Boosting might have an overfitting problem, but by carefully tuning the parameters, we can obtain a very strong machine learning model.
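
The reweighting step can be sketched for a single iteration. The update below follows the classic AdaBoost formulas for two-class problems, which is one specific boosting variant chosen for the illustration; the function name and the pattern of correct and incorrect predictions are made up.

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost reweighting step.

    weights: current observation weights (summing to 1).
    correct: True where the weak classifier was right.
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - err) / err)  # say of this weak classifier
    updated = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

n = 5
weights = [1 / n] * n                       # equal weights to start
correct = [True, True, True, False, True]   # assumed stump results
alpha, new_weights = adaboost_round(weights, correct)
print(alpha, new_weights)
```

After the update, the single misclassified observation carries half of the total weight, so the next weak classifier is pushed to get it right.
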
Support vector machines (SVMs): These maximize the margin between classes by fitting the separating hyperplane that leaves the widest possible margin between them. In the case of non-linearly separable classes, kernels are used to move observations into a higher-dimensional space, where they are then separated linearly by a hyperplane.
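
The idea of moving observations into a higher-dimensional space can be illustrated with an explicit mapping. The sketch below is an assumption-laden toy: it maps one-dimensional points to (x, x²) by hand and uses a hand-picked separating line, whereas a real SVM computes the mapping implicitly through the kernel and finds the maximum-margin hyperplane by optimization.

```python
def feature_map(x):
    """Explicit mapping of a 1-D point into 2-D space (illustrative)."""
    return (x, x * x)

def threshold_separable(xs, ys):
    """True if a single cut on the x axis separates the two classes."""
    labels = [y for _, y in sorted(zip(xs, ys))]
    changes = sum(a != b for a, b in zip(labels, labels[1:]))
    return changes <= 1

def predict(x, w=(0.0, 1.0), b=-2.5):
    """Linear rule in the mapped space; w and b are hand-picked here."""
    z = feature_map(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score > 0 else -1

# Toy data (assumed): the outer points belong to one class, the inner to another.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1, 1, -1, -1, -1, 1, 1]
print(threshold_separable(xs, ys))
print([predict(x) for x in xs])
```

No single cut on the original axis separates the classes, but in the mapped space the horizontal line x² = 2.5 does.
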
Recommendation engine: This utilizes a collaborative filtering algorithm to identify, for each user, high-probability items that the user has not used in the past, by considering the tastes of similar users who have used those items. It uses the alternating least squares (ALS) methodology to solve this problem.
Principal component analysis (PCA): This is a dimensionality reduction technique in which principal components are calculated in place of the original variables. Principal components are determined along the directions in which the variance in the data is maximum; subsequently, the top n components that together cover about 80 percent of the variance are taken and used in further modeling processes, or exploratory analysis is performed on them as unsupervised learning.
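
Choosing the top n components can be sketched once the variance of each component is known. The example assumes the component variances (the eigenvalues of the covariance matrix) have already been computed; the values and the function name are made up for the illustration.

```python
def components_for_variance(eigenvalues, target=0.80):
    """Smallest n such that the top n components cover the target share."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for n, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return n
    return len(ordered)

# Assumed component variances for a five-variable dataset.
eigenvalues = [5.0, 2.0, 1.5, 1.0, 0.5]
print(components_for_variance(eigenvalues))        # 80 percent target
print(components_for_variance(eigenvalues, 0.90))  # stricter target
```

Here the first three components cover 85 percent of the total variance, so three components are enough for the 80 percent target.
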
K-means clustering: This is an unsupervised algorithm that is mainly utilized for segmentation exercises. K-means clustering partitions the given data into k clusters in such a way that variation within each cluster is minimal and variation across clusters is maximal.
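
The usual way to reach such a partition is to alternate between assigning points to the nearest center and recomputing the centers. The sketch below does this for one-dimensional data; the fixed starting centers, the fixed number of iterations, and the data are assumptions made to keep the example deterministic.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate assignment and center update for 1-D points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Toy data (assumed): two visibly separate groups.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 9.5]
centers, clusters = kmeans_1d(points, centers=[1.0, 2.0])
print(centers)
print(clusters)
```

In practice the starting centers are usually chosen at random and the algorithm is run several times, because the result can depend on the initialization.
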
Markov decision process (MDP): In reinforcement learning, an MDP is a mathematical framework for modeling the decision-making of an agent in situations or environments where outcomes are partly random and partly under the agent's control. In this model, the environment is modeled as a set of states and actions that can be performed by an agent to control the system's state. The objective is to control the system in such a way that the agent's total payoff is maximized.
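
A tiny MDP can be written down as a table of states, actions, transition probabilities, and rewards, and then solved with value iteration. Everything specific in the sketch below is an assumption for illustration: the three states, the two actions, the rewards, and the discount factor of 0.9.

```python
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal state: no actions available
}

def value_iteration(transitions, gamma=0.9, sweeps=500):
    """Repeatedly apply the Bellman optimality update to every state."""
    values = {s: 0.0 for s in transitions}

    def q(state, action):
        # Expected reward plus discounted value of the next state.
        return sum(p * (r + gamma * values[s2])
                   for p, s2, r in transitions[state][action])

    for _ in range(sweeps):
        values = {s: max((q(s, a) for a in transitions[s]), default=0.0)
                  for s in transitions}
    policy = {s: max(transitions[s], key=lambda a: q(s, a))
              for s in transitions if transitions[s]}
    return values, policy

values, policy = value_iteration(transitions)
print(values)
print(policy)
```

The resulting policy is the choice of action in each state that maximizes the agent's expected discounted payoff under these assumed dynamics.
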
Monte Carlo method: Monte Carlo methods do not require complete knowledge of the environment, in contrast with MDP. Monte Carlo methods require only experience, which is obtained from sample sequences of states, actions, and rewards from actual or simulated interaction with the environment. Monte Carlo methods explore the space until the final outcome of a chosen sample sequence and update estimates accordingly.
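
Learning from complete sample sequences can be sketched as averaging the returns observed after visiting each state. The example below uses the first-visit variant, which is one common choice and an assumption here, along with made-up episodes and a discount factor of 1.

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of state values.

    Each episode is a list of (state, reward) pairs, where the reward
    is the one received after leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_return = {}
        # Walk backwards so g is the return from each step onwards;
        # the last overwrite for a state corresponds to its first visit.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_return[state] = g
        for state, value in first_return.items():
            returns[state].append(value)
    return {s: sum(v) / len(v) for s, v in returns.items()}

# Toy episodes (assumed), each running until its final outcome.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", -1.0)],
    [("B", 1.0)],
]
print(mc_state_values(episodes))
```

No transition probabilities are needed; the estimates come entirely from the observed sequences, and they are only updated once an episode has finished.
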
Temporal difference learning: This is a core theme in reinforcement learning. Temporal difference is a combination of both Monte Carlo and dynamic programming ideas. Similar to Monte Carlo, temporal difference methods can learn directly from raw experience without a model of the environment's dynamics. Like dynamic programming, temporal difference methods update estimates based in part on other learned estimates, without waiting for a final outcome. Temporal difference is the best of both worlds and is commonly used in game-playing programs such as AlphaGo and so on.
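
Updating an estimate from another estimate, without waiting for the final outcome, can be sketched with the simplest temporal difference rule, TD(0). The step size, the discount factor, and the starting values below are assumptions for the illustration.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=1.0):
    """Move the estimate for state towards reward + gamma * V(next_state)."""
    current = values.get(state, 0.0)
    target = reward + gamma * values.get(next_state, 0.0)
    values[state] = current + alpha * (target - current)
    return values

# Assumed current estimates and one observed transition A -> B with reward 0.
values = {"A": 0.0, "B": 0.5}
td0_update(values, "A", 0.0, "B")
print(values)
```

The estimate for A moves a small step towards the current estimate for B immediately after the transition, rather than waiting for the episode to end as a Monte Carlo update would.
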
