Statistics for Machine Learning

By: Pratap Dangeti

Overview of this book

Complex statistics in machine learning worry a lot of developers. Knowing statistics helps you build strong machine learning models that are optimized for a given problem statement. This book teaches you how to perform the statistical computations that machine learning requires. You will learn the statistics behind supervised learning, unsupervised learning, reinforcement learning, and more, and work through real-world examples that discuss the statistical side of machine learning. You will come across programs for tasks such as modeling, parameter fitting, regression, classification, density collection, and working with vectors and matrices. By the end of the book, you will have covered the statistics required for machine learning and will be able to apply your new skills to a wide range of industry problems.

Title Page

Credits

About the Author

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

Journey from Statistics to Machine Learning

Statistical terminology for model building and validation

Machine learning terminology for model building and validation

Machine learning model overview

Summary

Parallelism of Statistics and Machine Learning

Comparison between regression and machine learning models

Compensating factors in machine learning models

Machine learning models - ridge and lasso regression

Summary

Logistic Regression Versus Random Forest

Maximum likelihood estimation

Logistic regression – introduction and advantages

Random forest

Variable importance plot

Comparison of logistic regression with random forest

Summary

Tree-Based Machine Learning Models

Introducing decision tree classifiers

Comparison between logistic regression and decision trees

Comparison of error components across various styles of models

Remedial actions to push the model towards the ideal region

HR attrition data example

Decision tree classifier

Tuning class weights in decision tree classifier

Bagging classifier

Random forest classifier

Random forest classifier - grid search

AdaBoost classifier

Gradient boosting classifier

Comparison between AdaBoosting versus gradient boosting

Extreme gradient boosting - XGBoost classifier

Ensemble of ensembles - model stacking

Ensemble of ensembles with different types of classifiers

Ensemble of ensembles with bootstrap samples using a single type of classifier

Summary

K-Nearest Neighbors and Naive Bayes

K-nearest neighbors

KNN classifier with breast cancer Wisconsin data example

Tuning of k-value in KNN classifier

Naive Bayes

Probability fundamentals

Understanding Bayes theorem with conditional probability

Naive Bayes classification

Laplace estimator

Naive Bayes SMS spam classification example

Summary

Support Vector Machines and Neural Networks

Support vector machines working principles

Kernel functions

SVM multilabel classifier with letter recognition data example

Artificial neural networks - ANN

Activation functions

Forward propagation and backpropagation

Optimization of neural networks

Dropout in neural networks

ANN classifier applied on handwritten digits using scikit-learn

Introduction to deep learning

Summary

Recommendation Engines

Content-based filtering

Collaborative filtering

Evaluation of recommendation engine model

Unsupervised Learning

K-means clustering

Principal component analysis - PCA

Singular value decomposition - SVD

Deep auto encoders

Model building technique using encoder-decoder architecture

Deep auto encoders applied on handwritten digits using Keras

Summary

Reinforcement Learning

Introduction to reinforcement learning

Comparing supervised, unsupervised, and reinforcement learning in detail

Characteristics of reinforcement learning

Reinforcement learning basics

Markov decision processes and Bellman equations

Dynamic programming

Grid world example using value and policy iteration algorithms with basic Python

Monte Carlo methods

Temporal difference learning

SARSA on-policy TD control

Q-learning - off-policy TD control

Cliff walking example of on-policy and off-policy of TD control

Applications of reinforcement learning with integration of machine learning and deep learning

Preface

Complex statistics in machine learning worry a lot of developers. Knowing statistics helps you build strong machine learning models that are optimized for a given problem statement. I believe that any machine learning practitioner should be proficient in statistics as well as in mathematics, so that they can reason about and solve any machine learning problem efficiently. In this book, we will cover the fundamentals of statistics and machine learning, giving you a holistic view of how machine learning techniques apply to relevant problems. We will discuss the application of frequently used algorithms to problems from various domains, using both Python and R. We will use libraries such as scikit-learn, e1071, randomForest, C50, xgboost, and so on. We will also go over the fundamentals of deep learning with the help of Keras. Furthermore, we will give an overview of reinforcement learning using pure Python.

The book is motivated by the following goals:

To help newcomers get up to speed with the fundamentals, while also allowing experienced professionals to refresh their knowledge of various concepts and gain more clarity when applying algorithms to their chosen data.
To give a holistic view of both Python and R by working through various examples in both languages.
To introduce new trends in machine learning by covering the fundamentals of deep learning and reinforcement learning, with suitable examples that teach you state-of-the-art techniques.

What this book covers

Chapter 1, Journey from Statistics to Machine Learning, introduces you to all the necessary fundamentals and basic building blocks of both statistics and machine learning. All fundamentals are explained with the support of both Python and R code examples across the chapter.

Chapter 2, Parallelism of Statistics and Machine Learning, compares the differences and draws parallels between statistical modeling and machine learning using linear regression and lasso/ridge regression examples.

Chapter 3, Logistic Regression Versus Random Forest, compares logistic regression and random forest using a classification example, explaining the detailed steps of both modeling processes. By the end of this chapter, you will have a complete picture of both the statistical and the machine learning streams.

Chapter 4, Tree-Based Machine Learning Models, focuses on the various tree-based machine learning models used by industry practitioners, including decision trees, bagging, random forest, AdaBoost, gradient boosting, and XGBoost with the HR attrition example in both languages.

Chapter 5, K-Nearest Neighbors and Naive Bayes, illustrates simple methods of machine learning. K-nearest neighbors is explained using breast cancer data. The Naive Bayes model is explained with a message classification example using various NLP preprocessing techniques.

Chapter 6, Support Vector Machines and Neural Networks, describes the various functionalities involved in support vector machines and the usage of kernels. It then provides an introduction to neural networks. Fundamentals of deep learning are exhaustively covered in this chapter.

Chapter 7, Recommendation Engines, shows how to find similar movies based on similar users, using the user-user similarity matrix. In the second section, recommendations are made from the movie-movie similarity matrix, in which similar movies are extracted using cosine similarity. Finally, the collaborative filtering technique, which considers both users and movies to determine recommendations, is applied using the alternating least squares methodology.

Chapter 8, Unsupervised Learning, presents various techniques such as k-means clustering, principal component analysis, singular value decomposition, and deep learning based deep auto encoders. The chapter ends with an explanation of why deep auto encoders are much more powerful than conventional PCA techniques.

Chapter 9, Reinforcement Learning, covers techniques that learn the optimal path to reach a goal over episodic states, starting from Markov decision processes and moving on to dynamic programming, Monte Carlo methods, and temporal difference learning. Finally, some use cases are provided for applications that combine machine learning and reinforcement learning.
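
To make the movie-movie similarity idea from Chapter 7 concrete, here is a minimal sketch of cosine similarity between movie columns of a ratings matrix. It is not taken from the book: the ratings values and the helper name cosine_similarity_matrix are hypothetical, and treating unrated entries as 0 is a simplifying assumption.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies (0 = not rated).
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
])


def cosine_similarity_matrix(columns):
    """Return the cosine similarity between every pair of columns."""
    norms = np.linalg.norm(columns, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for movies nobody rated
    normalized = columns / norms
    return normalized.T @ normalized


movie_sim = cosine_similarity_matrix(ratings)

# Rank the other movies by their similarity to movie 0, most similar first.
ranked = [int(j) for j in np.argsort(-movie_sim[0]) if j != 0]
print("Movies most similar to movie 0:", ranked)
```

Movies 0 and 1 are rated highly by the same users, so movie 1 ranks ahead of movie 2. A user-user similarity matrix can be sketched the same way by passing the transposed ratings.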

What you need for this book

This book assumes that you know the basics of Python and R and how to install libraries. It does not assume prior knowledge of advanced statistics or mathematics, such as linear algebra.

The following versions of software are used throughout this book, but the code should run fine with more recent versions as well:

Anaconda3 4.3.1 (includes Python 3.6.1 and its relevant packages: NumPy 1.12.1, Pandas 0.19.2, and scikit-learn 0.18.1)
R 3.4.0 and RStudio 1.0.143
Theano 0.9.0
Keras 2.0.2
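
If you want to compare your own environment against the versions listed above, the following sketch prints what is installed. It is an illustration rather than part of the book: the helper name installed_versions is hypothetical, and it assumes Python 3.8 or later (newer than the Python 3.6.1 listed above) because it relies on the standard library module importlib.metadata.

```python
import sys
from importlib import metadata


def installed_versions(distribution_names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distribution_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI (an assumption for this sketch).
wanted = ["numpy", "pandas", "scikit-learn", "Theano", "Keras"]

print("Python", sys.version.split()[0])
for name, version in installed_versions(wanted).items():
    print(name, version if version is not None else "not installed")
```

The lookup reads package metadata only, so it does not import the libraries themselves.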

Who this book is for

This book is intended for developers with little to no background in statistics who want to implement machine learning in their systems. Some programming knowledge in R or Python will be useful.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The mode function was not implemented in the numpy package."

Any command-line input or output is written as follows:

>>> import numpy as np
>>> from scipy import stats
>>> data = np.array([4, 5, 1, 2, 7, 2, 6, 9, 3])
>>> # Calculate the mean
>>> dt_mean = np.mean(data)
>>> print("Mean :", round(dt_mean, 2))
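
Since the convention example above notes that NumPy does not provide a mode function, the sketch below shows one possible way to compute the mode with NumPy alone, next to the mean and median of the same data. The helper name mode_of is hypothetical and is not the book's own code, which imports scipy.stats for this purpose.

```python
import numpy as np

data = np.array([4, 5, 1, 2, 7, 2, 6, 9, 3])


def mode_of(values):
    """Return the most frequent value (the smallest one if there is a tie)."""
    uniques, counts = np.unique(values, return_counts=True)
    # np.unique returns sorted values, and argmax picks the first maximum.
    return uniques[np.argmax(counts)]


dt_mean = np.mean(data)
dt_median = np.median(data)
dt_mode = mode_of(data)

print("Mean :", round(dt_mean, 2))
print("Median :", dt_median)
print("Mode :", dt_mode)
```

Here the value 2 is the only one that appears twice, so it is reported as the mode.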

New terms and important words are shown in bold.

Note

Warnings or important notes appear like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you thought about this book, including what you liked or disliked. Reader feedback is important to us, as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Statistics-for-Machine-Learning. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in given outputs. You can download this file from https://www.packtpub.com/sites/default/files/downloads/StatisticsforMachineLearning_ColorImages.pdf.

Errata

Although we have taken care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us to improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address it.
