Book Image

Effective Amazon Machine Learning

By : Alexis Perrier
Book Image

Effective Amazon Machine Learning

By: Alexis Perrier

Overview of this book

Predictive analytics is a complex domain requiring coding skills, an understanding of the mathematical concepts underpinning machine learning algorithms, and the ability to create compelling data visualizations. Following AWS simplifying Machine learning, this book will help you bring predictive analytics projects to fruition in three easy steps: data preparation, model tuning, and model selection. This book will introduce you to the Amazon Machine Learning platform and will implement core data science concepts such as classification, regression, regularization, overfitting, model selection, and evaluation. Furthermore, you will learn to leverage the Amazon Web Service (AWS) ecosystem for extended access to data sources, implement realtime predictions, and run Amazon Machine Learning projects via the command line and the Python SDK. Towards the end of the book, you will also learn how to apply these services to other problems, such as text mining, and to more complex datasets.
Table of Contents (17 chapters)
Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Dedication
Preface

Diving further into linear modeling for prediction


Amazon ML is based on linear modeling. Recall the equation for a straight line in the plan:

This linear equation with coefficients (a, b) can be interpreted as a predictive linear model with x as the predictor and y as the outcome. In this simple case, we have two parameters (a, b) and one predictor x. An example can be that of predicting the height of children with respect to their weight and find some a and b such that the following equation is true:

Let's consider the classic Lewis Taylor (1967) dataset with 237 samples of children's age, weight, height, and gender (https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm) and focus on the relation between the height and weight of the children. In this dataset, the optimal regression line follows the following equation:

The following figure illustrates the height versus weight dataset and the associated linear regression:

Consider now that we have not one predictor but several, and let's generalize the preceding linear equation to N predictors denoted by {x1, . . . , xn} and N +1 coefficients or {wo, w1, . . ., wn} weights. The linear model can be written as follows:

Here, ŷ denotes the predicted value, (y would correspond to the true value to be predicted). To simplify notations, we will assume for the rest of the book the coefficient wo = 0.

This equation can be rewritten in vector form as follows:

Where T is the transpose operator, X = {x1, . . ., xn} and  W= {w1, . . .,wn} are the respective vectors of predictors and model weights. Under certain conditions, the coefficients wi can be calculated precisely. However, for a large number of samples N, these calculations are expensive in terms of required computations as they involve inverting matrices of dimension N, which for large datasets is costly and slow. As the number of samples grows, it becomes more efficient to estimate these model coefficients via an iterative process.

The Stochastic Gradient Descent algorithm iteratively estimates the coefficients {wo, w1, . . ., wn} of the model. At each iteration, it uses a random sample of the training dataset for which the real outcome value is known. The SGD algorithm works by minimizing a function of the prediction error:

Functions that take the prediction error as argument are also called loss functions. Different loss functions result in different algorithms. A convex loss function has a unique minimum, which corresponds to the optimal set of weights for the regression problem. We will come back to the SGD algorithm in details in later chapters. Suffice to say for now that the SGD algorithm is especially well-suited to deal with large datasets.

There are many reasons to justify selecting the SGD algorithm for general purpose predictive analysis problems:

  • It is robust
  • Its convergence properties have been extensively studied and are well known
  • It is well adapted to optimization techniques
  • It has many extensions and variants
  • It has low computational cost
  • It can be applied to regression, classification, and streaming data

Some weaknesses include the following:

  • The need to properly initialize its parameters
  • A convergence rate dependent on a parameter called the learning rate

Validating the dataset

Not all datasets lend themselves to linear modeling. There are several conditions that the samples must verify for your linear model to make sense. Some conditions are strict, others can be relaxed.

In general, linear modeling assumes the following conditions (http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/):

  • Normalization/standardization: Linear regression can be sensitive to predictors that exhibit very different scales. This is true for all loss functions that rely on a measure of the distance between samples or on the standard deviations of samples. Predictors with higher means and standard deviations have more impact on the model and may potentially overshadow predictors with better predictive power but more constrained range of values. Standardization of predictors puts all the predictors on the same level.
  • Independent and identically distributed (i.i.d.): The samples are assumed to be independent from each other and to follow a similar distribution. This property is often assumed even when the samples are not that independent from each other. In the case of time series where samples depend on previous values, using the sample to sample difference as the data is often enough to satisfy the independence assumption. As we will see in Chapter 2, Machine Learning Definitions and Concepts, confounders and noise will also negatively impact linear regression.
  • No multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data, meaning that one predictor is not a linear composition of other predictors. Predictors that can be approximated by linear combinations of other predictors will confuse the model.
  • Heteroskedasticity: The standard deviation of a predictor is constant across the whole range of its values.
  • Gaussian distribution of the residuals: This is more than a posteriori validation that the linear regression is valid. The residuals are the differences between the true values and their linear estimation. The linear regression is considered relevant if these residuals follow a Gaussian distribution.

These assumptions are rarely perfectly met in real-life datasets. As we will see in Chapter 2, Machine Learning Definitions and Concepts, there are techniques to detect when the linear modeling assumptions are not respected, and subsequently to transform the data to get closer to the ideal linear regression context.

Missing from Amazon ML

Amazon ML offers supervised learning predictions for classification (binary and multiclass) and regression problems. It offers some very basic visualization of the original data and has a preset list of data transformations, such as binning or normalizing the data. It is efficient and simple. However, several functionalities that are important to the data scientist are unfortunately missing from the platform. Lacking these features may not be a deal breaker, but it nonetheless restricts the scope of problems Amazon ML can be applied to.

Some of the common machine learning features Amazon ML does not offer are as follows:

  • Unsupervised learning: It is not possible to do clustering or dimensionality reduction of your data.
  • A choice of models beside linear models: Non-linear Support Vector Machines, any type of Bayes classification, neural networks, and tree, based algorithms (decision trees, random forests, or boosted trees) are all absent models. All predictions, all experiments will be built on linear regression and logistic regression with the SGD.
  • Data visualization capabilities are reduced to histograms and density plots.
  • A choice of metrics: Amazon ML uses F1-score and ROC-AUC metrics for classification, and MSE for regression. It is not possible to assess the model performance with any other metric.
  • You cannot download your trained model and use it anywhere else than Amazon ML.

Finally, although it is not possible to directly use your own scripts (R, Python, Scala, and so on) within the Amazon ML platform, it is possible and recommended to use other AWS services, such as AWS Lambda, to preprocess the datasets. Data manipulation beyond the transformations available in Amazon ML can also be carried out with SQL if your data is stored in one of the AWS SQL enabled services (Athena, RDS, Redshift, and others).

The statistical approach versus the machine learning approach

In 2001, Leo Breiman published a paper titled Statistical Modeling: The Two Cultures (http://projecteuclid.org/euclid.ss/1009213726) that underlined the differences between the statistical approach focused on validation and explanation of the underlying process in the data and the machine learning approach, which is more concerned with the results.

Roughly put, a classic statistical analysis follows steps such as the following:

  1. A hypothesis called the null hypothesis is stated. This null hypothesis usually states that the observation is due to randomness.
  2. The probability (or p-value) of the event under the null hypothesis is then calculated.
  3. If that probability is below a certain threshold (usually p < 0.05), then the null hypothesis is rejected, which means that the observation is not a random fluke.

Note

p> 0.05 does not imply that the null hypothesis is true. It only means that you cannot reject it, as the probability of the observation happening by chance is not large enough.

This methodology is geared toward explaining and discovering the influencing factors of the phenomenon. The goal here is to establish/build a somewhat static and fully known model that will fit observations as well as possible and, therefore, will be able to predict future patterns, behaviors, and observations.

In the machine learning approach, in predictive analytics, an explicit representation of the model is not the focus. The goal is to build the best model for the prediction period, and the model builds itself from the observations. The internals of the models are not explicit. This machine learning approach is called a black box model.

By removing the need for explicit modeling of the data, the ML approach has a stronger potential for predictions. ML is focused on making the most accurate predictions possible by minimizing the prediction error of a model at the expense of explainability.