Mastering Machine Learning Algorithms - Second Edition
Many machine learning problems can be expressed through a proxy function that measures the training error. The implicit assumption is that, by reducing both the training and validation errors, the accuracy increases and the algorithm reaches its objective.
If we consider a supervised scenario (many considerations also hold for semi-supervised ones), with finite datasets X and Y:

$$X = \left\{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\right\} \text{ where } \bar{x}_i \in \mathbb{R}^n \quad \text{and} \quad Y = \left\{y_1, y_2, \ldots, y_N\right\}$$
We can define the generic loss function for a single data point as:

$$L\left(\bar{x}_i, y_i; \bar{\theta}\right)$$

L is a function of the whole parameter set $\bar{\theta}$ and must be proportional to the error between the true label and the predicted label.
A very important property of a loss function is convexity. In many real cases, this is an almost impossible condition; however, it's always useful to look for convex loss functions, because they can be easily optimized through the gradient descent method. We're going to discuss this topic in Chapter 10, Introduction to Time-Series Analysis.
However, for now, it's useful to consider a loss function as an intermediary between our training process and a pure mathematical optimization. The missing link is the complete data. As already discussed, X is drawn from pdata, so it should represent the true distribution. Therefore, when we minimize the loss function, we're considering a potential subset of points, and never the whole real dataset.
In many cases, this isn't a limitation. If the bias is null and the variance is small enough, the resulting model will show a good generalization ability, with high training and validation accuracy; however, considering the data generating process, it's useful to introduce another measure called expected risk:

$$R\left[\bar{\theta}\right] = E_{(\bar{x}, y) \sim p_{data}}\left[L\left(\bar{x}, y; \bar{\theta}\right)\right] = \int L\left(\bar{x}, y; \bar{\theta}\right) p_{data}\left(\bar{x}, y\right)\, d\bar{x}\, dy$$
This value can be interpreted as an average of the loss function over all possible samples drawn from pdata. However, as pdata is generally continuous, it's necessary to consider an expected value and integrate over all possible
couples, which is often an intractable problem. The minimization of the expected risk implies the maximization of global accuracy, which, in turn, corresponds to the optimal outcome.
On the other hand, in real-world scenarios we work with a finite number of training samples. Since that's the case, it's preferable to define a cost function, which is often called a loss function as well, and is not to be confused with the log-likelihood:

$$J\left(\bar{\theta}\right) = \sum_{i=1}^{N} L\left(\bar{x}_i, y_i; \bar{\theta}\right)$$
This is the actual function that we're going to minimize. Divided by the number of samples (a factor that doesn't affect the minimization), it's also called the empirical risk, because it's an approximation, based on the finite sample X, of the expected risk. In other words, we want to find a set of parameters so that:

$$\bar{\theta}^{*} = \operatorname*{argmin}_{\bar{\theta}} J\left(\bar{\theta}\right) \approx \operatorname*{argmin}_{\bar{\theta}} R\left[\bar{\theta}\right]$$
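As a quick numerical sketch of this approximation, consider a hypothetical data generating process y = 2x + Gaussian noise (the process, parameters, and function names here are our own illustration). At the true parameter, the expected risk equals the noise variance, and the empirical risk approaches it as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(theta, x, y):
    """Mean squared loss over a finite sample: an estimate of the expected risk."""
    return np.mean((y - theta * x) ** 2)

def sample(n):
    """Draw n points from a hypothetical process y = 2x + N(0, 0.5^2)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.5, n)
    return x, y

# With theta = 2 (the true parameter), the expected risk is the noise
# variance, 0.25; the empirical risk converges to it as N grows.
for n in (10, 1000, 100000):
    x, y = sample(n)
    print(n, empirical_risk(2.0, x, y))
```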
When the cost function has more than two parameters, it's very difficult, and perhaps even impossible, to visualize its internal structure. However, we can analyze some potential conditions using a bidimensional diagram:

Different kinds of points in a bidimensional scenario
The different situations we can observe are shown in the preceding diagram.
Even if local minima are likely when the model has a small number of parameters, they become very unlikely when the model has a large number of parameters. In fact, an n-dimensional point $\bar{\theta}^{*}$ is a local minimum for a convex function (and here, we're assuming L to be convex) only if:

$$\nabla L\left(\bar{\theta}^{*}\right) = 0 \quad \text{and} \quad H\left(\bar{\theta}^{*}\right) = \nabla^{2} L\left(\bar{\theta}^{*}\right) \succ 0$$
The second condition imposes a positive definite Hessian matrix (equivalently, all the leading principal minors $H_k$, made with the first k rows and k columns, must be positive), therefore all its eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ must be positive. This probability decreases with the number of parameters ($H$ is an n × n square matrix and has n eigenvalues), and becomes close to zero in deep learning models where the number of weights can be in the order of millions, or even more. The reader interested in a complete mathematical proof can read Dube S., High Dimensional Spaces, Deep Learning and Adversarial Examples, arXiv:1801.00634 [cs.CV].
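This shrinking probability can be illustrated with a crude random-matrix experiment (our own sketch, not the formal proof in the cited paper; Gaussian symmetric matrices are only a rough stand-in for real Hessians). We draw random symmetric matrices and measure how often all eigenvalues come out positive:

```python
import numpy as np

rng = np.random.default_rng(1)

def fraction_positive_definite(n, trials=2000):
    """Fraction of random symmetric n x n matrices with all-positive eigenvalues."""
    count = 0
    for _ in range(trials):
        a = rng.normal(size=(n, n))
        h = (a + a.T) / 2.0  # symmetrize, as a stand-in for a Hessian
        if np.all(np.linalg.eigvalsh(h) > 0):
            count += 1
    return count / trials

# The fraction collapses toward zero as the dimension grows.
for n in (1, 2, 4, 8):
    print(n, fraction_positive_definite(n))
```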
As a consequence, a more common condition to consider is instead the presence of saddle points, where the eigenvalues have different signs and the orthogonal directional derivatives are null, even if the points are neither local maxima nor local minima. Consider, for example, the following plot:

Saddle point in a three-dimensional scenario
The surface is very similar in shape to a horse saddle: if we project the point onto the XZ plane, it is a minimum, while on the YZ plane it is a maximum. It's straightforward that saddle points are quite dangerous, because many simpler optimization algorithms can slow down and even stop, losing the ability to find the right direction. In Chapter 18, Optimizing Neural Networks, we're going to discuss some methods that are able to mitigate this kind of problem, allowing deep models to converge.
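The saddle behavior can be checked numerically on the classic example surface f(x, y) = x² − y² (our choice of illustration): the gradient vanishes at the origin, yet the Hessian has one negative and one positive eigenvalue, so the point is neither a minimum nor a maximum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at the origin.
def grad(x, y):
    """Gradient of f: (2x, -2y)."""
    return np.array([2.0 * x, -2.0 * y])

# The Hessian is constant for this quadratic surface.
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

print(grad(0.0, 0.0))               # zero gradient at the saddle
print(np.linalg.eigvalsh(hessian))  # eigenvalues of opposite signs
```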
In this section, we'll discuss some common cost functions that are employed in both classification and regression tasks. Some of them will be repeatedly adopted in our examples over the next few chapters, particularly when we're discussing training processes in shallow and deep neural networks.
Mean squared error is one of the most common regression cost functions. Its generic expression (where $\hat{y}_i$ is the model prediction) is:

$$J\left(\bar{\theta}\right) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^{2}$$
This function is differentiable at every point of its domain and it's convex, so it can be optimized using the stochastic gradient descent (SGD) algorithm. This cost function is fundamental in regression analysis using the Ordinary or Generalized Least Squares algorithms, where, for example, it's possible to prove that the estimators are always unbiased.
However, there's a drawback when employing it in regression tasks with outliers. Its value is always quadratic, and therefore, when the distance between the prediction and an actual outlier value is large, the relative error is large, and this can lead to an unacceptable correction.
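A small numeric illustration of this sensitivity (the values are our own): a single corrupted point dominates the mean squared error of an otherwise well-fitted sample.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over a finite sample."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

y_pred = np.zeros(5)                                # predictions near zero
clean   = np.array([0.1, -0.2, 0.0, 0.3, -0.1])     # small residuals
outlier = np.array([0.1, -0.2, 0.0, 0.3, 10.0])     # one corrupted point

print(mse(clean, y_pred))    # small
print(mse(outlier, y_pred))  # dominated by the outlier (roughly 100 / 5)
```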
As explained, mean squared error isn't robust to outliers, because it's always quadratic, independent of the distance between actual value and prediction. To overcome this problem, it's possible to employ the Huber cost function, which is based on a threshold tH, so that for distances less than tH its behavior is quadratic, while for distances greater than tH it becomes linear, reducing the magnitude of the error and thereby the relative importance of the outliers.
The analytical expression (with $d_i = y_i - \hat{y}_i$) is:

$$J\left(\bar{\theta}\right) = \frac{1}{N} \sum_{i=1}^{N} L_{t_H}\left(d_i\right) \quad \text{where} \quad L_{t_H}\left(d_i\right) = \begin{cases} \dfrac{1}{2} d_i^{2} & \text{if } \left|d_i\right| \le t_H \\ t_H \left|d_i\right| - \dfrac{1}{2} t_H^{2} & \text{otherwise} \end{cases}$$
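A minimal NumPy sketch of this quadratic-then-linear behavior (the threshold t_H and residual d = y − ŷ are as in the text; the function name is ours). For a large residual such as 10, the Huber value stays far below the corresponding squared loss:

```python
import numpy as np

def huber(d, t_h=1.0):
    """Huber loss of residual d: quadratic for |d| <= t_h, linear beyond."""
    d = np.abs(d)
    return np.where(d <= t_h, 0.5 * d ** 2, t_h * d - 0.5 * t_h ** 2)

# Compare against the pure quadratic 0.5 * d^2 for growing residuals.
for d in (0.5, 1.0, 10.0):
    print(d, 0.5 * d ** 2, float(huber(d)))
```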
The hinge cost function is adopted by Support Vector Machine (SVM) algorithms, where the goal is to maximize the distance between the separation boundaries, where the support vectors lie. Its analytic expression (assuming $y_i \in \{-1, +1\}$) is:

$$J\left(\bar{\theta}\right) = \sum_{i=1}^{N} \max\left(0, 1 - y_i \hat{y}_i\right)$$
Contrary to the other examples, this cost function is not optimized using classic SGD methods, because it's not differentiable when:

$$y_i \hat{y}_i = 1$$
For this reason, SVM algorithms are optimized using quadratic programming techniques.
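A small sketch of this SVM cost, max(0, 1 − y·ŷ), commonly called the hinge loss (the labels and predictions below are our own illustration): correctly classified points outside the margin contribute zero, while points inside the margin or on the wrong side contribute linearly.

```python
import numpy as np

def hinge(y_true, y_pred):
    """Hinge loss summed over samples; labels must be in {-1, +1}."""
    return np.sum(np.maximum(0.0, 1.0 - y_true * y_pred))

y_true = np.array([1, -1, 1, -1])
y_pred = np.array([2.0, -1.5, 0.25, 0.5])  # last two: inside margin / misclassified

# Contributions: 0 + 0 + 0.75 + 1.5 = 2.25
print(hinge(y_true, y_pred))
```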
Categorical cross-entropy is the most widespread classification cost function, adopted by logistic regression and the majority of neural architectures. The generic analytical expression (for P classes, with one-hot true labels $y_{ij}$ and predicted probabilities $\hat{y}_{ij}$) is:

$$J\left(\bar{\theta}\right) = -\sum_{i=1}^{N} \sum_{j=1}^{P} y_{ij} \log \hat{y}_{ij}$$
This cost function is convex and can easily be optimized using SGD techniques. According to information theory, the entropy of a distribution p is a measure of the amount of uncertainty (expressed in bits if $\log_2$ is used and in nats if the natural logarithm is employed) carried by a particular probability distribution. For example, if we toss a fair coin, then P(Head) = P(Tail) = 0.5 according to a Bernoulli distribution. Therefore, the entropy of such a discrete distribution is:

$$H(X) = -\sum_{x} P(x) \log_2 P(x) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 \text{ bit}$$
The uncertainty is equal to 1 bit, which means that before any experiment there are 2 equally possible outcomes, which is obvious. What happens if we know that the coin is loaded, and P(Head) = 0.1 and P(Tail) = 0.9? The entropy is now:

$$H(X) = -0.1 \log_2 0.1 - 0.9 \log_2 0.9 \approx 0.469 \text{ bits}$$
As our uncertainty is now very low (we know that in 90% of the cases, the outcome will be tails), the entropy is less than half a bit. Entropy must be interpreted as a continuous measure: values close to zero indicate that a single outcome is very likely to appear. In the extreme case, when P(Head) → 0 and P(Tail) → 1, the entropy asymptotically reaches 0 bits, because there's no more uncertainty and the probability distribution is degenerate.
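The coin entropies above can be verified in a few lines of NumPy (the helper name is ours):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution, skipping zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0 by convention
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))  # fair coin: 1 bit
print(entropy_bits([0.1, 0.9]))  # loaded coin: about 0.469 bits
print(entropy_bits([0.0, 1.0]))  # degenerate distribution: 0 bits
```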
The concept of entropy, as well as cross-entropy and other information theory formulas, can be extended to continuous distributions using an integral instead of a sum. For example, the entropy of a normal distribution $N\left(\mu, \sigma^{2}\right)$ is $H = \frac{1}{2} \log\left(2 \pi e \sigma^{2}\right)$. In general, the entropy grows with the variance, or the spread, of the distribution.
This is an intuitive concept: the larger the variance, the larger the region in which the potential outcomes have a similar probability of being selected. As the uncertainty grows with the set of candidate outcomes, the price in bits or nats we need to pay to remove the uncertainty increases too. In the example of the coin, we needed to pay 1 bit for a fair outcome and only about 0.47 bits when we already knew that P(Tail) = 0.9, because our uncertainty was quite a lot lower.
Analogously, cross-entropy is defined between two distributions p and q:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
Let's suppose that p is the data generating process we are working with. What's the meaning of H(p, q)? Considering the expression, what we're doing is computing the expected value $E_p\left[-\log q(x)\right]$, while the entropy computed the expected value $E_p\left[-\log p(x)\right]$. Therefore, if q is an approximation of p, the cross-entropy measures the additional uncertainty that we are introducing when using the model q instead of the original data generating process. It's not difficult to understand that we increase the uncertainty by only using an approximation of the true underlying process.
In fact, if we're training a classifier, our goal is to create a model whose distribution is as similar as possible to pdata. This condition can be achieved by minimizing the Kullback-Leibler divergence between the two distributions:

$$D_{KL}\left(p_{data} \,\|\, p_M\right) = \sum_{x} p_{data}(x) \log \frac{p_{data}(x)}{p_M(x)}$$
In the previous expression, pM is the distribution generated by the model. Now, if we rewrite the divergence, we get:

$$D_{KL}\left(p_{data} \,\|\, p_M\right) = \sum_{x} p_{data}(x) \log p_{data}(x) - \sum_{x} p_{data}(x) \log p_M(x) = -H\left(p_{data}\right) + H\left(p_{data}, p_M\right)$$
The first term is the entropy of the data generating distribution, and it doesn't depend on the model parameters, while the second one is the cross-entropy. Therefore, if we minimize the cross-entropy, we also minimize the Kullback-Leibler divergence, forcing the model to reproduce a distribution that is very similar to pdata. That means that we are reducing the additional uncertainty that was induced by the approximation. This is a very elegant explanation as to why the cross-entropy cost function is an excellent choice for classification problems.
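The decomposition H(p, q) = H(p) + D_KL(p || q) can be checked numerically; a sketch with hypothetical three-class distributions (the values are our own):

```python
import numpy as np

def entropy(p):
    """Entropy in nats of a discrete distribution with nonzero entries."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """Cross-entropy in nats between discrete distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p_data  = np.array([0.1, 0.6, 0.3])  # hypothetical data generating distribution
p_model = np.array([0.2, 0.5, 0.3])  # the model's approximation

# Both sides of the identity H(p, q) = H(p) + D_KL(p || q) agree.
print(cross_entropy(p_data, p_model))
print(entropy(p_data) + kl_divergence(p_data, p_model))
```

Since H(p_data) doesn't depend on the model, minimizing the cross-entropy and minimizing the divergence are equivalent.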