In the last section, we got comfortable with the idea of supervised machine learning. Now, we will learn how exactly a machine learns under the hood. This section examines a common optimization technique used by many machine learning algorithms, called **hill climbing**. It is predicated on the fact that each problem has an ideal state and a way to measure how close or how far we are from it. It is important to note that not all machine learning algorithms use this approach.

First, we'll cover loss functions, and then, prior to diving into hill climbing and descent, we'll take a quick math refresher.

### Note

There's going to be some math in this lesson, and while we try to shy away from purely theoretical concepts, this is something that we simply have to get through in order to understand the guts of most of these algorithms. There will be a brief applied section at the end. Don't panic if you can't remember some of the calculus; simply try to grasp what is happening behind the black box.

So, as mentioned before, a machine learning algorithm has to measure how close it is to some objective. We define this as a cost function, or a loss function; sometimes, we hear it referred to as an objective function. Although not all machine learning algorithms are designed to directly minimize a loss function, we're going to learn the rule here rather than the exception. The point of a loss function is to determine the goodness of a model's fit. It is typically evaluated over the course of a model's learning procedure and converges when the model has maximized its learning capacity.

A typical loss function computes a scalar value from the true labels and the predicted labels. That is, given our actual *y* and our predicted *ŷ*, we have:

$$L(y, \hat{y}) \in \mathbb{R}$$

This notation might be cryptic, but all it means is that some function, *L*, which we're going to call our loss function, is going to accept the ground truth, *y*, and the predictions, *ŷ*, and return some scalar value. The typical formula for the loss function is a sum of per-example losses, *ℓ*, over the *n* training examples, given as follows:

$$L(y, \hat{y}) = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i)$$

So, I've listed several common loss functions here, which may or may not look familiar. **Sum of Squared Error** (**SSE**) is a metric that we're going to be using for our regression models:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
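The SSE metric described above can be sketched in a few lines; this is a minimal illustration, not the implementation any particular library uses:

```python
import numpy as np

def sse(y_true, y_pred):
    """Sum of squared errors between true labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))

# A perfect fit gives a loss of 0; errors grow quadratically.
print(sse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(sse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 4.0
```

Note how a single prediction that is off by 2 contributes 4 to the loss; squaring penalizes large errors disproportionately.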

Cross entropy is a very commonly used classification metric; for binary labels, it takes the following form:

$$L(y, \hat{y}) = -\sum_{i=1}^{n} \left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$
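A minimal sketch of binary cross entropy follows, assuming predictions are probabilities in *(0, 1)*; the clipping constant is a common numerical safeguard, not something from the text:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels in {0, 1} and
    predicted probabilities. Predictions are clipped away from
    0 and 1 to avoid taking log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    ))

# Confident, correct predictions incur less loss than uncertain ones.
print(cross_entropy([1, 0], [0.9, 0.1]))
print(cross_entropy([1, 0], [0.5, 0.5]))
```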

The *L* function is simply our loss over *y* and *ŷ*, given the parameters theta. So, for any algorithm, we want to find the set of theta parameters that minimize the loss:

$$\hat{\theta} = \underset{\theta}{\arg\min}\; L(y, \hat{y} \mid \theta)$$

That is, if we're predicting the cost of a house, for example, we may want to estimate the cost per square foot as accurately as possible so as to minimize how wrong we are.

Parameters are often in a much higher dimensional space than can be represented visually. So, the big question we're concerned with is the following: How can we minimize the cost? It is typically not feasible for us to attempt every possible value to determine the true minimum of a problem. So, we have to find a way to descend this nebulous hill of loss. The tough part is that, at any given point, we don't know whether the curve goes up or down without some kind of evaluation. And that's precisely what we want to avoid, because it's very expensive:
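A quick back-of-the-envelope count shows why attempting every possible value is infeasible; the grid size and parameter counts below are illustrative numbers, not from the text:

```python
# Brute-force search over a grid of candidate parameter values needs
# g ** k loss evaluations for g candidates per parameter and k
# parameters -- exponential growth in the number of parameters.
g = 100  # hypothetical candidate values per parameter
for k in (1, 2, 5, 10):
    print(f"{k} parameter(s) -> {g ** k:,} loss evaluations")
```

Even with a coarse grid of 100 values per parameter, ten parameters already demand 10^20 evaluations, which is why we need a smarter way down the hill.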

We can describe this problem as waking up in a pitch-black room with an uneven floor and trying to find the lowest point in the room. You don't know how big the room is. You don't know how deep or how high it gets. Where do you step first? One thing we can do is to examine exactly where we stand and determine which direction around us slopes downward. To do that, we have to measure the slope of the curve.

The following is a quick refresher on scalar derivatives. To compute the slope at any given point, the standard approach is to measure the slope of the secant line between the point we're interested in and a neighboring point some small distance, delta *x*, away:

$$\text{slope} \approx \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

As the distance delta *x* approaches *0*, or as our limit approaches *0*, we arrive at the slope of the curve. This is given by the following formula:

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$
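The limit definition above suggests a simple numerical approximation: take a very small, but nonzero, step. This forward-difference sketch is illustrative, not a production differentiation routine:

```python
def numeric_slope(f, x, dx=1e-6):
    """Approximate f'(x) with the secant slope over a tiny step dx."""
    return (f(x + dx) - f(x)) / dx

# For f(x) = x**2, calculus gives f'(x) = 2*x, so the slope at
# x = 3 should be close to 6.
print(numeric_slope(lambda x: x ** 2, 3.0))
```

Shrinking `dx` tightens the approximation, up to the point where floating-point round-off starts to dominate.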

There are several different notations that you may be familiar with. One is *f* prime of *x*. The slope of a constant is *0*. So, if *f(x)* is *9*, in other words, if *y* is simply *9*, it never changes. There is no slope. So, the slope is *0*, as shown:

$$f(x) = 9 \implies f'(x) = 0$$

We can also see the power rule in effect here in the second example. This will come in useful later on. If we multiply the variable by the power, and decrement the power by one, we get the following:

$$f(x) = x^n \implies f'(x) = n x^{n-1}$$

In order to measure the slope of a vector or a multi-dimensional surface, we will introduce the idea of partial derivatives, which are simply derivatives with respect to one variable, with all the other variables held constant. So, our solution is a vector of dimension *k*, where *k* is the number of variables that our function takes; in this case, we have *x* and *y*. Each respective position in the vector we solve is a derivative with respect to the corresponding positional variable of the function.

From a conceptual level, what we're doing is holding one of the variables still and changing the others around it to see how the slope changes. The denominator's notation indicates which variable we're measuring the slope with respect to. So, in this case, the first position, *∂f/∂x*, shows that we're taking the partial derivative of the function *f* with respect to *x*, where we hold *y* constant. Likewise, in the second one, we're taking the partial derivative of *f* with respect to *y*, holding *x* constant. What we get in the end is called a gradient, which is an important keyword. It is simply a vector of partial derivatives:

$$\nabla f(x, y) = \begin{bmatrix} \dfrac{\partial f}{\partial x} \\[6pt] \dfrac{\partial f}{\partial y} \end{bmatrix}$$
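The hold-one-variable-still idea translates directly into code: bump each variable by a tiny amount in turn and measure how the output changes. This is a minimal sketch, and the example function is hypothetical:

```python
def numeric_gradient(f, point, dx=1e-6):
    """Approximate the gradient of f at `point` by varying one
    variable at a time while holding the others constant."""
    point = list(point)
    grad = []
    for i in range(len(point)):
        bumped = point.copy()
        bumped[i] += dx  # nudge only the i-th variable
        grad.append((f(*bumped) - f(*point)) / dx)
    return grad

# For f(x, y) = x**2 * y, the partials are df/dx = 2*x*y and
# df/dy = x**2, so at (2, 3) we expect roughly [12, 4].
grad = numeric_gradient(lambda x, y: x ** 2 * y, [2.0, 3.0])
print(grad)
```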

We may want to get more complicated, though, and measure the slopes of multiple functions at the same time. All we'll end up with is a matrix of gradients along the rows. In the following formula, we can see the solution that we just solved from the previous example, written as a row:

$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} \end{bmatrix}$$

In the next formula, we have introduced a new function, called *g*. We see the gradient for the function *g*, with each position corresponding to the partial derivative with respect to the variables *x* and *y*:

$$\nabla g = \begin{bmatrix} \dfrac{\partial g}{\partial x} & \dfrac{\partial g}{\partial y} \end{bmatrix}$$

When we stack these together into a matrix, what we get is a Jacobian. You don't need to solve this by hand, but you should understand that what we're doing is taking the slope of a multi-dimensional surface. You can treat it as a bit of a black box as long as you understand that. This is exactly how we compute the gradient and the Jacobian:

$$J = \begin{bmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} \\[6pt] \dfrac{\partial g}{\partial x} & \dfrac{\partial g}{\partial y} \end{bmatrix}$$
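Stacking one numerically approximated gradient per function gives a Jacobian. The sketch below is self-contained and illustrative; the two example functions are hypothetical:

```python
def numeric_jacobian(funcs, point, dx=1e-6):
    """Approximate the Jacobian: one row per function, one column
    per variable, each entry a finite-difference partial derivative."""
    rows = []
    for f in funcs:
        row = []
        for i in range(len(point)):
            bumped = list(point)
            bumped[i] += dx  # nudge only the i-th variable
            row.append((f(*bumped) - f(*point)) / dx)
        rows.append(row)
    return rows

# f(x, y) = x**2 * y and g(x, y) = x + y**3 at the point (2, 3).
# Calculus gives the rows [2*x*y, x**2] = [12, 4] and
# [1, 3*y**2] = [1, 27], which the approximation should match closely.
J = numeric_jacobian(
    [lambda x, y: x ** 2 * y, lambda x, y: x + y ** 3],
    [2.0, 3.0],
)
print(J)
```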