## Stochastic gradient descent

The method we've just seen of calculating gradient descent is often called **batch gradient descent**, because each update to the coefficients happens inside an iteration over all the data in a *single batch*. With very large amounts of data, each iteration can be time-consuming and waiting for convergence could take a very long time.

An alternative method of gradient descent is called **stochastic gradient descent** or **SGD**. In this method, the estimates of the coefficients are continually updated as the input data is processed. The update method for stochastic gradient descent looks like this:

$$\beta_j \leftarrow \beta_j - \alpha \frac{\partial}{\partial \beta_j} J(\beta)$$

In fact, this is identical to batch gradient descent. The difference in application is purely that the expression is calculated over a *mini-batch*, a random smaller subset of the overall data. The mini-batch size should be large enough to represent a fair sample of the input records; for our data, a reasonable mini-batch size might be about 250.
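The mini-batch procedure can be sketched as follows. This is a minimal illustration in NumPy, not the text's implementation: the function name, learning rate, and epoch count are assumptions, and a simple linear model with a squared-error cost stands in for whichever model the surrounding chapter fits. Each update applies the same gradient expression as batch gradient descent, but computed over a random mini-batch of 250 records.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=250, epochs=100, seed=0):
    """Fit y ~ X @ w + b by mini-batch stochastic gradient descent.

    Hypothetical sketch: the gradient formula is the same as in batch
    gradient descent, but each coefficient update is computed from a
    random mini-batch rather than the full dataset.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)  # coefficient estimates, updated continually
    b = 0.0          # intercept
    for _ in range(epochs):
        idx = rng.permutation(n)  # reshuffle once per pass over the data
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            err = Xb @ w + b - yb                  # residuals on this mini-batch only
            w -= lr * (Xb.T @ err) / len(batch)    # gradient of mean squared error w.r.t. w
            b -= lr * err.mean()                   # gradient w.r.t. the intercept
    return w, b
```

Because each update touches only `batch_size` rows, the coefficients start improving long before a full pass over the data completes, which is the practical advantage over batch gradient descent on large datasets.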

Stochastic gradient descent arrives at the...