
Measuring model loss


We saw earlier how we could optimize parameters by minimizing some distance function, D. This distance function, also called the loss function, is the performance measure by which we evaluate possible functions. In machine learning, a loss function measures how badly the model performs. A high loss goes hand in hand with low accuracy, whereas a low loss means the model is doing well.

In this case, our issue is a binary classification problem. Because of that, we will be using the binary cross-entropy loss, as we can see in the following formula:

$$D_{BCE}(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right]$$
Let's go through this formula step by step:

  • $D_{BCE}$: This is the distance function for binary cross-entropy loss.
  • $-\frac{1}{N}\sum_{i=1}^{N}$: The loss over a batch of N examples is the average loss of all examples.
  • $y_i \log \hat{y}_i$: This part of the loss only comes into play if the true value, $y_i$, is 1. If $y_i$ is 1, we want the prediction, $\hat{y}_i$, to be as close to 1 as possible, so that we achieve a low loss.
  • $(1 - y_i) \log(1 - \hat{y}_i)$: This part of the loss comes into play if $y_i$ is 0. If so, we want $\hat{y}_i$ to be close to 0 as well.

In Python, this loss function is implemented as follows:

def bce_loss(y,y_hat):
    # Average the binary cross-entropy over the batch of N examples
    N = y.shape[0]
    loss = -1/N * np.sum(y*np.log(y_hat) + (1 - y)*np.log(1-y_hat))
    return loss

The output, A, of our logistic regressor is equal to $\hat{y}$, so we can calculate the binary cross-entropy loss as follows:

loss = bce_loss(y,A)
print(loss)
out: 
0.82232258208779863

As we can see, this is quite a high loss, so we should now look at how we can improve our model. The goal is to bring this loss to zero, or at least as close to zero as possible.
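To put this loss value in perspective, consider a model that ignores the input entirely and always predicts $\hat{y}_i = 0.5$, a coin flip. Plugging this into the formula above gives:

$$D_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log 0.5 + (1 - y_i) \log 0.5\right] = -\log 0.5 \approx 0.693$$

Our initial loss of about 0.822 is therefore worse than blind guessing, so there is plenty of room for improvement.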

You can think of losses with respect to different function hypotheses as a surface, sometimes also called the "loss surface." The loss surface is a lot like a mountain range, as we have high points on the mountain tops and low points in valleys.

Our goal is to find the absolute lowest point in the mountain range: the deepest valley, or the "global minimum." A global minimum is a point in the function hypothesis space at which the loss is at the lowest point.

A "local minimum," by contrast, is the point at which the loss is lower than in the immediately surrounding space. Local minima are problematic because while they might seem like a good function to use at face value, there are much better functions available. Keep this in mind as we now walk through gradient descent, a method for finding a minimum in our function space.

Gradient descent

Now that we know how we judge our candidate models, how do we tweak the parameters to obtain better models? The most popular optimization algorithm for neural networks is called gradient descent. Within this method, we slowly move along the slope, the derivative, of the loss function.

Imagine you are in a mountain forest on a hike, and you're at a point where you've lost the track and are now in the woods trying to find the bottom of the valley. The problem here is that because there are so many trees, you cannot see the valley's bottom, only the ground under your feet.

Now ask yourself this: how would you find your way down? One sensible approach would be to follow the slope, and where the slope goes downwards, you go. This is the same approach that is taken by a gradient descent algorithm.

To bring it back to our focus, in this forest situation the loss function is the mountain, and to get to a low loss, the algorithm follows the slope, that is, the derivative, of the loss function. When we walk down the mountain, we are updating our location coordinates.

The algorithm updates the parameters of the neural network, as shown in the following diagram:

Gradient descent

Gradient descent requires that the loss function has a derivative with respect to the parameters that we want to optimize. This will work well for most supervised learning problems, but things become more difficult when we want to tackle problems for which there is no obvious derivative.

Gradient descent can also only optimize the parameters of our model, that is, its weights and biases. What it cannot do is optimize how many layers our model has or which activation functions it should use, since there is no way to compute the gradient with respect to the model topology.

These settings, which cannot be optimized by gradient descent, are called hyperparameters and are usually set by humans. You just saw how we gradually reduce the loss, but how do we update the parameters? To this end, we're going to need another method called backpropagation.

Backpropagation

Backpropagation allows us to apply gradient descent updates to the parameters of a model. To update the parameters, we need to calculate the derivative of the loss function with respect to the weights and biases.

If you imagine that the parameters of our model are like the geo-coordinates in our mountain forest analogy, then calculating the loss derivative with respect to a parameter is like checking the slope to the north to see whether you should walk north or south.
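In code, checking the slope like this corresponds to a finite-difference estimate of the derivative. The following is a tiny illustration using a made-up, hypothetical one-parameter loss function, not the model we are building:

loss_fn = lambda p: (p - 2.0)**2  # hypothetical toy loss with its minimum at p = 2

p = 0.5
h = 1e-6
# Nudge p slightly in both directions and see how the loss changes
slope = (loss_fn(p + h) - loss_fn(p - h)) / (2*h)
print(slope)  # roughly -3.0: the slope is negative, so increasing p lowers the loss

In practice, backpropagation computes such derivatives analytically rather than by nudging each parameter in turn, which would be far too slow for models with many parameters.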

The following diagram shows the forward and backward pass through a logistic regressor:

Forward and backward pass through a logistic regressor

To keep things simple, we refer to the derivative of the loss function with respect to any variable as the d variable. For example, we'll write the derivative of the loss function with respect to the weights as dW.

To calculate the gradient with respect to different parameters of our model, we can make use of the chain rule. You might remember the chain rule as the following:

$$(f(g(x)))' = g'(x) \cdot f'(g(x))$$

This is also sometimes written as follows:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
The chain rule basically says that if you want to take the derivative through a number of nested functions, you multiply the derivative of the inner function with the derivative of the outer function.

This is useful because neural networks, and our logistic regressor, are nested functions. The input goes through the linear step, a function of input, weights, and biases; and the output of the linear step, z, goes through the activation function.
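To make this concrete, here is a quick numerical sanity check of the chain rule, using hypothetical toy functions rather than our actual model:

g = lambda x: 3*x + 1   # inner function
f = lambda u: u**2      # outer function

x = 2.0
# Chain rule: derivative of f(g(x)) is f'(g(x)) * g'(x) = 2*(3x + 1) * 3
analytic = 2*(3*x + 1) * 3

# Finite-difference approximation of the same derivative
h = 1e-6
numerical = (f(g(x + h)) - f(g(x - h))) / (2*h)

print(analytic, numerical)  # both come out at approximately 42.0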

So, when we compute the loss derivative with respect to weights and biases, we'll first compute the loss derivative with respect to the output of the linear step, z, and use it to compute dW. Within the code, it looks like this:

dz = (A - y)                                  # loss derivative with respect to z
dW = 1/N * np.dot(X.T,dz)                     # loss derivative with respect to the weights
db = 1/N * np.sum(dz,axis=0,keepdims=True)    # loss derivative with respect to the bias
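It may not be obvious why dz works out to simply A - y. This is a well-known property of pairing a sigmoid output with the binary cross-entropy loss, and it follows from the chain rule. The two factors are:

$$\frac{\partial D_{BCE}}{\partial \hat{y}_i} = -\frac{1}{N}\left(\frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i}\right), \qquad \frac{\partial \hat{y}_i}{\partial z_i} = \hat{y}_i(1 - \hat{y}_i)$$

Multiplying them out, most terms cancel and we are left with $\frac{\partial D_{BCE}}{\partial z_i} = \frac{1}{N}(\hat{y}_i - y_i)$. In the code, the factor $\frac{1}{N}$ is applied in the dW and db lines rather than in dz.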

Parameter updates

Now we have the gradients, how do we improve our model? Going back to our mountain analogy, now that we know that the mountain goes up in the north and east directions, where do we go? To the south and west, of course!

Mathematically speaking, we go in the opposite direction to the gradient. If the gradient is positive with respect to a parameter, that is, the slope is upward, then we reduce the parameter. If it is negative, that is, downward sloping, we increase it. The steeper the slope, the bigger the step we take.

The update rule for a parameter, p, then goes like this:

$$p \leftarrow p - \alpha \cdot dp$$

Here p is a model parameter (either a weight or a bias), dp is the loss derivative with respect to p, and $\alpha$ is the learning rate.

The learning rate is something akin to the gas pedal of a car: it determines how strongly we apply the gradient updates. It is one of those hyperparameters that we have to set manually, and something we will discuss in the next chapter.

Within the code, our parameter updates look like this:

alpha = 1
W -= alpha * dW
b -= alpha * db

Putting it all together

Well done! We've now looked at all the parts that are needed in order to train a neural network. Over the next few steps in this section, we will be training a one-layer neural network, which is also called a logistic regressor.

Firstly, we'll import numpy before we define the data. We can do this by running the following code:

import numpy as np
np.random.seed(1)

X = np.array([[0,1,0],
              [1,0,0],
              [1,1,1],
              [0,1,1]])

y = np.array([[0,1,1,0]]).T

The next step is for us to define the sigmoid activation function and loss function, which we can do with the following code:

def sigmoid(x):
    return 1/(1+np.exp(-x))

def bce_loss(y,y_hat):
    N = y.shape[0]
    loss = -1/N * np.sum((y*np.log(y_hat) + (1 - y)*np.log(1-y_hat)))
    return loss

We'll then randomly initialize our model, drawing the weights uniformly from the interval [-1, 1) and setting the bias to zero, which we can achieve with the following code:

W = 2*np.random.random((3,1)) - 1
b = 0

As part of this process, we also need to set some hyperparameters. The first one is alpha, which we will just set to 1 here. Alpha is best understood as the step size. A large alpha means that while our model will train quickly, it might also overshoot the target. A small alpha, in comparison, allows gradient descent to tread more carefully and find small valleys it would otherwise shoot over.

The second one is the number of times we want to run the training process, also called the number of epochs we want to run. We can set the parameters with the following code:

alpha = 1
epochs = 20

Since it is used in the training loop, it's also useful to define the number of samples in our data. We'll also define an empty array in order to keep track of the model's losses over time. To achieve this, we simply run the following:

N = y.shape[0]
losses = []

Now we come to the main training loop:

for i in range(epochs):
    # Forward pass
    z = X.dot(W) + b 
    A = sigmoid(z)
    
    # Calculate loss
    loss = bce_loss(y,A)
    print('Epoch:',i,'Loss:',loss)
    losses.append(loss)
    
    # Calculate derivatives
    dz = (A - y)
    dW = 1/N * np.dot(X.T,dz)
    db = 1/N * np.sum(dz,axis=0,keepdims=True)    
    
    # Parameter updates
    W -= alpha * dW
    b -= alpha * db

As a result of running the previous code, we would get the following output:

out: 
Epoch: 0 Loss: 0.822322582088
Epoch: 1 Loss: 0.722897448125
Epoch: 2 Loss: 0.646837651208
Epoch: 3 Loss: 0.584116122241
Epoch: 4 Loss: 0.530908161024
Epoch: 5 Loss: 0.48523717872
Epoch: 6 Loss: 0.445747750118
Epoch: 7 Loss: 0.411391164148
Epoch: 8 Loss: 0.381326093762
Epoch: 9 Loss: 0.354869998127
Epoch: 10 Loss: 0.331466036109
Epoch: 11 Loss: 0.310657702141
Epoch: 12 Loss: 0.292068863232
Epoch: 13 Loss: 0.275387990352
Epoch: 14 Loss: 0.260355695915
Epoch: 15 Loss: 0.246754868981
Epoch: 16 Loss: 0.234402844624
Epoch: 17 Loss: 0.22314516463
Epoch: 18 Loss: 0.21285058467
Epoch: 19 Loss: 0.203407060401

You can see that over the course of the output, the loss steadily decreases, starting at 0.822322582088 and ending at 0.203407060401.
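As an extra sanity check, which is not part of the original walkthrough, we can use the trained parameters to produce predictions and compare them with the true labels. This assumes the variables from the training loop are still in scope:

# Forward pass with the trained parameters
A = sigmoid(X.dot(W) + b)

# Threshold the predicted probabilities at 0.5 to get class labels
predictions = (A > 0.5).astype(int)
print(np.hstack([predictions, y]))  # predicted class next to true label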

We can also plot the loss curve to get a better look at it. To do this, we simply run the following code:

import matplotlib.pyplot as plt
plt.plot(losses)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()

This will then output the following chart:

The output of the previous code, showing the loss decreasing over time
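As a follow-up experiment, and this is a hypothetical exercise rather than part of the original walkthrough, you could wrap the training loop in a function and compare loss curves for different learning rates, reusing the sigmoid and bce_loss functions defined earlier:

def train(alpha, epochs=20):
    # Re-initialize the model with the same seed for a fair comparison
    np.random.seed(1)
    W = 2*np.random.random((3,1)) - 1
    b = 0
    losses = []
    for i in range(epochs):
        A = sigmoid(X.dot(W) + b)
        losses.append(bce_loss(y,A))
        dz = (A - y)
        W -= alpha * 1/N * np.dot(X.T,dz)
        b -= alpha * 1/N * np.sum(dz,axis=0,keepdims=True)
    return losses

for alpha in [0.1, 1, 10]:
    plt.plot(train(alpha), label='alpha = ' + str(alpha))
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()

A very small alpha descends slowly, while a very large one can overshoot the minimum and make the loss oscillate or even diverge.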