Machine Learning for Algorithmic Trading - Second Edition

By: Stefan Jansen

Overview of this book

The explosive growth of digital data has boosted the demand for expertise in trading strategies that use machine learning (ML). This revised and expanded second edition enables you to build and evaluate sophisticated supervised, unsupervised, and reinforcement learning models. This book introduces end-to-end machine learning for the trading workflow, from the idea and feature engineering to model optimization, strategy design, and backtesting. It illustrates this by using examples ranging from linear models and tree-based ensembles to deep-learning techniques from cutting-edge research. This edition shows how to work with market, fundamental, and alternative data, such as tick data, minute and daily bars, SEC filings, earnings call transcripts, financial news, or satellite images to generate tradeable signals. It illustrates how to engineer financial features or alpha factors that enable an ML model to predict returns from price data for US and international stocks and ETFs. It also shows how to assess the signal content of new features using Alphalens and SHAP values and includes a new appendix with over one hundred alpha factor examples. By the end, you will be proficient in translating ML model predictions into a trading strategy that operates at daily or intraday horizons, and in evaluating its performance.

Preface

What to expect

What's new in the second edition

Who should read this book

What this book covers

To get the most out of this book

Get in touch

Machine Learning for Trading – From Idea to Execution

The rise of ML in the investment industry

Designing and executing an ML-driven strategy

ML for trading – strategies and use cases

Summary

Market and Fundamental Data – Sources and Techniques

Market data reflects its environment

Working with high-frequency data

API access to market data

How to work with fundamental data

Efficient data storage with pandas

Summary

Alternative Data for Finance – Categories and Use Cases

The alternative data revolution

Sources of alternative data

Criteria for evaluating alternative data

The market for alternative data

Working with alternative data

Summary

Financial Feature Engineering – How to Research Alpha Factors

Alpha factors in practice – from data to signals

Building on decades of factor research

Engineering alpha factors that predict returns

From signals to trades – Zipline for backtests

Separating signal from noise with Alphalens

Alpha factor resources

Summary

Portfolio Optimization and Performance Evaluation

How to measure portfolio performance

How to manage portfolio risk and return

Trading and managing portfolios with Zipline

Measuring backtest performance with pyfolio

Summary

The Machine Learning Process

How machine learning from data works

The machine learning workflow

Summary

Linear Models – From Risk Factors to Return Forecasts

From inference to prediction

The baseline model – multiple linear regression

How to run linear regression in practice

How to build a linear factor model

Regularizing linear regression using shrinkage

How to predict returns with linear regression

Linear classification

Summary

The ML4T Workflow – From Model to Strategy Backtesting

How to backtest an ML-driven strategy

Backtesting pitfalls and how to avoid them

How a backtesting engine works

backtrader – a flexible tool for local backtests

Zipline – scalable backtesting by Quantopian

Summary

Time-Series Models for Volatility Forecasts and Statistical Arbitrage

Tools for diagnostics and feature extraction

How to diagnose and achieve stationarity

Univariate time-series models

Multivariate time-series models

Cointegration – time series with a shared trend

Statistical arbitrage with cointegration

Summary

Bayesian ML – Dynamic Sharpe Ratios and Pairs Trading

How Bayesian machine learning works

Probabilistic programming with PyMC3

Bayesian ML for trading

Summary

Random Forests – A Long-Short Strategy for Japanese Stocks

Decision trees – learning rules from data

Random forests – making trees more reliable

Long-short signals for Japanese stocks

Summary

Boosting Your Trading Strategy

Getting started – adaptive boosting

Gradient boosting – ensembles for most tasks

Using XGBoost, LightGBM, and CatBoost

A long-short trading strategy with boosting

Boosting for an intraday strategy

Summary

Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning

Dimensionality reduction

PCA for trading

Clustering

Hierarchical clustering for optimal portfolios

Summary

Text Data for Trading – Sentiment Analysis

ML with text data – from language to features

From text to tokens – the NLP pipeline

Counting tokens – the document-term matrix

NLP for trading

Summary

Topic Modeling – Summarizing Financial News

Learning latent topics – Goals and approaches

Probabilistic latent semantic analysis

Latent Dirichlet allocation

Modeling topics discussed in earnings calls

Topic modeling for financial news

Summary

Word Embeddings for Earnings Calls and SEC Filings

How word embeddings encode semantics

How to use pretrained word vectors

Custom embeddings for financial news

word2vec for trading with SEC filings

Sentiment analysis using doc2vec embeddings

New frontiers – pretrained transformer models

Summary

Deep Learning for Trading

Deep learning – what's new and why it matters

Designing an NN

A neural network from scratch in Python

Popular deep learning libraries

Optimizing an NN for a long-short strategy

Summary

CNNs for Financial Time Series and Satellite Images

How CNNs learn to model grid-like data

CNNs for satellite images and object detection

CNNs for time-series data – predicting returns

Summary

RNNs for Multivariate Time Series and Sentiment Analysis

How recurrent neural nets work

RNNs for time series with TensorFlow 2

RNNs for text data

Summary

Autoencoders for Conditional Risk Factors and Asset Pricing

Autoencoders for nonlinear feature extraction

Implementing autoencoders with TensorFlow 2

A conditional autoencoder for trading

Summary

Generative Adversarial Networks for Synthetic Time-Series Data

Creating synthetic data with GANs

How to build a GAN using TensorFlow 2

TimeGAN for synthetic financial data

Summary

Deep Reinforcement Learning – Building a Trading Agent

Elements of a reinforcement learning system

How to solve reinforcement learning problems

Solving dynamic programming problems

Q-learning – finding an optimal policy on the go

Deep RL for trading with the OpenAI Gym

Summary

Conclusions and Next Steps

Key takeaways and lessons learned

ML for trading in practice

Conclusion

References

Index

Appendix: Alpha Factor Library

Common alpha factors implemented in TA-Lib

WorldQuant's quest for formulaic alphas

Bivariate and multivariate factor evaluation

What this book covers

This book provides a comprehensive introduction to how ML can add value to the design and execution of trading strategies. It is organized into four parts that cover different aspects of the data sourcing and strategy development process, as well as different solutions to various ML challenges.

Part 1 – Data, alpha factors, and portfolios

The first part covers fundamental aspects relevant across trading strategies that leverage machine learning. It focuses on the data that drives the ML algorithms and strategies discussed in this book, outlines how you can engineer features that capture the data's signal content, and explains how to optimize and evaluate the performance of a portfolio.

Chapter 1, Machine Learning for Trading – From Idea to Execution, summarizes how and why ML became important for trading, describes the investment process, and outlines how ML can add value.

Chapter 2, Market and Fundamental Data – Sources and Techniques, covers how to source and work with market data, including exchange-provided tick data, and reported financials. It also demonstrates access to numerous open source data providers that we will rely on throughout this book.

Chapter 3, Alternative Data for Finance – Categories and Use Cases, explains the categories of alternative data and the criteria for assessing the rapidly growing number of sources and providers. It also demonstrates how to create alternative datasets by scraping websites, for example, to collect earnings call transcripts for use with natural language processing (NLP) and sentiment analysis, which we cover in Part 3 of the book.

Chapter 4, Financial Feature Engineering – How to Research Alpha Factors, presents the process of creating and evaluating data transformations that capture the predictive signal and shows how to measure factor performance. It also summarizes insights from research into risk factors that aim to explain alpha in financial markets otherwise deemed to be efficient. Furthermore, it demonstrates how to engineer alpha factors using Python libraries offline and introduces the Zipline and Alphalens libraries to backtest factors and evaluate their predictive power.

Chapter 5, Portfolio Optimization and Performance Evaluation, introduces how to manage, optimize, and evaluate a portfolio resulting from the execution of a strategy. It presents risk metrics and shows how to apply them using the Zipline and pyfolio libraries. It also introduces methods to optimize a strategy from a portfolio risk perspective.
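To make the idea of a risk metric concrete, the following is a minimal plain-Python sketch of two widely used measures, the annualized Sharpe ratio and the maximum drawdown. It is an illustration under stated assumptions rather than the Zipline or pyfolio API: the function names, the default of 252 trading periods per year, and the sample returns are all hypothetical choices made for this example.

```python
import math
import statistics
from typing import Sequence


def annualized_sharpe(
    returns: Sequence[float],
    risk_free_rate: float = 0.0,   # annual rate; an assumption of this sketch
    periods_per_year: int = 252,   # daily bars assumed by default
) -> float:
    """Mean excess return per unit of volatility, scaled to one year."""
    excess = [r - risk_free_rate / periods_per_year for r in returns]
    mean = statistics.fmean(excess)
    vol = statistics.stdev(excess)  # sample standard deviation
    return mean / vol * math.sqrt(periods_per_year)


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the compounded return path (<= 0)."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = min(worst, wealth / peak - 1.0)
    return worst


# Hypothetical periodic returns, for illustration only.
sample_returns = [0.10, -0.50, 0.20]
print(annualized_sharpe(sample_returns, periods_per_year=12))
print(max_drawdown(sample_returns))
```

The Sharpe ratio rewards return relative to volatility, whereas the drawdown captures the worst cumulative loss an investor would have experienced, so the two are usually read together.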

Part 2 – ML for trading – Fundamentals

The second part illustrates how fundamental supervised and unsupervised learning algorithms can inform trading strategies in the context of an end-to-end workflow.

Chapter 6, The Machine Learning Process, sets the stage by outlining how to formulate, train, tune, and evaluate the predictive performance of ML models in a systematic way. It also addresses domain-specific concerns, such as using cross-validation with financial time series to select among alternative ML models.
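One reason cross-validation needs care with financial time series is that randomly shuffled folds let a model train on observations that come after the ones it is tested on. A common remedy is a walk-forward scheme in which each training window strictly precedes its test window. The sketch below shows the idea in plain Python; the function name, its parameters, and the optional `gap` between windows are assumptions made for this illustration and not necessarily the implementation used in the chapter.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_obs: int, train_size: int, test_size: int, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs where training always precedes testing.

    `gap` drops observations between the two windows, which is one way to keep
    labels computed over a forward horizon from overlapping the test period.
    """
    start = 0
    while start + train_size + gap + test_size <= n_obs:
        train_end = start + train_size
        test_start = train_end + gap
        yield (
            list(range(start, train_end)),
            list(range(test_start, test_start + test_size)),
        )
        start += test_size  # roll both windows forward by one test period


splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2, gap=1))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Each candidate model would be fit on `train_idx` and scored on `test_idx`, and the scores averaged across splits to choose among the alternatives.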

Chapter 7, Linear Models – From Risk Factors to Return Forecasts, shows how to use linear and logistic regression for inference and prediction and how to use regularization to manage the risk of overfitting. It demonstrates how to predict US equity returns or the direction of their future movements and how to evaluate the signal content of these predictions using Alphalens.

Chapter 8, The ML4T Workflow – From Model to Strategy Backtesting, integrates the various building blocks of the ML4T workflow thus far discussed separately. It presents an end-to-end perspective on the process of designing, simulating, and evaluating a trading strategy driven by an ML algorithm. To this end, it demonstrates how to backtest an ML-driven strategy in a historical market context using the Python libraries backtrader and Zipline.

Chapter 9, Time-Series Models for Volatility Forecasts and Statistical Arbitrage, covers univariate and multivariate time series diagnostics and models, including vector autoregressive models as well as ARCH/GARCH models for volatility forecasts. It also introduces cointegration and shows how to use it for a pairs trading strategy using a diverse set of exchange-traded funds (ETFs).

Chapter 10, Bayesian ML – Dynamic Sharpe Ratios and Pairs Trading, presents probabilistic models and how Markov chain Monte Carlo (MCMC) sampling and variational Bayes facilitate approximate inference. It also illustrates how to use PyMC3 for probabilistic programming to gain deeper insights into parameter and model uncertainty, for example, when evaluating portfolio performance.

Chapter 11, Random Forests – A Long-Short Strategy for Japanese Stocks, shows how to build, train, and tune nonlinear tree-based models for insight and prediction. It introduces tree-based ensembles and shows how random forests use bootstrap aggregation to overcome some of the weaknesses of decision trees. We then proceed to develop and backtest a long-short strategy for Japanese equities.

Chapter 12, Boosting Your Trading Strategy, introduces gradient boosting and demonstrates how to use the libraries XGBoost, LightGBM, and CatBoost for high-performance training and prediction. It reviews how to tune the numerous hyperparameters and interpret the model using SHapley Additive exPlanation (SHAP) values before building and evaluating a strategy that trades US equities based on LightGBM return forecasts.

Chapter 13, Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning, shows how to use dimensionality reduction and clustering for algorithmic trading. It uses principal and independent component analysis to extract data-driven risk factors and generate eigenportfolios. It presents several clustering techniques and demonstrates the use of hierarchical clustering for asset allocation.

Part 3 – Natural language processing

Part 3 focuses on text data and introduces state-of-the-art unsupervised learning techniques to extract high-quality signals from this key source of alternative data.

Chapter 14, Text Data for Trading – Sentiment Analysis, demonstrates how to convert text data into a numerical format and applies the classification algorithms from Part 2 for sentiment analysis to large datasets.

Chapter 15, Topic Modeling – Summarizing Financial News, uses unsupervised learning to extract topics that summarize a large number of documents and offer more effective ways to explore text data or use topics as features for a classification model. It demonstrates how to apply this technique to earnings call transcripts sourced in Chapter 3 and to annual reports filed with the Securities and Exchange Commission (SEC).

Chapter 16, Word Embeddings for Earnings Calls and SEC Filings, uses neural networks to learn state-of-the-art language features in the form of word vectors that capture semantic context much better than traditional text features and represent a very promising avenue for extracting trading signals from text data.

Part 4 – Deep and reinforcement learning

Part 4 introduces deep learning and reinforcement learning.

Chapter 17, Deep Learning for Trading, introduces TensorFlow 2 and PyTorch, the most popular deep learning frameworks, which we will use throughout Part 4. It presents techniques for training and tuning, including regularization. It also builds and evaluates a trading strategy for US equities.

Chapter 18, CNNs for Financial Time Series and Satellite Images, covers convolutional neural networks (CNNs) that are very powerful for classification tasks with unstructured data at scale. We will introduce successful architectural designs, train a CNN on satellite data (for example, to predict economic activity), and use transfer learning to speed up training. We'll also replicate a recent idea to convert financial time series into a two-dimensional image format to leverage the built-in assumptions of CNNs.

Chapter 19, RNNs for Multivariate Time Series and Sentiment Analysis, shows how recurrent neural networks (RNNs) are useful for sequence-to-sequence modeling, including for predicting univariate and multivariate time series. It demonstrates how RNNs capture nonlinear patterns over longer periods and uses the word embeddings introduced in Chapter 16 to predict returns based on the sentiment expressed in SEC filings.

Chapter 20, Autoencoders for Conditional Risk Factors and Asset Pricing, covers autoencoders for the nonlinear compression of high-dimensional data. It implements a recent paper that uses a deep autoencoder to learn both risk factor returns and factor loadings from the data while conditioning the latter on asset characteristics. We'll create a large US equity dataset with metadata and generate predictive signals.

Chapter 21, Generative Adversarial Networks for Synthetic Time-Series Data, presents one of the most exciting advances in deep learning. Generative adversarial networks (GANs) are capable of learning to produce synthetic replicas of a target data type, such as images of celebrities. In addition to images, GANs have also been applied to time-series data. This chapter replicates a novel approach to generating synthetic stock price data that could be used to train an ML model or backtest a strategy, and also evaluates the quality of that data.

Chapter 22, Deep Reinforcement Learning – Building a Trading Agent, presents how reinforcement learning (RL) permits the design and training of agents that learn to optimize decisions over time in response to their environment. You will see how to create a custom trading environment and build an agent that responds to market signals using OpenAI Gym.

Chapter 23, Conclusions and Next Steps, summarizes the lessons learned and outlines several steps you can take to continue learning and building your own trading strategies.

Appendix, Alpha Factor Library, lists almost 200 popular financial features, explains their rationale, and shows how to compute them. It also evaluates and compares their performance in predicting daily stock returns.
