Hands-On Machine Learning for Algorithmic Trading

By: Stefan Jansen

Overview of this book

The explosive growth of digital data has boosted the demand for expertise in trading strategies that use machine learning (ML). This book enables you to use a broad range of supervised and unsupervised algorithms to extract signals from a wide variety of data sources and create powerful investment strategies. This book shows how to access market, fundamental, and alternative data via API or web scraping and offers a framework to evaluate alternative data. You’ll practice the ML workflow from model design, loss metric definition, and parameter tuning to performance evaluation in a time series context. You will understand ML algorithms such as Bayesian and ensemble methods and manifold learning, and will know how to train and tune these models using pandas, statsmodels, sklearn, PyMC3, xgboost, lightgbm, and catboost. This book also teaches you how to extract features from text data using spaCy, classify news and assign sentiment scores, and to use gensim to model topics and learn word embeddings from financial reports. You will also build and evaluate neural networks, including RNNs and CNNs, using Keras and PyTorch to exploit unstructured data for sophisticated strategies. Finally, you will apply transfer learning to satellite images to predict economic activity and use reinforcement learning to build agents that learn to trade in the OpenAI Gym.
Table of Contents (23 chapters)

ML and algorithmic trading strategies

Quantitative strategies have evolved and become more sophisticated in three waves:

  1. In the 1980s and 1990s, signals often emerged from academic research and used a single or very few inputs derived from market and fundamental data. These signals, such as basic mean-reversion strategies, are now largely commoditized and available as ETFs.
  2. In the 2000s, factor-based investing proliferated. Funds used algorithms to identify assets exposed to risk factors like value or momentum to seek arbitrage opportunities. Redemptions during the early days of the financial crisis triggered the quant quake of August 2007 that cascaded through the factor-based fund industry. These strategies are now also available as long-only smart-beta funds that tilt portfolios according to a given set of risk factors.
  3. The third era is driven by investments in ML capabilities and alternative data to generate profitable signals for repeatable trading strategies. Factor decay is a major challenge: the excess returns from new anomalies have been shown to drop by a quarter from discovery to publication, and by over 50% after publication due to competition and crowding.

There are several categories of trading strategies that use algorithms to execute trading rules:

  • Short-term trades that aim to profit from small price movements, for example, due to arbitrage
  • Behavioral strategies that aim to capitalize on anticipating the behavior of other market participants
  • Programs that aim to optimize trade execution
  • A large group of trading strategies based on predicted pricing

The HFT funds discussed above most prominently rely on short holding periods to benefit from minor price movements based on bid-ask arbitrage or statistical arbitrage. Behavioral algorithms usually operate in lower liquidity environments and aim to anticipate moves by a larger player likely to significantly impact the price. The expectation of the price impact is based on sniffing algorithms that generate insights into other market participants' strategies, or market patterns such as forced trades by ETFs.

Trade-execution programs aim to limit the market impact of trades. They range from the simple slicing of trades to match the time-weighted average price (TWAP) or volume-weighted average price (VWAP) to more sophisticated approaches. Simple algorithms leverage historical patterns, whereas more sophisticated algorithms take into account transaction costs, implementation shortfall, or predicted price movements. These algorithms can operate at the security or portfolio level, for example, to implement multileg derivative or cross-asset trades.
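
The slicing idea can be illustrated with a minimal sketch. It is not an execution algorithm from the book: the function names, the order size, and the intraday volume profile are hypothetical and chosen only to show how a parent order might be split evenly over time (TWAP-style) or in proportion to expected volume (VWAP-style).

```python
from typing import List


def twap_slices(total_shares: int, n_intervals: int) -> List[int]:
    """Split an order into near-equal child orders, one per time interval."""
    base, remainder = divmod(total_shares, n_intervals)
    # Spread the remainder over the first intervals so the slices sum to the total.
    return [base + (1 if i < remainder else 0) for i in range(n_intervals)]


def vwap_slices(total_shares: int, volume_profile: List[float]) -> List[int]:
    """Split an order in proportion to an assumed historical volume profile."""
    total_volume = sum(volume_profile)
    raw = [total_shares * v / total_volume for v in volume_profile]
    slices = [int(x) for x in raw]  # round down first
    # Give the shares lost to rounding to the largest fractional parts.
    leftover = total_shares - sum(slices)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - slices[i], reverse=True)
    for i in order[:leftover]:
        slices[i] += 1
    return slices


def vwap(prices: List[float], volumes: List[float]) -> float:
    """Volume-weighted average price of a sequence of trades."""
    return sum(p * v for p, v in zip(prices, volumes)) / sum(volumes)


# Hypothetical U-shaped intraday volume profile (relative volume per interval).
profile = [4.0, 2.0, 1.0, 1.0, 2.0, 5.0]
even_schedule = twap_slices(10_000, len(profile))
volume_schedule = vwap_slices(10_000, profile)
```

Both schedules sum to the parent order size. The volume-based schedule concentrates trading in the intervals where the assumed profile expects the most liquidity.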

Use cases of ML for trading

ML extracts signals from a wide range of market, fundamental, and alternative data, and can be applied at all steps of the algorithmic trading-strategy process. Key applications include:

  • Data mining to identify patterns and extract features
  • Supervised learning to generate risk factors or alphas and create trade ideas
  • Aggregation of individual signals into a strategy
  • Allocation of assets according to risk profiles learned by an algorithm
  • The testing and evaluation of strategies, including through the use of synthetic data
  • The interactive, automated refinement of a strategy using reinforcement learning

We briefly highlight some of these applications and identify where we will demonstrate their use in later chapters.

Data mining for feature extraction

The cost-effective evaluation of large, complex datasets requires the detection of signals at scale. There are several examples throughout the book:

  • Information theory is a useful tool to extract features that capture potential signals and can be used in ML models. In Chapter 4, Alpha Factor Research, we use mutual information to assess the potential value of individual features for a supervised learning algorithm that predicts asset returns.
  • In Chapter 12, Unsupervised Learning, we introduce various techniques to create features from high-dimensional datasets. In Chapter 14, Topic Modeling, we apply these techniques to text data.
  • We emphasize model-specific ways to gain insights into the predictive power of individual variables. We use a novel game-theoretic approach called SHapley Additive exPlanations (SHAP) to attribute predictive performance to individual features in complex gradient boosting machines with a large number of input variables.
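
To make the mutual-information idea from the first item concrete, here is a minimal sketch of a binned (plug-in) estimator in plain Python. It is not the book's implementation: the quantile binning, the number of bins, and the simulated features are assumptions made for illustration, and dedicated estimators in ML libraries are generally preferable in practice.

```python
import math
import random
from collections import Counter
from typing import List, Sequence


def quantile_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Assign each value to one of n_bins roughly equal-sized rank buckets."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def mutual_information(x: Sequence[float], y: Sequence[float], n_bins: int = 5) -> float:
    """Plug-in estimate of I(X; Y) in nats, computed from binned data."""
    bx, by = quantile_bins(x, n_bins), quantile_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # Sum p(a, b) * log(p(a, b) / (p(a) * p(b))) over the observed cells.
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


# Simulated data (assumption): one feature related to returns, one pure noise.
rng = random.Random(0)
returns = [rng.gauss(0.0, 1.0) for _ in range(2000)]
informative = [r + rng.gauss(0.0, 1.0) for r in returns]
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]

mi_informative = mutual_information(informative, returns)
mi_noise = mutual_information(noise, returns)
```

Under these assumptions, the feature that shares variation with the simulated returns scores well above the noise feature, which is the kind of ranking a feature screen relies on.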

Supervised learning for alpha factor creation and aggregation

The main rationale for applying ML to trading is to obtain predictions of asset fundamentals, price movements or market conditions. A strategy can leverage multiple ML algorithms that build on each other. Downstream models can generate signals at the portfolio level by integrating predictions about the prospects of individual assets, capital market expectations, and the correlation among securities. Alternatively, ML predictions can inform discretionary trades as in the quantamental approach outlined above. ML predictions can also target specific risk factors, such as value or volatility, or implement technical approaches, such as trend following or mean reversion:

  • In Chapter 3, Alternative Data for Finance, we illustrate how to work with fundamental data to create inputs to ML-driven valuation models.
  • In Chapter 13, Working with Text Data, Chapter 14, Topic Modeling, and Chapter 15, Word Embeddings, we use alternative data on business reviews that can be used to project revenues for a company as an input for a valuation exercise.
  • In Chapter 8, Time Series Models, we demonstrate how to forecast macro variables as inputs to market expectations and how to forecast risk factors such as volatility.
  • In Chapter 18, Recurrent Neural Networks, we introduce recurrent neural networks (RNNs) that achieve superior performance with non-linear time series data.

Asset allocation

ML has been used to allocate portfolios based on decision-tree models that compute a hierarchical form of risk parity. As a result, risk characteristics are driven by patterns in asset prices rather than by asset classes, and the resulting portfolios achieve superior risk-return characteristics.

In Chapter 5, Strategy Evaluation, and Chapter 12, Unsupervised Learning, we illustrate how hierarchical clustering extracts data-driven risk classes that better reflect correlation patterns than conventional asset class definitions.
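
As a rough illustration of grouping assets by correlation rather than by label, the sketch below runs a simple single-linkage agglomerative clustering on a correlation-based distance. The return series, the asset names, the distance sqrt(0.5 * (1 - rho)), and the choice of single linkage are assumptions for this example, not necessarily the setup used in the chapters.

```python
import math
from typing import Dict, List, Sequence


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equally long return series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Map correlation to a distance: 0 for rho = 1, 1 for rho = -1."""
    return math.sqrt(max(0.0, 0.5 * (1.0 - correlation(a, b))))


def cluster_assets(returns: Dict[str, Sequence[float]], n_clusters: int) -> List[List[str]]:
    """Merge the two closest clusters (single linkage) until n_clusters remain."""
    clusters = [[name] for name in returns]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    correlation_distance(returns[a], returns[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical return series: two equity-like and two bond-like assets.
sample_returns = {
    "equity_a": [0.010, -0.020, 0.015, 0.030, -0.010, 0.020],
    "equity_b": [0.012, -0.018, 0.010, 0.025, -0.012, 0.022],
    "bond_a": [-0.002, 0.004, -0.001, -0.003, 0.002, -0.004],
    "bond_b": [-0.001, 0.005, -0.002, -0.004, 0.003, -0.003],
}
risk_classes = cluster_assets(sample_returns, n_clusters=2)
```

With this toy data the two clusters coincide with the equity-like and bond-like groups. On real data, the clusters need not line up with conventional asset classes.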

Testing trade ideas

Backtesting is a critical step in selecting successful algorithmic trading strategies. Cross-validation using synthetic data is a key ML technique to generate reliable out-of-sample results when combined with appropriate methods to correct for multiple testing. The time series nature of financial data requires modifications to the standard approach to avoid look-ahead bias or other contamination of the data used for training, validation, and testing. In addition, the limited availability of historical data has given rise to alternative approaches that use synthetic data:

  • We will demonstrate various methods to test ML models using market, fundamental, and alternative data that obtain sound estimates of out-of-sample errors.
  • In Chapter 20, Autoencoders and Generative Adversarial Nets, we present generative adversarial networks (GANs) that are capable of producing high-quality synthetic data.

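
One common way to adapt cross-validation to time series is a walk-forward split, in which every training window ends before its test window begins. The following is a minimal sketch of that idea; the function name and the `gap` parameter (observations dropped between training and testing to limit leakage from overlapping labels) are assumptions for illustration. scikit-learn's `TimeSeriesSplit` offers a comparable expanding-window splitter.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, n_splits: int, test_size: int, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs in which training always precedes testing."""
    for k in range(n_splits):
        # Test windows are consecutive blocks at the end of the sample.
        test_start = n_samples - (n_splits - k) * test_size
        train_end = test_start - gap  # leave `gap` observations unused
        if train_end <= 0:
            raise ValueError("not enough observations for the requested splits")
        yield list(range(train_end)), list(range(test_start, test_start + test_size))


splits = list(walk_forward_splits(n_samples=10, n_splits=3, test_size=2, gap=1))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

The training window expands with each split while the test window moves forward, so no observation used for fitting is later in time than the observations used for evaluation.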
Reinforcement learning

Trading takes place in a competitive, interactive marketplace. Reinforcement learning aims to train agents to learn a policy function based on rewards.

  • In Chapter 21, Reinforcement Learning, we present key reinforcement learning algorithms like Q-Learning and the Dyna architecture and demonstrate the training of reinforcement learning algorithms for trading using the OpenAI Gym environment.
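
To give a flavor of how an agent learns a policy from rewards, here is a minimal tabular Q-learning sketch. It does not use OpenAI Gym and is not the book's trading environment: the repeating price cycle, the transaction cost, and all hyperparameters are hypothetical choices that keep the example small and deterministic.

```python
import random
from typing import Dict, List, Tuple

PRICES = [1.0, 2.0, 3.0, 2.0]  # hypothetical repeating price cycle
ACTIONS = [0, 1]               # target position: 0 = flat, 1 = long
COST = 0.1                     # hypothetical cost per unit of position change

State = Tuple[int, int]        # (phase in the price cycle, current position)


def step(state: State, action: int) -> Tuple[State, float]:
    """Advance one period and return (next_state, reward)."""
    phase, position = state
    next_phase = (phase + 1) % len(PRICES)
    pnl = action * (PRICES[next_phase] - PRICES[phase])
    reward = pnl - COST * abs(action - position)
    return (next_phase, action), reward


def train(
    n_steps: int = 50_000,
    alpha: float = 0.2,
    gamma: float = 0.9,
    epsilon: float = 0.2,
    seed: int = 0,
) -> Dict[State, List[float]]:
    """Learn action values with epsilon-greedy tabular Q-learning."""
    rng = random.Random(seed)
    q = {(p, pos): [0.0, 0.0] for p in range(len(PRICES)) for pos in ACTIONS}
    state = (0, 0)
    for _ in range(n_steps):
        # Explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[state][a])
        next_state, reward = step(state, action)
        # Q-learning update: bootstrap from the best action in the next state.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


q_table = train()
policy = {s: max(ACTIONS, key=lambda a: q_table[s][a]) for s in q_table}
```

In this toy setting the learned policy holds a long position before the price rises and stays flat before it falls. Real markets are stochastic and non-stationary, so this only illustrates the update rule, not a viable strategy.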