
How to work with fundamental data

Fundamental data pertains to the economic drivers that determine the value of securities. The nature of the data depends on the asset class:

  • For equities and corporate credit, it includes corporate financials, as well as industry and economy-wide data.
  • For government bonds, it includes international macro data and foreign exchange.
  • For commodities, it includes asset-specific supply-and-demand determinants, such as weather data for crops.

We will focus on equity fundamentals for the U.S., where data is easier to access. More than 13,000 public companies worldwide generate some 2 million pages of annual reports and over 30,000 hours of earnings calls each year. In algorithmic trading, fundamental data and features engineered from this data may be used to derive trading signals directly, for example, as value indicators, and are an essential input for predictive models, including ML models.

Financial statement data

The Securities and Exchange Commission (SEC) requires U.S. issuers—that is, listed companies and securities, including mutual funds—to file three quarterly financial statements (Form 10-Q) and one annual report (Form 10-K), in addition to various other regulatory filing requirements.

Since the early 1990s, the SEC has made these filings available through its Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. They constitute the primary data source for the fundamental analysis of equity and other securities, such as corporate credit, where the value depends on the business prospects and financial health of the issuer.

Automated processing – XBRL

Automated analysis of regulatory filings has become much easier since the SEC introduced XBRL, which is a free, open, and global standard for the electronic representation and exchange of business reports. XBRL is based on XML; it relies on taxonomies that define the meaning of the elements of a report and map to tags that highlight the corresponding information in the electronic version of the report. One such taxonomy represents the U.S. Generally Accepted Accounting Principles (GAAP).
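As a minimal sketch of what this tagging enables, the following snippet reads one tagged fact from a local copy of an XBRL instance document; the file name and the taxonomy namespace version are placeholders, not part of the SEC datasets used below:

from lxml import etree

# Parse a (hypothetical) local copy of an XBRL instance document
tree = etree.parse('aapl-20180331.xml')
# The namespace URI depends on the US GAAP taxonomy release year
ns = {'us-gaap': 'http://fasb.org/us-gaap/2017-01-31'}
# Each tagged fact carries a context reference and a value
for fact in tree.findall('.//us-gaap:EarningsPerShareDiluted', ns):
    print(fact.get('contextRef'), fact.text)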

The SEC introduced voluntary XBRL filings in 2005 in response to accounting scandals before requiring this format for all filers as of 2009, and it continues to expand the mandatory coverage to other regulatory filings. The SEC maintains a website that lists the current taxonomies that shape the content of different filings and can be used to extract specific items.

The following datasets provide information extracted from EX-101 attachments submitted to the commission in a flattened data format to assist users in consuming data for analysis. The data reflects selected information from the XBRL-tagged financial statements. It currently includes numeric data from the quarterly and annual financial statements, as well as certain additional fields, for example, Standard Industrial Classification (SIC).

There are several avenues to track and access fundamental data reported to the SEC:

  • As part of the EDGAR Public Dissemination Service (PDS), electronic feeds of accepted filings are available for a fee.
  • The SEC updates the RSS feeds, which list the structured disclosure submissions, every 10 minutes.
  • There are public index files for the retrieval of all filings for automated processing; originally served over FTP, they are now available over HTTPS (see the sketch after this list).
  • The financial statement (and notes) datasets contain parsed XBRL data from all financial statements and the accompanying notes.
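
The following is a minimal sketch of retrieving one of these index files, assuming the quarterly full-index layout on EDGAR; the SEC's fair-access policy asks automated clients to identify themselves via the User-Agent header:

import requests

# Quarterly index of filings organized by form type (URL pattern assumed)
INDEX_URL = 'https://www.sec.gov/Archives/edgar/full-index/2018/QTR3/form.idx'
headers = {'User-Agent': 'your-name your.email@example.com'}
index = requests.get(INDEX_URL, headers=headers).text
# Each line lists form type, company name, CIK, date filed, and file name
for line in index.splitlines():
    if line.startswith('10-K '):
        print(line)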

The SEC also publishes log files containing the internet search traffic for EDGAR filings through SEC.gov, albeit with a six-month delay.

Building a fundamental data time series

The scope of the data in the financial statement and notes datasets consists of numeric data extracted from the primary financial statements (balance sheet, income statement, cash flows, changes in equity, and comprehensive income) and footnotes on those statements. The available data is from as early as 2009.

Extracting the financial statements and notes dataset

The following code downloads and extracts all historical filings contained in the financial statement and notes (FSN) datasets for the given range of quarters (refer to edgar_xbrl.ipynb for additional details):

from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

import requests

data_path = Path('data')  # destination directory; adjust as needed
SEC_URL = 'https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/'

first_year, this_year, this_quarter = 2014, 2018, 3
past_years = range(first_year, this_year)
filing_periods = [(y, q) for y in past_years for q in range(1, 5)]
filing_periods.extend([(this_year, q) for q in range(1, this_quarter + 1)])

for yr, qtr in filing_periods:
    filing = f'{yr}q{qtr}_notes.zip'
    path = data_path / f'{yr}_{qtr}' / 'source'
    path.mkdir(parents=True, exist_ok=True)  # create the target directory
    response = requests.get(SEC_URL + filing).content
    with ZipFile(BytesIO(response)) as zip_file:
        zip_file.extractall(path)

The data is fairly large, and to enable faster access than the original text files permit, it is better to convert them to the binary, columnar Parquet format (refer to the Efficient data storage with pandas section later in this chapter for a performance comparison of various data-storage options that are compatible with pandas DataFrames):

import pandas as pd

for f in data_path.glob('**/*.tsv'):
    # Save each tab-delimited source file in Parquet format
    file_name = f.stem + '.parquet'
    path = f.parents[1] / 'parquet'
    path.mkdir(exist_ok=True)
    df = pd.read_csv(f, sep='\t', encoding='latin1', low_memory=False)
    df.to_parquet(path / file_name)

For each quarter, the FSN data is organized into eight file sets that contain information about submissions, numbers, taxonomy tags, presentation, and more. Each dataset consists of rows and fields and is provided as a tab-delimited text file:

File   Dataset        Description
SUB    Submission     Identifies each XBRL submission by company, form, date, and so on
TAG    Tag            Defines and explains each taxonomy tag
DIM    Dimension      Adds detail to numeric and plain text data
NUM    Numeric        One row for each distinct data point in a filing
TXT    Plain text     Contains all non-numeric XBRL fields
REN    Rendering      Information for rendering on the SEC website
PRE    Presentation   Details of tag and number presentation in primary statements
CAL    Calculation    Shows the arithmetic relationships among tags
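
To illustrate how these file sets relate, the accession number (adsh) links every record back to its submission; a minimal sketch, assuming the Parquet files created above for 2018Q3:

sub = pd.read_parquet(data_path / '2018_3' / 'parquet' / 'sub.parquet')
num = pd.read_parquet(data_path / '2018_3' / 'parquet' / 'num.parquet')
# Attach company name and form type to each numeric fact via adsh
facts = num.merge(sub[['adsh', 'name', 'form']], on='adsh')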

Retrieving all quarterly Apple filings

The submission dataset contains the unique identifiers required to retrieve the filings: the Central Index Key (CIK) and the Accession Number (adsh). The following shows some of the information from Apple's 10-Q filing for the second quarter of its 2018 fiscal year:

# sub holds one quarter's submission dataset, loaded from its sub.parquet file
apple = sub[sub.name == 'APPLE INC'].T.dropna().squeeze()
key_cols = ['name', 'adsh', 'cik', 'sic', 'countryba',
            'stprba', 'cityba', 'zipba', 'bas1', 'form',
            'period', 'fy', 'fp', 'filed']
apple.loc[key_cols]
name                    APPLE INC
adsh                    0000320193-18-000070
cik                     320193
sic                     3571
countryba               US
stprba                  CA
cityba                  CUPERTINO
zipba                   95014
bas1                    ONE APPLE PARK WAY
form                    10-Q
period                  20180331
fy                      2018
fp                      Q2
filed                   20180502

Using the CIK, we can identify all of the historical quarterly filings available for Apple and combine this information to obtain 15 10-Q forms and 4 annual 10-K forms:

aapl_subs = pd.DataFrame()
for sub_file in data_path.glob('**/sub.parquet'):
    # Collect Apple's 10-Q and 10-K submissions across all quarters
    sub = pd.read_parquet(sub_file)
    aapl_sub = sub[(sub.cik.astype(int) == int(apple.cik)) &
                   (sub.form.isin(['10-Q', '10-K']))]
    aapl_subs = pd.concat([aapl_subs, aapl_sub])
aapl_subs.form.value_counts()
10-Q    15
10-K     4

With the accession number for each filing, we can now rely on the taxonomies to select the appropriate XBRL tags (listed in the TAG file) from the NUM and TXT files to obtain the numerical or textual/footnote data points of interest.
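For example, we can search the TAG file's labels for candidate tags; a minimal sketch, assuming the Parquet layout created above (the tlabel field holds each tag's human-readable label):

tag = pd.read_parquet(data_path / '2018_3' / 'parquet' / 'tag.parquet')
# Tags whose label mentions "per share", such as EarningsPerShareDiluted
eps_tags = tag[tag.tlabel.str.contains('per share', case=False, na=False)]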

First, let's extract all of the numerical data that is available from the 19 Apple filings:

aapl_nums = pd.DataFrame()
for num_file in data_path.glob('**/num.parquet'):
    # Collect the numeric facts reported in Apple's filings
    num = pd.read_parquet(num_file).drop('dimh', axis=1)
    aapl_num = num[num.adsh.isin(aapl_subs.adsh)]
    aapl_nums = pd.concat([aapl_nums, aapl_num])
aapl_nums.ddate = pd.to_datetime(aapl_nums.ddate, format='%Y%m%d')
aapl_nums.shape
(28281, 16)

Building a price/earnings time series

In total, these 19 filings provide us with over 28,000 numerical values. We can select a useful field, such as diluted earnings per share (EPS), and combine it with market data to calculate the popular price-to-earnings (P/E) valuation ratio.

However, we need to take into account that Apple split its stock 7:1 on June 4, 2014, and adjust the pre-split earnings per share values to make them comparable to the price data, which, in its adjusted form, already accounts for the split. The following code block shows how to adjust the earnings data:

field = 'EarningsPerShareDiluted'
stock_split = 7
split_date = pd.to_datetime('20140604')
# Filter by tag; keep only values measuring 1 quarter
eps = aapl_nums[(aapl_nums.tag == field)
                & (aapl_nums.qtrs == 1)].drop('tag', axis=1)
# Keep only the most recent data point from each filing
eps = eps.groupby('adsh').apply(lambda x: x.nlargest(n=1, columns=['ddate']))
# Adjust earnings prior to the stock split downward
eps.loc[eps.ddate < split_date, 'value'] = eps.loc[
    eps.ddate < split_date, 'value'].div(stock_split)
# Index by date and ensure chronological order
eps = eps[['ddate', 'value']].set_index('ddate').squeeze().sort_index()
# Create trailing 12-month EPS from the quarterly data
eps = eps.rolling(4, min_periods=4).sum().dropna()

We can use Quandl to obtain Apple stock prices from the start of our EPS series (note that the quandl source requires an API key, for example, via the QUANDL_API_KEY environment variable):

import pandas_datareader.data as web

symbol = 'AAPL.US'
aapl_stock = web.DataReader(symbol, 'quandl', start=eps.index.min())
# Resample to calendar days so the dates align with the EPS data
aapl_stock = aapl_stock.resample('D').last()

Now we have the data to compute the trailing 12-month P/E ratio for the entire period:

pe = aapl_stock.AdjClose.to_frame('price').join(eps.to_frame('eps'))
# Forward-fill the quarterly EPS values to daily frequency
pe = pe.ffill().dropna()
pe['P/E Ratio'] = pe.price.div(pe.eps)
axes = pe.plot(subplots=True, figsize=(16, 8), legend=False, lw=2)

We get the following plot from the preceding code:

Figure 2.11: Trailing P/E ratio from EDGAR filings

Other fundamental data sources

There are numerous other sources for fundamental data. Many are accessible using the pandas_datareader module that was introduced earlier. Additional data is available from certain organizations directly, such as the IMF, the World Bank, or major national statistical agencies around the world (refer to the references section on GitHub).

pandas-datareader – macro and industry data

The pandas-datareader library facilitates access according to the conventions introduced at the end of the preceding section on market data. It covers APIs for numerous global fundamental macro- and industry-data sources, including the following:

  • Kenneth French's data library: Market data on portfolios capturing the returns of key risk factors, such as size, value, and momentum, disaggregated by industry (refer to Chapter 4, Financial Feature Engineering – How to Research Alpha Factors)
  • St. Louis FED (FRED): Federal Reserve data on the U.S. economy and financial markets
  • World Bank: Global database on long-term, lower-frequency economic and social development and demographics
  • OECD: Similar to the World Bank data for OECD countries
  • Enigma: Various datasets, including alternative sources
  • Eurostat: EU-focused economic, social, and demographic data
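
The following is a minimal sketch of pulling data from two of these sources; the series and dataset names are illustrative examples, not requirements:

import pandas_datareader.data as web

# FRED: monthly US civilian unemployment rate
unrate = web.DataReader('UNRATE', 'fred', start='2010')
# Kenneth French's data library: monthly returns of the three
# Fama-French risk factors (market, size, and value)
ff_factors = web.DataReader('F-F_Research_Data_Factors', 'famafrench')[0]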