Machine Learning for Algorithmic Trading - Second Edition

By: Stefan Jansen

Overview of this book

The explosive growth of digital data has boosted the demand for expertise in trading strategies that use machine learning (ML). This revised and expanded second edition enables you to build and evaluate sophisticated supervised, unsupervised, and reinforcement learning models. This book introduces end-to-end machine learning for the trading workflow, from the idea and feature engineering to model optimization, strategy design, and backtesting. It illustrates this by using examples ranging from linear models and tree-based ensembles to deep-learning techniques from cutting edge research. This edition shows how to work with market, fundamental, and alternative data, such as tick data, minute and daily bars, SEC filings, earnings call transcripts, financial news, or satellite images to generate tradeable signals. It illustrates how to engineer financial features or alpha factors that enable an ML model to predict returns from price data for US and international stocks and ETFs. It also shows how to assess the signal content of new features using Alphalens and SHAP values and includes a new appendix with over one hundred alpha factor examples. By the end, you will be proficient in translating ML model predictions into a trading strategy that operates at daily or intraday horizons, and in evaluating its performance.

API access to market data

There are several options you can use to access market data via an API using Python. We will first present a few sources built into the pandas library and the yfinance tool that facilitates the downloading of end-of-day market data and recent fundamental data from Yahoo! Finance.

Then we will briefly introduce the trading platform Quantopian, the data provider Quandl, and the Zipline backtesting library that we will use later in the book, and list several additional options for accessing various types of market data. The directory data_providers on GitHub contains several notebooks that illustrate the usage of these options.

Remote data access using pandas

The pandas library enables access to data displayed on websites using the read_html function and access to the API endpoints of various data providers through the related pandas-datareader library.

Reading HTML tables

Downloading the content of one or more HTML tables, such as for the constituents of the S&P 500 index from Wikipedia, works as follows:

import pandas as pd
sp_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
# read_html returns a list with one DataFrame per table; keep the first
sp = pd.read_html(sp_url, header=0)[0]
sp.info()
RangeIndex: 505 entries, 0 to 504
Data columns (total 9 columns):
Symbol                    505 non-null object
Security                  505 non-null object
SEC filings                505 non-null object
GICS Sector               505 non-null object
GICS Sub Industry         505 non-null object
Headquarters Location     505 non-null object
Date first added           408 non-null object
CIK                       505 non-null int64
Founded                   234 non-null object
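
The info() output above shows that Date first added is stored as a generic object column and has missing entries (408 of 505 rows). The following is a minimal sketch of how such a column could be converted to dates. It uses a small hand-made frame in place of the downloaded table; the sample rows and symbols are illustrative assumptions, not the actual Wikipedia content:

```python
import pandas as pd

# Hypothetical stand-in for the table returned by pd.read_html;
# the values are made up for illustration only.
sp = pd.DataFrame({
    'Symbol': ['AAA', 'BBB', 'CCC'],
    'Date first added': ['1976-08-09', None, 'not available'],
})

# errors='coerce' turns missing or unparseable entries into NaT
sp['Date first added'] = pd.to_datetime(sp['Date first added'],
                                        format='%Y-%m-%d',
                                        errors='coerce')

print(sp.dtypes)
print(sp['Date first added'].notna().sum(), 'valid dates')
```

After the conversion, the column supports date-based filtering and sorting, while rows without a usable date remain in the table as NaT.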

pandas-datareader for market data

pandas used to facilitate access to data provider APIs directly, but this functionality has moved to the pandas-datareader library (refer to the README for links to the documentation).

The stability of the APIs varies with provider policies and continues to change. Please consult the documentation for up-to-date information. As of December 2019, at version 0.8.1, the following sources are available:

| Source | Scope | Comment |
|---|---|---|
| Tiingo | Historical end-of-day prices on equities, mutual funds, and ETFs. | Free registration for the API key. Free accounts can access only 500 symbols. |
| Investor Exchange (IEX) | Historical stock prices are available if traded on IEX. | Requires an API key from IEX Cloud Console. |
| Alpha Vantage | Historical equity data for daily, weekly, and monthly frequencies, 20+ years, and the past 3-5 days of intraday data. It also has FOREX and sector performance data. | |
| Quandl | Free data sources as listed on their website. | |
| Fama/French | Risk factor portfolio returns. | Used in Chapter 7, Linear Models – From Risk Factors to Return Forecasts. |
| TSP Fund Data | Mutual fund prices. | |
| Nasdaq | Latest metadata on traded tickers. | |
| Stooq Index Data | Some equity indices are not available from elsewhere due to licensing issues. | |
| MOEX | Moscow Exchange historical data. | |

The access and retrieval of data follow a similar API for all sources, as illustrated for Yahoo! Finance:

import pandas_datareader.data as web
from datetime import datetime
start = '2014'              # accepts strings
end = datetime(2017, 5, 24) # or datetime objects
yahoo = web.DataReader('FB', 'yahoo', start=start, end=end)
yahoo.info()
DatetimeIndex: 856 entries, 2014-01-02 to 2017-05-25
Data columns (total 6 columns):
High         856 non-null float64
Low          856 non-null float64
Open         856 non-null float64
Close        856 non-null float64
Volume       856 non-null int64
Adj Close    856 non-null float64
dtypes: float64(5), int64(1)
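
Because the sources share one call signature, a single helper can request the same symbol from several of them. The sketch below assumes that DataReader is called as in the example above; fetch_from_sources, fake_reader, and the 'broken' source label are hypothetical names introduced only for illustration. Real sources may need an API key and can fail in provider-specific ways, so the helper stores the exception instead of stopping:

```python
from datetime import datetime


def fetch_from_sources(symbol, sources, start, end, reader=None):
    """Request `symbol` from each source in turn and collect the results.

    `reader` is any callable with the DataReader call signature; it is
    injectable so the helper can be tried without network access.
    """
    if reader is None:
        # Imported lazily so the sketch loads without pandas-datareader
        from pandas_datareader.data import DataReader as reader
    results = {}
    for source in sources:
        try:
            results[source] = reader(symbol, source, start=start, end=end)
        except Exception as exc:  # keep the error and try the next source
            results[source] = exc
    return results


def fake_reader(name, data_source, start=None, end=None):
    """Offline stand-in that only echoes its arguments."""
    if data_source == 'broken':
        raise ValueError('source unavailable')
    return (name, data_source, start, end)


out = fetch_from_sources('FB', ['stooq', 'broken'],
                         start='2014', end=datetime(2017, 5, 24),
                         reader=fake_reader)
print(out)
```

Omitting the reader argument makes the helper fall back to pandas-datareader, in which case each successful entry is a DataFrame.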

yfinance – scraping data from Yahoo! Finance

yfinance aims to provide a reliable and fast way to download historical market data from Yahoo! Finance. The library was originally named fix-yahoo-finance. The usage of this library is very straightforward; the notebook yfinance_demo illustrates the library's capabilities.

How to download end-of-day and intraday prices

The Ticker object permits the downloading of various data points scraped from Yahoo's website:

import yfinance as yf
symbol = 'MSFT'
ticker = yf.Ticker(symbol)

The .history method obtains historical prices for various periods, from one day to the maximum available, and at different frequencies; intraday data is only available for the last several days. To download adjusted OHLCV data at a one-minute frequency together with corporate actions, use:

data = ticker.history(period='5d',
                      interval='1m',
                      actions=True,
                      auto_adjust=True)
data.info()
DatetimeIndex: 1747 entries, 2019-11-22 09:30:00-05:00 to 2019-11-29 13:00:00-05:00
Data columns (total 7 columns):
Open            1747 non-null float64
High            1747 non-null float64
Low             1747 non-null float64
Close           1747 non-null float64
Volume          1747 non-null int64
Dividends       1747 non-null int64
Stock Splits    1747 non-null int64
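
One-minute bars like these are often aggregated to a coarser frequency before further analysis. The following is a minimal sketch using pandas resampling on synthetic bars; the prices and volumes are made up and only mimic the OHLCV column layout shown above:

```python
import numpy as np
import pandas as pd

# Synthetic one-minute bars standing in for the ticker.history() result
idx = pd.date_range('2019-11-22 09:30', periods=10, freq='min')
close = 150 + np.arange(10, dtype=float)
bars = pd.DataFrame({'Open': close - 0.5,
                     'High': close + 1.0,
                     'Low': close - 1.0,
                     'Close': close,
                     'Volume': 100}, index=idx)

# Each column needs its own aggregation rule
rules = {'Open': 'first', 'High': 'max', 'Low': 'min',
         'Close': 'last', 'Volume': 'sum'}
bars_5m = bars.resample('5min').agg(rules)
print(bars_5m)
```

The ten one-minute bars collapse into two five-minute bars; the open and close come from the first and last minute of each window, while the volume is summed.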

The notebook also illustrates how to access quarterly and annual financial statements, sustainability scores, analyst recommendations, and upcoming earnings dates.

How to download the option chain and prices

yfinance also provides access to the option expiration dates and prices and other information for various contracts. Using the ticker instance from the previous example, we get the expiration dates using:

ticker.options
('2019-12-05',  '2019-12-12',  '2019-12-19',..)

For any of these dates, we can access the option chain and view details for the various put/call contracts as follows:

options = ticker.option_chain('2019-12-05')
options.calls.info()
Data columns (total 14 columns):
contractSymbol       35 non-null object
lastTradeDate        35 non-null datetime64[ns]
strike               35 non-null float64
lastPrice            35 non-null float64
bid                  35 non-null float64
ask                  35 non-null float64
change               35 non-null float64
percentChange        35 non-null float64
volume               34 non-null float64
openInterest         35 non-null int64
impliedVolatility    35 non-null float64
inTheMoney           35 non-null bool
contractSize         35 non-null object
currency             35 non-null object
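
Since the calls and puts are returned as regular DataFrames, the usual pandas operations apply. The sketch below works on a hand-made frame with a subset of the columns listed above; all values are invented for illustration, and the mid and rel_spread columns are hypothetical additions rather than fields provided by yfinance:

```python
import pandas as pd

# Made-up rows using a subset of the option-chain columns
calls = pd.DataFrame({
    'strike': [140.0, 150.0, 160.0],
    'bid': [11.0, 2.0, 0.1],
    'ask': [11.4, 2.2, 0.3],
    'volume': [10.0, None, 250.0],
    'inTheMoney': [True, True, False],
})

# Midpoint of the quoted bid and ask as a rough price estimate
calls['mid'] = (calls['bid'] + calls['ask']) / 2
# Relative bid-ask spread as a simple liquidity indicator
calls['rel_spread'] = (calls['ask'] - calls['bid']) / calls['mid']

# Keep out-of-the-money contracts that have a reported volume
otm_traded = calls[~calls['inTheMoney'] & calls['volume'].notna()]
print(otm_traded[['strike', 'mid', 'rel_spread']])
```

The volume filter reflects the info() output above, where volume has fewer non-null entries than the other columns.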

The library also permits the use of proxy servers to prevent rate limiting and facilitates the bulk downloading of multiple tickers. The notebook demonstrates the usage of these features as well.
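
As a rough sketch of the bulk download, the helper below forwards a list of symbols to yf.download and groups the result by ticker. download_many, fake_download, and the downloader argument are hypothetical names used for illustration; the offline stand-in only records the arguments so the example runs without network access:

```python
def download_many(symbols, period='1mo', interval='1d', downloader=None):
    """Download price history for several symbols in one request."""
    if downloader is None:
        # Imported lazily; the real call requires network access
        import yfinance as yf
        downloader = yf.download
    return downloader(tickers=list(symbols), period=period,
                      interval=interval, group_by='ticker',
                      auto_adjust=True)


def fake_download(**kwargs):
    """Offline stand-in that returns the arguments it received."""
    return kwargs


request = download_many(['MSFT', 'AAPL'], period='5d',
                        downloader=fake_download)
print(request)
```

With the default downloader, the result is expected to be a single DataFrame whose columns are grouped by ticker symbol.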

Quantopian

Quantopian is an investment firm that offers a research platform to crowd-source trading algorithms. Registration is free, and members can research trading ideas using a broad variety of data sources. It also offers an environment to backtest the algorithm against historical data, as well as to forward-test it out of sample with live data. It awards investment allocations for top-performing algorithms whose authors are entitled to a 10 percent (at the time of writing) profit share.

The Quantopian research platform consists of a Jupyter Notebook environment for alpha-factor research and performance analysis. There is also an interactive development environment (IDE) for coding algorithmic strategies and backtesting the result using historical data since 2002 with minute-bar frequency.

Users can also simulate algorithms with live data, which is known as paper trading. Quantopian provides various market datasets, including U.S. equity and futures price and volume data at a one-minute frequency, and U.S. equity corporate fundamentals, and it also integrates numerous alternative datasets.

We will dive into the Quantopian platform in much more detail in Chapter 4, Financial Feature Engineering – How to Research Alpha Factors, and rely on its functionality throughout the book, so feel free to open an account right away. (Refer to the GitHub repository for more details.)

Zipline

Zipline is the algorithmic trading library that powers the Quantopian backtesting and live-trading platform. It is also available offline to develop a strategy using a limited number of free data bundles that can be ingested and used to test the performance of trading ideas before porting the result to the online Quantopian platform for paper and live trading.

Zipline requires a custom environment; see the instructions at the beginning of the notebook zipline_data_demo.ipynb. The following code illustrates how Zipline permits us to access daily stock data for a range of companies. You can run Zipline scripts in the Jupyter Notebook using the magic function of the same name.

First, you need to initialize the context with the desired security symbols. We'll also use a counter variable. Then, Zipline calls handle_data, where we use the data.history() method to look back a single period and append the data for the last day to a .csv file:

%load_ext zipline

%%zipline --start 2010-1-1 --end 2018-1-1 --data-frequency daily
# Separate notebook cell: the %%zipline cell magic must be its first line
from zipline.api import order_target, record, symbol

def initialize(context):
    context.i = 0  # counts the calls to handle_data
    context.assets = [symbol('FB'), symbol('GOOG'), symbol('AMZN')]

def handle_data(context, data):
    # Look back one daily bar for all assets
    df = data.history(context.assets, fields=['price', 'volume'],
                      bar_count=1, frequency="1d")
    df = df.to_frame().reset_index()

    if context.i == 0:
        # First call: create the file and write the header
        df.columns = ['date', 'asset', 'price', 'volume']
        df.to_csv('stock_data.csv', index=False)
    else:
        # Later calls: append rows without repeating the header
        df.to_csv('stock_data.csv', index=False, mode='a', header=None)
    context.i += 1

# Separate notebook cell, run after the backtest: load and plot the prices
import pandas as pd
df = pd.read_csv('stock_data.csv')
df.date = pd.to_datetime(df.date)
df.set_index('date').groupby('asset').price.plot(lw=2, legend=True,
       figsize=(14, 6));

We get the following plot for the preceding code:

Figure 2.9: Zipline data access
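
The file written by handle_data is in long format, with one row per date and asset. A minimal sketch of reshaping such data into one price column per asset follows. It reads a short made-up sample with the same four columns instead of stock_data.csv; the asset labels AAA and BBB and all numbers are placeholders:

```python
import io
import pandas as pd

# Made-up sample in the date/asset/price/volume layout
csv_text = """date,asset,price,volume
2010-01-04,AAA,100.0,1000
2010-01-04,BBB,200.0,2000
2010-01-05,AAA,110.0,1500
2010-01-05,BBB,190.0,2500
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

# One row per date, one column per asset
prices = df.pivot(index='date', columns='asset', values='price')
# Simple daily returns; the first date has no prior price and is dropped
returns = prices.pct_change().dropna()
print(prices)
print(returns)
```

The wide layout makes it straightforward to compare assets side by side or to compute returns per column.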

We will explore the capabilities of Zipline, and especially the online Quantopian platform, in more detail in the coming chapters.

Quandl

Quandl provides a broad range of data sources, both free and by subscription, using a Python API. Register and obtain a free API key to make more than 50 calls per day. Quandl data covers multiple asset classes beyond equities and includes FX, fixed income, indexes, futures and options, and commodities.

API usage is straightforward, well-documented, and flexible, with numerous methods beyond single-series downloads, such as bulk downloads or metadata searches.

The following call obtains oil prices from 1986 onward, as quoted by the U.S. Department of Energy:

import quandl
oil = quandl.get('EIA/PET_RWTC_D').squeeze()
oil.plot(lw=2, title='WTI Crude Oil Price')

We get this plot for the preceding code:

Figure 2.10: Quandl oil price example
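
The .squeeze() call in the preceding code converts a one-column DataFrame into a Series, which is convenient for plotting a single time series. The sketch below illustrates this step on a made-up frame, assuming the request returns a single value column indexed by date; the column name, dates, and numbers are placeholders rather than actual oil prices:

```python
import pandas as pd

# Made-up stand-in for a one-column result of a single-series request
frame = pd.DataFrame(
    {'Value': [10.0, 11.0, 12.0]},
    index=pd.to_datetime(['1986-01-02', '1986-01-03', '1986-01-06']))
frame.index.name = 'Date'

# One column -> Series; the date index is preserved
series = frame.squeeze()
print(type(frame).__name__, '->', type(series).__name__)
print(series.loc[pd.Timestamp('1986-01-03')])
```

If a request returned several columns, squeeze would leave the DataFrame unchanged, so selecting the desired column explicitly is the more robust choice in that case.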

Other market data providers

A broad variety of providers offer market data for various asset classes. Examples in relevant categories include:

  • Exchanges derive a growing share of their revenues from an ever-broader range of data services, typically using a subscription.
  • Bloomberg and Thomson Reuters have long been the leading data aggregators with a combined share of over 55 percent in the $28.5 billion financial data market. Smaller rivals, such as FactSet, are growing, and new ones are emerging, such as money.net, Quandl, Trading Economics, and Barchart.
  • Specialist data providers abound. One example is LOBSTER, which aggregates Nasdaq order-book data in real time.
  • Free data providers include Alpha Vantage, which offers Python APIs for real-time equity, FX, and cryptocurrency market data, as well as technical indicators.
  • Crowd-sourced investment firms that provide research platforms with data access include Quantopian and Alpha Trading Labs; the latter launched in March 2018 and provides HFT infrastructure and data.