Hands-On Machine Learning for Algorithmic Trading

By: Stefan Jansen

Overview of this book

The explosive growth of digital data has boosted the demand for expertise in trading strategies that use machine learning (ML). This book enables you to use a broad range of supervised and unsupervised algorithms to extract signals from a wide variety of data sources and create powerful investment strategies. This book shows how to access market, fundamental, and alternative data via API or web scraping and offers a framework to evaluate alternative data. You’ll practice the ML workflow from model design, loss metric definition, and parameter tuning to performance evaluation in a time series context. You will understand ML algorithms such as Bayesian and ensemble methods and manifold learning, and will know how to train and tune these models using pandas, statsmodels, sklearn, PyMC3, xgboost, lightgbm, and catboost. This book also teaches you how to extract features from text data using spaCy, classify news and assign sentiment scores, and use gensim to model topics and learn word embeddings from financial reports. You will also build and evaluate neural networks, including RNNs and CNNs, using Keras and PyTorch to exploit unstructured data for sophisticated strategies. Finally, you will apply transfer learning to satellite images to predict economic activity and use reinforcement learning to build agents that learn to trade in the OpenAI Gym.

Preface

Who this book is for

What this book covers

To get the most out of this book

Machine Learning for Trading

How to read this book

The rise of ML in the investment industry

Design and execution of a trading strategy

ML and algorithmic trading strategies

Market and Fundamental Data

How to work with market data

How to work with fundamental data

Efficient data storage with pandas

Alternative Data for Finance

The alternative data revolution

Evaluating alternative datasets

The market for alternative data

Working with alternative data

Alpha Factor Research

Engineering alpha factors

Seeking signals – how to use zipline

Separating signal and noise – how to use alphalens

Alpha factor resources

Strategy Evaluation

How to build and test a portfolio with zipline

How to measure performance with pyfolio

How to avoid the pitfalls of backtesting

How to manage portfolio risk and return

The Machine Learning Process

Learning from data

The machine learning workflow

Linear Models

Linear regression for inference and prediction

The multiple linear regression model

How to build a linear factor model

Shrinkage methods – regularization for linear regression

How to use linear regression to predict returns

Linear classification

Time Series Models

Analytical tools for diagnostics and feature extraction

Univariate time series models

Multivariate time series models

Bayesian Machine Learning

How Bayesian machine learning works

Probabilistic programming with PyMC3

Decision Trees and Random Forests

Gradient Boosting Machines

Adaptive boosting

Gradient boosting machines

Fast scalable GBM implementations

How to interpret GBM results

Unsupervised Learning

Dimensionality reduction

Working with Text Data

How to extract features from text data

From text to tokens – the NLP pipeline

From tokens to numbers – the document-term matrix

Text classification and sentiment analysis

Topic Modeling

Learning latent topics: goals and approaches

Latent semantic indexing

Probabilistic latent semantic analysis

Latent Dirichlet allocation

Word Embeddings

How word embeddings encode semantics

Word vectors from SEC filings using gensim

Sentiment analysis with Doc2vec

Bonus – Word2vec for translation

Deep Learning

Deep learning and AI

How to design a neural network

How to build a neural network using Python

How to train a neural network

How to use DL libraries

How to optimize neural network architectures

Convolutional Neural Networks

How ConvNets work

How to design and train a CNN using Python

Transfer learning – faster training with less data

How to detect objects

Recent developments

Recurrent Neural Networks

How to build and train RNNs using Python

Autoencoders and Generative Adversarial Nets

How autoencoders work

Designing and training autoencoders using Python

Reinforcement Learning

Key elements of RL

How to solve RL problems

Dynamic programming – Value and Policy iteration

Deep reinforcement learning

Reinforcement learning for trading

Next Steps

Key takeaways and lessons learned

ML for trading in practice

Other Books You May Enjoy

Leave a review - let other readers know what you think

Efficient data storage with pandas

We'll be using many different data sets in this book, and it's worth comparing the main formats for efficiency and performance. In particular, we compare the following:

CSV: Comma-separated values, the standard flat text file format.
HDF5: Hierarchical data format, initially developed at the National Center for Supercomputing Applications. It is a fast and scalable storage format for numerical data, available in pandas through the PyTables library.
Parquet: A binary, columnar storage format that is part of the Apache Hadoop ecosystem and was developed by Cloudera and Twitter. It provides efficient data compression and encoding, and is available for pandas through the pyarrow library, led by Wes McKinney, the original author of pandas.

The storage_benchmark.ipynb notebook compares the performance of the preceding formats using a test DataFrame that can be configured...
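
The notebook itself is not reproduced here. As a rough illustration of what such a comparison can look like, the following is a minimal sketch that writes and reads a small test DataFrame in each of the three formats and records the timings and file sizes. The helper names make_test_frame and benchmark_formats, the column layout, and the default sizes are assumptions made for this example, not the notebook's code. HDF5 and Parquet are skipped when PyTables or pyarrow is not installed.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def make_test_frame(n_rows=10_000, n_numeric=5, n_text=2, seed=42):
    """Build a test DataFrame with a configurable mix of numeric and text columns.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    data = {f"num_{i}": rng.standard_normal(n_rows) for i in range(n_numeric)}
    words = np.array(["alpha", "beta", "gamma", "delta"])
    for i in range(n_text):
        data[f"txt_{i}"] = words[rng.integers(0, len(words), n_rows)]
    return pd.DataFrame(data)


def benchmark_formats(df, directory):
    """Time one write and one read per format and record the file size.

    Formats whose optional dependency is missing raise ImportError inside
    pandas and are skipped. Hypothetical helper for illustration only.
    """
    formats = {
        "csv": (
            lambda path: df.to_csv(path, index=False),
            lambda path: pd.read_csv(path),
        ),
        "hdf5": (
            # format="table" stores text columns natively instead of pickling them
            lambda path: df.to_hdf(path, key="df", mode="w", format="table"),
            lambda path: pd.read_hdf(path, key="df"),
        ),
        "parquet": (
            lambda path: df.to_parquet(path, engine="pyarrow"),
            lambda path: pd.read_parquet(path, engine="pyarrow"),
        ),
    }
    rows = []
    for name, (write, read) in formats.items():
        path = os.path.join(directory, f"test.{name}")
        try:
            start = time.perf_counter()
            write(path)
            write_seconds = time.perf_counter() - start

            start = time.perf_counter()
            loaded = read(path)
            read_seconds = time.perf_counter() - start
        except ImportError:
            continue  # PyTables or pyarrow is not available
        rows.append(
            {
                "format": name,
                "write_seconds": write_seconds,
                "read_seconds": read_seconds,
                "size_bytes": os.path.getsize(path),
                "round_trip_ok": loaded.shape == df.shape,
            }
        )
    return pd.DataFrame(rows).set_index("format")


if __name__ == "__main__":
    frame = make_test_frame()
    with tempfile.TemporaryDirectory() as tmp_dir:
        print(benchmark_formats(frame, tmp_dir))
```

A single write and read per format is noisy. For a more reliable comparison, each operation would need to be repeated several times and averaged, and the results will depend on the share of text versus numeric columns in the test DataFrame.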