Index
A
- A2C, with ACKTR
- about / A2C using ACKTR
- implementing / Implementation
- results / Results
- A2C agent / Adding an extra A to A2C
- A2C baseline
- about / A2C baseline
- results / Results
- videos, recording / Videos recording
- A2C on Pong
- about / A2C on Pong
- results / A2C on Pong results
- A3C parallelization / A3C – data parallelism
- results / Results
- parallelism of gradients / A3C – gradients parallelism, Results
- action space
- about / Action space
- actor-critic / Actor-critic
- Actor-Critic (A2C) / Self-play
- Actor-Critic (A2C) method
- about / The Actor-Critic (A2C) method
- implementation / Implementation
- results / Results
- models, using / Using models and recording videos
- videos, recording / Using models and recording videos
- actor-critic parallelization
- approaches / Adding an extra A to A2C
- agent
- anatomy / The anatomy of the agent
- AgentNet
- reference / The PyTorch Agent Net library
- AlphaGo Zero method
- overview / Overview
- MCTS / Monte-Carlo Tree Search
- self-play / Self-play
- training / Training and evaluation
- evaluation / Training and evaluation
- Asynchronous Advantage Actor-Critic (A3C)
- about / Proximal Policy Optimization
- Asynchronous Advantage Actor-Critic (A3C) agent / Model imperfections
- Asynchronous Advantage Actor-Critic (A3C) method / PG on Pong, Why a continuous space?
- Atari transformations
- used by RL researchers / Wrappers
B
- bar
- about / Data
- baseline agent
- training / The baseline agent
- Baselines
- reference / Wrappers
- basic DQN
- about / Basic DQN
- Bellman equation
- Bilingual evaluation understudy (BLEU) score / Bilingual evaluation understudy (BLEU) score
- black-box methods
- about / Black-box methods
- properties / Black-box methods
- board games
- about / Board games
- branching factor / Monte-Carlo Tree Search
- browser automation
- and RL / Browser automation and RL
C
- candlestick chart
- about / Data
- CartPole variance / CartPole variance
- categorical DQN
- about / Categorical DQN
- implementing / Implementation
- results / Results
- chatbot example
- about / The chatbot example
- structure / The example structure
- cornell.py file / Modules: cornell.py and data.py
- data.py file / Modules: cornell.py and data.py
- BLEU score / BLEU score and utils.py
- utils.py module / BLEU score and utils.py
- model / Model
- cross-entropy method / Training: cross-entropy
- training code / Running the training
- data, checking / Checking the data
- trained model / Testing the trained model
- SCST training / Training: SCST
- SCST training, running / Running the SCST training
- results / Results
- Telegram bot / Telegram bot
- chatbots
- overview / Chatbots overview
- entertainment human-mimicking / The chatbot example
- goal-oriented / The chatbot example
- Connect4 bot
- about / Connect4 bot
- game model / Game model
- MCTS implementation / Implementing MCTS
- model / Model
- training process / Training
- testing / Testing and comparison
- comparison / Testing and comparison
- results / Connect4 results
- continuous space
- need for / Why a continuous space?
- convolutional model / Models
- convolution model / The convolution model
- Cornell Movie-Dialogs Corpus
- reference / The example structure
- correlation / Correlation and sample efficiency
- Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
- about / Evolution strategies
- cross-entropy
- on CartPole / Cross-entropy on CartPole
- on FrozenLake / Cross-entropy on FrozenLake
- theoretical background / Theoretical background of the cross-entropy method
- curriculum learning / Log-likelihood training
- custom layers
- about / Custom layers
D
- data
- about / Data
- decoder / Encoder-Decoder
- deep deterministic policy gradients (DDPG)
- about / Deterministic policy gradients
- deep GA
- about / Deep GA
- deep learning (DL) / Chatbots overview, Hardware and software requirements
- DeepMind Control Suite / Things to try
- deep NLP basics
- about / Deep NLP basics
- RNNs / Recurrent Neural Networks
- embeddings / Embeddings
- Encoder-Decoder / Encoder-Decoder
- deep Q-learning
- about / Deep Q-learning
- interaction, with environment / Interaction with the environment
- SGD optimization / SGD optimization
- correlation, between steps / Correlation between steps
- Markov property / The Markov property
- deep Q-network (DQN) method / Why a continuous space?
- deterministic policy gradients
- about / Deterministic policy gradients
- exploration / Exploration
- implementation / Implementation
- results / Results
- videos, recording / Recording videos
- Dilbert Reward Process (DRP)
- about / Markov reward process
- distributional policy gradients
- about / Distributional policy gradients
- architecture / Architecture
- implementation / Implementation
- results / Results
- Docker
- reference / Installation
- double DQN
- about / Double DQN
- implementing / Implementation
- results / Results
- DQN improvements
- combining / Combining everything
- implementation / Implementation
- results / Results
- DQN model / DQN model
- DQN on Pong
- about / DQN on Pong
- wrappers / Wrappers
- training / Training
- running / Running and performance
- performance / Running and performance
- working / Your model in action
- DQN training
- about / The final form of DQN training
- dueling DQN
- about / Dueling DQN
- implementing / Implementation
- results / Results
E
- ELIZA
- reference / Chatbots overview
- EM weights
- training / Training EM weights
- encoder / Encoder-Decoder
- Encoder-Decoder / Encoder-Decoder
- entropy / Theoretical background of the cross-entropy method
- environment
- about / The anatomy of the agent
- environment model (EM) / Imagination-augmented agent
- environments
- about / Environments
- MuJoCo / Environments
- PyBullet / Environments
- ES, on CartPole
- about / ES on CartPole
- results / Results
- ES, on HalfCheetah
- about / ES on HalfCheetah
- results / Results
- evolution strategies (ES)
- about / Evolution strategies
F
- factorized Gaussian noise / Noisy networks
- feed-forward model / The feed-forward model
- fitness function / Black-box methods
- FrozenLake
- value iteration method / Value iteration in practice
- Q-learning / Q-learning for FrozenLake
G
- GA, on CartPole
- about / GA on CartPole
- results / Results
- GA, on Cheetah
- about / GA on Cheetah
- results / Results
- GAN on Atari images
- example / Example – GAN on Atari images
- GA tweaks
- about / GA tweaks
- deep GA / Deep GA
- novelty search / Novelty search
- generative adversarial networks (GANs) / Example – GAN on Atari images
- genetic algorithms (GA)
- about / Genetic algorithms
- GPU tensors / GPU tensors
- gradients / Gradients, Tensors and gradients
- Gym / Hardware and software requirements
H
- hardware requisites / Hardware and software requirements
- hidden state / Recurrent Neural Networks
- human demonstrations
- about / Human demonstrations
- recording / Recording the demonstrations
- recording format / Recording format
- training, with demonstrations / Training using demonstrations
- results / Results
- TicTacToe problem / TicTacToe problem
- hyperparameter tuning
- about / Tuning hyperparameters
- learning rate (LR) / Learning rate
- entropy beta / Entropy beta
- count of environments / Count of environments
- batch size / Batch size
I
- I2A, on Atari Breakout
- about / I2A on Atari Breakout
- baseline A2C agent / The baseline A2C agent
- EM training / EM training
- imagination agent / The imagination agent
- implementing / The I2A model
- rollout encoder / The Rollout encoder
- training process / Training of I2A
- I2A model
- training with / Training with the I2A model
- imagination-augmented agent
- about / Imagination-augmented agent
- environment model / The environment model
- rollout policy / The rollout policy
- rollout encoder / The rollout encoder
- paper results / Paper results
- imagination path / Imagination-augmented agent
- independent Gaussian noise
- about / Noisy networks
K
- Kaitai binary parser language
- reference / Recording format
- key decisions / Problem statements and key decisions
- Kullback-Leibler (KL) divergence / PG on CartPole, RL in seq2seq, Theoretical background of the cross-entropy method
L
- loss functions
- about / Final glue – loss functions and optimizers, Loss functions
- nn.MSELoss / Loss functions
- nn.BCELoss / Loss functions
- nn.CrossEntropyLoss / Loss functions
- nn.NLLLoss / Loss functions
M
- machine learning (ML) / Chatbots overview
- Markov chain
- about / Markov process
- Markov decision process (MDP)
- about / Markov decision process, Markov decision processes, Issues with simple clicking
- Markov process
- about / Markov process
- Markov property
- about / Markov process
- Markov reward process
- about / Markov reward process
- mean squared error (MSE) / EM training, Training and evaluation
- mean squared error (MSE) loss
- about / The Actor-Critic (A2C) method
- minimax
- about / Board games
- Mini World of Bits (MiniWoB) / Mini World of Bits benchmark
- Mini World of Bits benchmark / Mini World of Bits benchmark
- model-based approach
- versus , model-free approach / Model-based versus model-free
- model imperfections / Model imperfections
- models / Models
- Monitor / Monitor
- Monte-Carlo Tree Search (MCTS) / Overview
- MuJoCo
- URL / Environments
- about / Environments
- multiprocessing
- in Python / Multiprocessing in Python
N
- N-step DQN
- about / N-step DQN
- implementing / Implementation
- natural language / Chatbots overview
- neural network
- building blocks / NN building blocks
- neural network (NN) / Problem statements and key decisions, Monte-Carlo Tree Search
- neural networks (NNs)
- about / Deterministic policy gradients
- noisy networks
- about / Noisy networks
- implementing / Implementation
- results / Results
- notebook gradients / Gradients
- novelty search
- about / Novelty search
- implementing / Novelty search
- NumPy / Hardware and software requirements
O
- OpenAI
- reference / OpenAI Gym API
- OpenAI Gym API
- about / OpenAI Gym API
- action space / Action space
- observation space / Observation space
- environment / The environment
- environment, creating / Creation of the environment
- CartPole session / The CartPole session
- OpenAI Universe
- reference / Creation of the environment, OpenAI Universe
- about / OpenAI Universe
- installing / Installation
- actions / Actions and observations
- observations / Actions and observations
- environment creation / Environment creation
- MiniWoB stability / MiniWoB stability
- OpenCV Python bindings / Hardware and software requirements
- optimality
- about / Value, state, and optimality
- optimizers
- about / Final glue – loss functions and optimizers, Optimizers
- SGD / Optimizers
- RMSprop / Optimizers
- Adagrad / Optimizers
- Ornstein-Uhlenbeck (OU) process
- about / Implementation
P
- partially observable Markov decision process (POMDP)
- about / Implementation, The Markov property
- PG method, on CartPole
- about / PG on CartPole
- results / Results
- PG method, on Pong
- about / PG on Pong
- results / Results
- policy / Values and policy
- need for / Why policy?
- representing / Policy representation
- policy-based method
- versus value-based method / Policy-based versus value-based methods
- policy gradient (PG) / Training of I2A
- policy gradients / Policy gradients
- PPO
- about / Proximal Policy Optimization
- implementing / Implementation
- results / Results
- practical cross-entropy / Practical cross-entropy
- prioritized replay buffer
- about / Prioritized replay buffer
- implementing / Implementation
- results / Results
- problem statements / Problem statements and key decisions
- Ptan
- reference / Hardware and software requirements
- PyBullet
- about / Environments
- Python
- multiprocessing module / Multiprocessing in Python
- PyTorch / Hardware and software requirements
- about / ES on HalfCheetah
- PyTorch Agent Net library
- about / The PyTorch Agent Net library
- design principles / The PyTorch Agent Net library
- agent entity / Agent
- agents experience / Agent's experience
- experience buffer / Experience buffer
- gym env wrappers / Gym env wrappers
- PyTorch documentation
- reference / Tensor operations
Q
- Q-learning, for FrozenLake
- about / Q-learning for FrozenLake
R
- random CartPole agent / The random CartPole agent
- real-life value iteration / Real-life value iteration
- recurrent neural network (RNN) / The Rollout encoder
- reinforcement learning
- about / Learning – supervised, unsupervised, and reinforcement
- formalisms / RL formalisms and relations
- relations / RL formalisms and relations
- reward / Reward
- agent / The agent
- environment / The environment
- actions / Actions
- observations / Observations
- in seq2seq / RL in seq2seq
- REINFORCE method
- about / The REINFORCE method
- CartPole example / The CartPole example
- results / Results
- issues / REINFORCE issues, Full episodes are required, High gradients variance, Exploration, Correlation between samples
- Remote Framebuffer Protocol (RFB) / Recording format
- reference / Recording format
- results
- feed-forward model / The feed-forward model
- convolution model / The convolution model
- RL methods
- taxonomy / Taxonomy of RL methods
- Roboschool
- about / Roboschool
- installation link / Roboschool
S
- sample efficiency / Value iteration in practice, Correlation and sample efficiency
- scalar tensors / Scalar tensors
- seq2seq
- reinforcement learning / RL in seq2seq
- seq2seq model
- about / Encoder-Decoder
- training / Training of seq2seq
- log-likelihood training / Log-likelihood training
- Bilingual evaluation understudy (BLEU) score / Bilingual evaluation understudy (BLEU) score
- self-critical sequence training / Self-critical sequence training
- simple clicking approach
- about / Simple clicking approach
- grid actions / Grid actions
- example overview / Example overview
- model / Model
- training code / Training code, Starting containers
- starting containers / Starting containers
- training process / Training process
- learned policy, checking / Checking the learned policy
- issues, with simple clicking / Issues with simple clicking
- software requisites / Hardware and software requirements
- stochastic
- about / Deterministic policy gradients
- stochastic gradient descent (SGD) / Deep Q-learning, Log-likelihood training
- about / Deterministic policy gradients, ES on HalfCheetah
- supervised learning / Learning – supervised, unsupervised, and reinforcement
T
- tabular Q-learning
- about / Tabular Q-learning
- teacher forcing / Log-likelihood training
- Telegram bot
- about / Telegram bot
- reference / Telegram bot
- TensorBoard
- monitoring with / Monitoring with TensorBoard
- plotting stuff / Plotting stuff
- tensorboard-pytorch
- reference / TensorBoard 101
- TensorBoard 101 / TensorBoard 101
- tensors
- about / Tensors
- creating / Creation of tensors
- scalar tensors / Scalar tensors
- operations / Tensor operations
- GPU tensors / GPU tensors
- and gradients / Tensors and gradients
- text description
- adding / Adding text description
- results / Results
- TicTacToe
- game tree / Monte-Carlo Tree Search
- trading
- about / Trading
- trading environment / The trading environment
- training code / Training code
- tree pruning
- about / Board games
- TRPO
- about / Trust Region Policy Optimization
- implementing / Implementation
- results / Results
- trust region policy optimization (TRPO) / Model imperfections
U
- unsupervised learning / Learning – supervised, unsupervised, and reinforcement
V
- value
- about / Value, state, and optimality
- calculating / Value, state, and optimality
- value-based method
- versus policy-based method / Policy-based versus value-based methods
- value iteration method
- about / The value iteration method
- working, for FrozenLake / Value iteration in practice
- reward table / Value iteration in practice
- transitions table / Value iteration in practice
- value table / Value iteration in practice
- value of action
- about / Value of action
- value of state
- about / Value, state, and optimality
- values / Values and policy
- variance reduction / Variance reduction
- VcXsrv
- reference / Monitor
- Virtual Network Computing (VNC)
- reference / Mini World of Bits benchmark
W
- web navigation
- about / Web navigation
- word2vec / Embeddings
- word embeddings / Embeddings
- wrappers / Wrappers
- wrappers, OpenAI Baselines project
- reference / Gym env wrappers
X
- Xvfb (X11 virtual framebuffer) / Monitor