Deep Reinforcement Learning with Python - Second Edition

By: Sudharsan Ravichandiran

Overview of this book

With significant enhancements in the quality and quantity of algorithms in recent years, this second edition of Hands-On Reinforcement Learning with Python has been revamped into an example-rich guide to learning state-of-the-art reinforcement learning (RL) and deep RL algorithms with TensorFlow 2 and the OpenAI Gym toolkit. In addition to exploring RL basics and foundational concepts such as the Bellman equation, Markov decision processes, and dynamic programming algorithms, this second edition dives deep into the full spectrum of value-based, policy-based, and actor-critic RL methods. It explores state-of-the-art algorithms such as DQN, TRPO, PPO, ACKTR, DDPG, TD3, and SAC in depth, demystifying the underlying math and demonstrating implementations through simple code examples. The book has several new chapters dedicated to new RL techniques, including distributional RL, imitation learning, inverse RL, and meta RL. You will learn to leverage Stable Baselines, an improvement of OpenAI’s Baselines library, to effortlessly implement popular RL algorithms. The book concludes with an overview of promising research approaches such as meta-learning and imagination-augmented agents. By the end, you will be skilled in effectively employing RL and deep RL in your real-world projects.

Preface

Who this book is for

What this book covers

To get the most out of this book

Fundamentals of Reinforcement Learning

Key elements of RL

The basic idea of RL

The RL algorithm

How RL differs from other ML paradigms

Markov Decision Processes

Fundamental concepts of RL

Applications of RL

Further reading

A Guide to the Gym Toolkit

Setting up our machine

Creating our first Gym environment

More Gym environments

Environment synopsis

Further reading

The Bellman Equation and Dynamic Programming

The Bellman equation

Dynamic programming

Is DP applicable to all environments?

Monte Carlo Methods

Understanding the Monte Carlo method

Prediction and control tasks

Monte Carlo prediction

Monte Carlo control

Is the MC method applicable to all tasks?

Understanding Temporal Difference Learning

Comparing the DP, MC, and TD methods

Further reading

Case Study – The MAB Problem

The MAB problem

Applications of MAB

Finding the best advertisement banner using bandits

Contextual bandits

Further reading

Deep Learning Foundations

Biological and artificial neurons

ANN and its layers

Exploring activation functions

Forward propagation in ANNs

How does an ANN learn?

Putting it all together

Recurrent Neural Networks

LSTM to the rescue

The architecture of CNNs

Generative adversarial networks

Further reading

A Primer on TensorFlow

What is TensorFlow?

Understanding computational graphs and sessions

Variables, constants, and placeholders

Introducing TensorBoard

Handwritten digit classification using TensorFlow

Introducing eager execution

Math operations in TensorFlow

TensorFlow 2.0 and Keras

Further reading

Deep Q Network and Its Variants

Playing Atari games using DQN

DQN with prioritized experience replay

The dueling DQN

The deep recurrent Q network

Further reading

Policy Gradient Method

Why policy-based methods?

Policy gradient intuition

Variance reduction methods

Further reading

Actor-Critic Methods – A2C and A3C

Overview of the actor-critic method

Advantage actor-critic (A2C)

Asynchronous advantage actor-critic (A3C)

Further reading

Learning DDPG, TD3, and SAC

Deep deterministic policy gradient

Twin delayed DDPG

Soft actor-critic

Further reading

TRPO, PPO, and ACKTR Methods

Trust region policy optimization

Proximal policy optimization

Actor-critic using Kronecker-factored trust region

Further reading

Distributional Reinforcement Learning

Why distributional reinforcement learning?

Categorical DQN

Quantile Regression DQN

Distributed Distributional DDPG

Further reading

Imitation Learning and Inverse RL

Supervised imitation learning

Deep Q learning from demonstrations

Inverse reinforcement learning

Generative adversarial imitation learning

Further reading

Deep Reinforcement Learning with Stable Baselines

Installing Stable Baselines

Creating our first agent with Stable Baselines

Vectorized environments

Integrating custom environments

Playing Atari games with a DQN and its variants

Lunar lander using A2C

Swinging up a pendulum using DDPG

Training an agent to walk using TRPO

Training a cheetah bot to run using PPO

Implementing GAIL

Further reading

Reinforcement Learning Frontiers

Meta reinforcement learning

Hierarchical reinforcement learning

Imagination augmented agents

Further reading

Other Books You May Enjoy

Index

Appendix 1 – Reinforcement Learning Algorithms

Reinforcement learning algorithm

Value Iteration

Policy Iteration

First-Visit MC Prediction

Every-Visit MC Prediction

MC Prediction – the Q Function

MC Control Method

On-Policy MC Control – Exploring starts

On-Policy MC Control – Epsilon-Greedy

Off-Policy MC Control

On-Policy TD Control – SARSA

Off-Policy TD Control – Q Learning

Deep Q Learning

REINFORCE Policy Gradient

Policy Gradient with Reward-To-Go

REINFORCE with Baseline

Advantage Actor Critic

Asynchronous Advantage Actor-Critic

Deep Deterministic Policy Gradient

Twin Delayed DDPG

Soft Actor-Critic

Trust Region Policy Optimization

Categorical DQN

Distributed Distributional DDPG

Deep Q learning from demonstrations

MaxEnt Inverse Reinforcement Learning

MAML in Reinforcement Learning

Appendix 2 – Assessments

Chapter 1 – Fundamentals of Reinforcement Learning

Chapter 2 – A Guide to the Gym Toolkit

Chapter 3 – The Bellman Equation and Dynamic Programming

Chapter 4 – Monte Carlo Methods

Chapter 5 – Understanding Temporal Difference Learning

Chapter 6 – Case Study – The MAB Problem

Chapter 7 – Deep Learning Foundations

Chapter 8 – A Primer on TensorFlow

Chapter 9 – Deep Q Network and Its Variants

Chapter 10 – Policy Gradient Method

Chapter 11 – Actor-Critic Methods – A2C and A3C

Chapter 12 – Learning DDPG, TD3, and SAC

Chapter 13 – TRPO, PPO, and ACKTR Methods

Chapter 14 – Distributional Reinforcement Learning

Chapter 15 – Imitation Learning and Inverse RL

Chapter 16 – Deep Reinforcement Learning with Stable Baselines

Chapter 17 – Reinforcement Learning Frontiers

The double DQN

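We have learned that in DQN, the target value is computed as:

$$y = r + \gamma \max_{a'} Q_{\theta'}(s', a')$$

Here, $r$ is the reward, $\gamma$ is the discount factor, and $\theta'$ denotes the target network parameters.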

One problem with a DQN is that it tends to overestimate the Q value of the next state-action pair in the target:

$$\max_{a'} Q_{\theta'}(s', a')$$

This overestimation is due to the max operator. Let's see how it happens with an example. Suppose we are in a state s′ and have three actions a₁, a₂, and a₃. Assume a₃ is the optimal action in state s′. When we estimate the Q values of all the actions in state s′, the estimates will contain some noise and differ from the actual values. Say that, due to the noise, action a₂ gets a higher estimated Q value than the optimal action a₃.
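The effect of noise on the max operator can be checked numerically. The following is a minimal sketch, not the book's code: the true Q values, the Gaussian noise model, and the helper name `max_operator_bias` are all assumptions made for illustration. It adds zero-mean noise to the true Q values many times and compares the average of the maximum noisy estimate with the true maximum.

```python
import numpy as np

def max_operator_bias(true_q, noise_std=1.0, n_trials=10_000, seed=0):
    """Compare the max over noisy Q estimates with the true max.

    Returns (average max of the noisy estimates, true max,
    fraction of trials where a non-optimal action had the top estimate).
    """
    rng = np.random.default_rng(seed)
    true_q = np.asarray(true_q, dtype=float)
    # Zero-mean Gaussian noise (an assumed noise model): each individual
    # estimate is unbiased on its own.
    noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, true_q.size))
    mean_of_max = noisy_q.max(axis=1).mean()
    wrong_pick_rate = (noisy_q.argmax(axis=1) != true_q.argmax()).mean()
    return mean_of_max, true_q.max(), wrong_pick_rate

# Hypothetical true Q values for a1, a2, a3; a3 is the optimal action.
mean_of_max, true_max, wrong_pick_rate = max_operator_bias([1.0, 1.5, 2.0])
print(f"true max Q: {true_max:.2f}")
print(f"average max of noisy estimates: {mean_of_max:.2f}")
print(f"fraction of trials picking a non-optimal action: {wrong_pick_rate:.2f}")
```

Even though each estimate is unbiased, taking the maximum favors whichever action happened to receive the largest positive noise, so the average of the maximum sits above the true maximum.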

We know that the target value is computed as:

$$y = r + \gamma \max_{a'} Q_{\theta'}(s', a')$$

Now, if we select the best action as the one with the maximum estimated value, we will end up selecting action a₂ instead of the optimal action a₃, as shown here:

$$y = r + \gamma \max_{a'} Q_{\theta'}(s', a') = r + \gamma\, Q_{\theta'}(s', a_2)$$
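As a rough numerical illustration of this selection, the sketch below computes the DQN target once from noisy estimates and once from the true values. The helper `dqn_target`, its `done` flag for terminal transitions, and all the numbers are hypothetical and chosen only so that a₂ ends up with the highest estimate.

```python
import numpy as np

def dqn_target(reward, next_q_values, gamma=0.99, done=False):
    """Target y = r + gamma * max over a' of Q(s', a').

    The done flag is an assumed convention: on a terminal transition
    there is no next state, so the target is just the reward.
    """
    if done:
        return float(reward)
    return float(reward + gamma * np.max(next_q_values))

# Hypothetical Q values of a1, a2, a3 in the next state.
true_next_q = np.array([1.0, 1.5, 2.0])       # a3 is truly optimal
estimated_next_q = np.array([1.1, 2.6, 1.9])  # noise pushes a2 to the top

reward, gamma = 1.0, 0.9
selected = int(np.argmax(estimated_next_q))   # index of the action max picks
target_from_estimates = dqn_target(reward, estimated_next_q, gamma)
target_from_true_q = dqn_target(reward, true_next_q, gamma)

print("selected action index:", selected)
print("target from noisy estimates:", target_from_estimates)
print("target from true Q values:", target_from_true_q)
```

With these numbers the max operator picks index 1 (action a₂), and the resulting target is larger than the one the true Q values would give.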

So, how can we get rid of this overestimation? We can get rid of this overestimation...