Deep Reinforcement Learning with Python - Second Edition

By: Sudharsan Ravichandiran

Overview of this book

With significant enhancements in the quality and quantity of algorithms in recent years, this second edition of Hands-On Reinforcement Learning with Python has been revamped into an example-rich guide to learning state-of-the-art reinforcement learning (RL) and deep RL algorithms with TensorFlow 2 and the OpenAI Gym toolkit. In addition to exploring RL basics and foundational concepts such as the Bellman equation, Markov decision processes, and dynamic programming algorithms, this second edition dives deep into the full spectrum of value-based, policy-based, and actor-critic RL methods. It explores state-of-the-art algorithms such as DQN, TRPO, PPO, ACKTR, DDPG, TD3, and SAC in depth, demystifying the underlying math and demonstrating implementations through simple code examples. The book has several new chapters dedicated to new RL techniques, including distributional RL, imitation learning, inverse RL, and meta RL. You will learn to leverage Stable Baselines, an improvement on OpenAI’s Baselines library, to implement popular RL algorithms with little effort. The book concludes with an overview of promising research approaches such as meta-learning and imagination-augmented agents. By the end, you will be skilled in effectively employing RL and deep RL in your real-world projects.

Preface

Who this book is for

What this book covers

To get the most out of this book

Fundamentals of Reinforcement Learning

Key elements of RL

The basic idea of RL

The RL algorithm

How RL differs from other ML paradigms

Markov Decision Processes

Fundamental concepts of RL

Applications of RL

Further reading

A Guide to the Gym Toolkit

Setting up our machine

Creating our first Gym environment

More Gym environments

Environment synopsis

Further reading

The Bellman Equation and Dynamic Programming

The Bellman equation

Dynamic programming

Is DP applicable to all environments?

Monte Carlo Methods

Understanding the Monte Carlo method

Prediction and control tasks

Monte Carlo prediction

Monte Carlo control

Is the MC method applicable to all tasks?

Understanding Temporal Difference Learning

Comparing the DP, MC, and TD methods

Further reading

Case Study – The MAB Problem

The MAB problem

Applications of MAB

Finding the best advertisement banner using bandits

Contextual bandits

Further reading

Deep Learning Foundations

Biological and artificial neurons

ANN and its layers

Exploring activation functions

Forward propagation in ANNs

How does an ANN learn?

Putting it all together

Recurrent Neural Networks

LSTM to the rescue

The architecture of CNNs

Generative adversarial networks

Further reading

A Primer on TensorFlow

What is TensorFlow?

Understanding computational graphs and sessions

Variables, constants, and placeholders

Introducing TensorBoard

Handwritten digit classification using TensorFlow

Introducing eager execution

Math operations in TensorFlow

TensorFlow 2.0 and Keras

Further reading

Deep Q Network and Its Variants

Playing Atari games using DQN

DQN with prioritized experience replay

The dueling DQN

The deep recurrent Q network

Further reading

Policy Gradient Method

Why policy-based methods?

Policy gradient intuition

Variance reduction methods

Further reading

Actor-Critic Methods – A2C and A3C

Overview of the actor-critic method

Advantage actor-critic (A2C)

Asynchronous advantage actor-critic (A3C)

Further reading

Learning DDPG, TD3, and SAC

Deep deterministic policy gradient

Twin delayed DDPG

Soft actor-critic

Further reading

TRPO, PPO, and ACKTR Methods

Trust region policy optimization

Proximal policy optimization

Actor-critic using Kronecker-factored trust region

Further reading

Distributional Reinforcement Learning

Why distributional reinforcement learning?

Categorical DQN

Quantile Regression DQN

Distributed Distributional DDPG

Further reading

Imitation Learning and Inverse RL

Supervised imitation learning

Deep Q learning from demonstrations

Inverse reinforcement learning

Generative adversarial imitation learning

Further reading

Deep Reinforcement Learning with Stable Baselines

Installing Stable Baselines

Creating our first agent with Stable Baselines

Vectorized environments

Integrating custom environments

Playing Atari games with a DQN and its variants

Lunar lander using A2C

Swinging up a pendulum using DDPG

Training an agent to walk using TRPO

Training a cheetah bot to run using PPO

Implementing GAIL

Further reading

Reinforcement Learning Frontiers

Meta reinforcement learning

Hierarchical reinforcement learning

Imagination augmented agents

Further reading

Other Books You May Enjoy

Index

Appendix 1 – Reinforcement Learning Algorithms

Reinforcement learning algorithm

Value Iteration

Policy Iteration

First-Visit MC Prediction

Every-Visit MC Prediction

MC Prediction – the Q Function

MC Control Method

On-Policy MC Control – Exploring starts

On-Policy MC Control – Epsilon-Greedy

Off-Policy MC Control

On-Policy TD Control – SARSA

Off-Policy TD Control – Q Learning

Deep Q Learning

REINFORCE Policy Gradient

Policy Gradient with Reward-To-Go

REINFORCE with Baseline

Advantage Actor Critic

Asynchronous Advantage Actor-Critic

Deep Deterministic Policy Gradient

Twin Delayed DDPG

Soft Actor-Critic

Trust Region Policy Optimization

Categorical DQN

Distributed Distributional DDPG

Deep Q learning from demonstrations

MaxEnt Inverse Reinforcement Learning

MAML in Reinforcement Learning

Appendix 2 – Assessments

Chapter 1 – Fundamentals of Reinforcement Learning

Chapter 2 – A Guide to the Gym Toolkit

Chapter 3 – The Bellman Equation and Dynamic Programming

Chapter 4 – Monte Carlo Methods

Chapter 5 – Understanding Temporal Difference Learning

Chapter 6 – Case Study – The MAB Problem

Chapter 7 – Deep Learning Foundations

Chapter 8 – A Primer on TensorFlow

Chapter 9 – Deep Q Network and Its Variants

Chapter 10 – Policy Gradient Method

Chapter 11 – Actor-Critic Methods – A2C and A3C

Chapter 12 – Learning DDPG, TD3, and SAC

Chapter 13 – TRPO, PPO, and ACKTR Methods

Chapter 14 – Distributional Reinforcement Learning

Chapter 15 – Imitation Learning and Inverse RL

Chapter 16 – Deep Reinforcement Learning with Stable Baselines

Chapter 17 – Reinforcement Learning Frontiers

What this book covers

Chapter 1, Fundamentals of Reinforcement Learning, helps you build a strong foundation in RL concepts. We will learn about the key elements of RL, the Markov decision process, and several important fundamental concepts such as action spaces, policies, episodes, the value function, and the Q function. At the end of the chapter, we will learn about some of the interesting applications of RL and we will also look into the key terms frequently used in RL.

Chapter 2, A Guide to the Gym Toolkit, provides a complete guide to OpenAI's Gym toolkit. We will understand several interesting environments provided by Gym in detail by implementing them. We will begin our hands-on RL journey from this chapter by implementing several fundamental RL concepts using Gym.

Chapter 3, The Bellman Equation and Dynamic Programming, will help us understand the Bellman equation in detail with extensive math. Next, we will learn two interesting classic RL algorithms called the value and policy iteration methods, which we can use to find the optimal policy. We will also see how to implement value and policy iteration methods for solving the Frozen Lake problem.
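
As a rough illustration of the value iteration method mentioned above, here is a minimal sketch on a tiny, made-up three-state chain. The `transitions` table, the state and action names, and the `gamma` and `tol` settings are all assumptions chosen for illustration; this is not the Frozen Lake implementation from the chapter.

```python
# Hypothetical 3-state chain MDP (illustrative only, not from the book).
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},  # terminal state: no actions available
}


def action_value(outcomes, values, gamma):
    """Expected return of one action: sum over outcomes of p * (r + gamma * V(s'))."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:  # skip terminal states
                continue
            best = max(action_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Extract the greedy policy with respect to the converged values.
    policy = {
        s: max(actions, key=lambda a: action_value(actions[a], values, gamma))
        for s, actions in transitions.items()
        if actions
    }
    return values, policy


values, policy = value_iteration(transitions)
print(values, policy)
```

In this toy chain, moving right from state 1 earns the only reward, so the greedy policy extracted at the end should choose "right" in both non-terminal states.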

Chapter 4, Monte Carlo Methods, explains the model-free method, Monte Carlo. We will learn what prediction and control tasks are, and then we will look into Monte Carlo prediction and Monte Carlo control methods in detail. Next, we will implement the Monte Carlo method to solve the blackjack game using the Gym toolkit.

Chapter 5, Understanding Temporal Difference Learning, deals with one of the most popular and widely used model-free methods called Temporal Difference (TD) learning. First, we will learn how the TD prediction method works in detail, and then we will explore the on-policy TD control method called SARSA and the off-policy TD control method called Q learning in detail. We will also implement TD control methods to solve the Frozen Lake problem using Gym.
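
To make the distinction between the two TD control methods concrete, the following is a minimal sketch of their tabular update rules. The function names, the dictionary-based Q table, and the default `alpha` and `gamma` values are assumptions for illustration, not the book's code. SARSA bootstraps from the action the agent actually takes next, while Q learning bootstraps from the best available next action.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """On-policy update: the target uses the next action actually selected."""
    td_target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Off-policy update: the target uses the greedy (max) next action value."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical states and actions, used only to show the two targets differ.
actions = ["left", "right"]

q_sarsa = defaultdict(float)
q_sarsa[("s1", "right")] = 2.0
sarsa_value = sarsa_update(q_sarsa, "s0", "right", 1.0, "s1", "left")

q_ql = defaultdict(float)
q_ql[("s1", "right")] = 2.0
q_learning_value = q_learning_update(q_ql, "s0", "right", 1.0, "s1", actions)

print(sarsa_value, q_learning_value)
```

With the same transition, the SARSA update here uses the value of the "left" action that was taken next, whereas the Q learning update uses the larger value of "right", so the two estimates move by different amounts.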

Chapter 6, Case Study – The MAB Problem, explains one of the classic problems in RL called the multi-armed bandit (MAB) problem. We will start the chapter by understanding what the MAB problem is and then we will learn about several exploration strategies such as epsilon-greedy, softmax exploration, upper confidence bound, and Thompson sampling methods for solving the MAB problem in detail.
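
As a small illustration of the first exploration strategy listed above, here is a minimal epsilon-greedy sketch on a simulated Bernoulli bandit. The arm probabilities, step count, `epsilon`, and seed are assumptions chosen for the example, and the simulation uses only Python's standard library rather than a Gym environment.

```python
import random


def run_epsilon_greedy(arm_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit, exploring with probability epsilon."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)       # number of pulls per arm
    estimates = [0.0] * len(arm_probs)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(len(arm_probs))
        else:
            # Exploit: pick the arm with the highest estimated reward.
            arm = max(range(len(arm_probs)), key=lambda i: estimates[i])
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update: Q <- Q + (r - Q) / N
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Hypothetical arms with success probabilities 0.2, 0.5, and 0.8.
counts, estimates = run_epsilon_greedy([0.2, 0.5, 0.8])
print(counts, estimates)
```

Over enough steps, most pulls should concentrate on the arm with the highest success probability, while the small exploration rate keeps the estimates for the other arms from going stale.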

Chapter 7, Deep Learning Foundations, helps us to build a strong foundation on deep learning. We will start the chapter by understanding how artificial neural networks work. Then we will learn several interesting deep learning algorithms, such as recurrent neural networks, LSTM networks, convolutional neural networks, and generative adversarial networks.

Chapter 8, A Primer on TensorFlow, deals with one of the most popular deep learning libraries called TensorFlow. We will understand how to use TensorFlow by implementing a neural network to recognize handwritten digits. Next, we will learn to perform several math operations using TensorFlow. Later, we will learn about TensorFlow 2.0 and see how it differs from the previous TensorFlow versions.

Chapter 9, Deep Q Network and Its Variants, enables us to kick-start our deep RL journey. We will learn about one of the most popular deep RL algorithms called the Deep Q Network (DQN). We will understand how DQN works step by step along with the extensive math. We will also implement a DQN to play Atari games. Next, we will explore several interesting variants of DQN, called Double DQN, Dueling DQN, DQN with prioritized experience replay, and DRQN.

Chapter 10, Policy Gradient Method, covers policy gradient methods. We will understand how the policy gradient method works along with the detailed derivation. Next, we will learn several variance reduction methods such as policy gradient with reward-to-go and policy gradient with baseline. We will also understand how to train an agent for the Cart Pole balancing task using policy gradient.

Chapter 11, Actor-Critic Methods – A2C and A3C, deals with several interesting actor-critic methods such as advantage actor-critic and asynchronous advantage actor-critic. We will learn how these actor-critic methods work in detail, and then we will implement them for a mountain car climbing task using OpenAI Gym.

Chapter 12, Learning DDPG, TD3, and SAC, covers state-of-the-art deep RL algorithms such as deep deterministic policy gradient, twin delayed DDPG, and soft actor-critic, along with step-by-step derivations. We will also learn how to implement the DDPG algorithm for performing the inverted pendulum swing-up task using Gym.

Chapter 13, TRPO, PPO, and ACKTR Methods, deals with several popular policy gradient methods such as TRPO and PPO. We will dive into the math behind TRPO and PPO step by step and understand how TRPO and PPO help an agent find the optimal policy. Next, we will learn to implement PPO for performing the inverted pendulum swing-up task. At the end, we will learn about the actor-critic method called actor-critic using Kronecker-factored trust region (ACKTR) in detail.

Chapter 14, Distributional Reinforcement Learning, covers distributional RL algorithms. We will begin the chapter by understanding what distributional RL is. Then we will explore several interesting distributional RL algorithms such as categorical DQN, quantile regression DQN, and distributed distributional DDPG.

Chapter 15, Imitation Learning and Inverse RL, explains imitation and inverse RL algorithms. First, we will understand how supervised imitation learning, DAgger, and deep Q learning from demonstrations work in detail. Next, we will learn about maximum entropy inverse RL. At the end of the chapter, we will learn about generative adversarial imitation learning.

Chapter 16, Deep Reinforcement Learning with Stable Baselines, helps us to understand how to implement deep RL algorithms using a library called Stable Baselines. We will learn what Stable Baselines is and how to use it in detail by implementing several interesting deep RL algorithms such as DQN, A2C, DDPG, TRPO, and PPO.

Chapter 17, Reinforcement Learning Frontiers, covers several interesting avenues in RL, such as meta RL, hierarchical RL, and imagination-augmented agents, in detail.