Deep Reinforcement Learning Hands-On - Second Edition

By: Maxim Lapan

Overview of this book

Deep Reinforcement Learning Hands-On, Second Edition is an updated and expanded version of the bestselling guide to the very latest reinforcement learning (RL) tools and techniques. It provides you with an introduction to the fundamentals of RL, along with the hands-on ability to code intelligent learning agents to perform a range of practical tasks. With six new chapters devoted to a variety of up-to-the-minute developments in RL, including discrete optimization (solving the Rubik's Cube), multi-agent methods, Microsoft's TextWorld environment, advanced exploration techniques, and more, you will come away from this book with a deep understanding of the latest innovations in this emerging field. In addition, you will gain actionable insights into such topic areas as deep Q-networks, policy gradient methods, continuous control problems, and highly scalable, non-gradient methods. You will also discover how to build a real hardware robot trained with RL for less than $100 and solve the Pong environment in just 30 minutes of training using step-by-step code optimization. In short, Deep Reinforcement Learning Hands-On, Second Edition, is your companion to navigating the exciting complexities of RL as it helps you attain experience and knowledge through real-world examples.

Preface

Why I wrote this book

Who this book is for

What this book covers

To get the most out of this book

What Is Reinforcement Learning?

Supervised learning

Unsupervised learning

Reinforcement learning

RL's complications

The theoretical foundations of RL

OpenAI Gym

The anatomy of the agent

Hardware and software requirements

The OpenAI Gym API

The random CartPole agent

Extra Gym functionality – wrappers and monitors

Deep Learning with PyTorch

NN building blocks

The final glue – loss functions and optimizers

Monitoring with TensorBoard

Example – GAN on Atari images

The Cross-Entropy Method

The taxonomy of RL methods

The cross-entropy method in practice

The cross-entropy method on CartPole

The cross-entropy method on FrozenLake

The theoretical background of the cross-entropy method

Tabular Learning and the Bellman Equation

Value, state, and optimality

The Bellman equation of optimality

The value of the action

The value iteration method

Value iteration in practice

Q-learning for FrozenLake

Deep Q-Networks

Real-life value iteration

Tabular Q-learning

Deep Q-learning

Higher-Level RL Libraries

Why RL libraries?

The PTAN library

The PTAN CartPole solver

Other RL libraries

DQN Extensions

Prioritized replay buffer

Categorical DQN

Combining everything

Ways to Speed up RL

Why speed matters

The computation graph in PyTorch

Several environments

Play and train in separate processes

Tweaking wrappers

Benchmark summary

Going hardcore: CuLE

Stocks Trading Using RL

Problem statements and key decisions

The trading environment

Policy Gradients – an Alternative

Values and policy

The REINFORCE method

REINFORCE issues

Policy gradient methods on CartPole

Policy gradient methods on Pong

The Actor-Critic Method

Variance reduction

CartPole variance

A2C on Pong results

Tuning hyperparameters

Asynchronous Advantage Actor-Critic

Correlation and sample efficiency

Adding an extra A to A2C

Multiprocessing in Python

A3C with data parallelism

A3C with gradients parallelism

Training Chatbots with RL

An overview of chatbots

Chatbot training

The deep NLP basics

Seq2seq training

Chatbot example

Dataset exploration

Training: cross-entropy

Models tested on data

The TextWorld Environment

Interactive fiction

The environment

The command generation model

Web Navigation

OpenAI Universe

The simple clicking approach

Human demonstrations

Adding text descriptions

Continuous Action Space

Why a continuous space?

Deterministic policy gradients

Distributional policy gradients

RL in Robotics

Robots and robotics

The first training objective

The emulator and the model

DDPG training and results

Controlling the hardware

Policy experiments

Trust Regions – PPO, TRPO, ACKTR, and SAC

The A2C baseline

Black-Box Optimization in RL

Black-box methods

Evolution strategies

Genetic algorithms

Advanced Exploration

Why exploration is important

What's wrong with ε-greedy?

Alternative ways of exploration

MountainCar experiments

Atari experiments

Beyond Model-Free – Imagination

Model-based methods

The imagination-augmented agent

I2A on Atari Breakout

Experiment results

AlphaGo Zero

The AlphaGo Zero method

The Connect 4 bot

Connect 4 results

RL in Discrete Optimization

RL's reputation

The Rubik's Cube and combinatorial optimization

Optimality and God's number

Approaches to cube solving

The training process

The model application

The paper's results

The code outline

The experiment results

Further improvements and experiments

Multi-agent RL

Multi-agent RL explained

The MAgent environment

Deep Q-network for tigers

Collaboration by the tigers

Training both tigers and deer

The battle between equal actors

Other Books You May Enjoy

Index

To get the most out of this book

All the chapters in this book that describe RL methods have the same structure: first, we discuss the motivation for the method, its theoretical foundation, and the idea behind it. Then, we work through several examples of the method applied to different environments, with the full source code.

You can use the book in different ways:

To quickly become familiar with a method, you can read only the introductory part of the relevant chapter.
To get a deeper understanding of how the method is implemented, you can read the code and the comments around it.
To gain a deep familiarity with the method (the best way to learn, I believe), you can try to reimplement the method and make it work, using the provided source code as a reference point.

In any case, I hope the book will be useful for you!

Download the example code files

You can download the example code files for this book from your account at www.packt.com/. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at http://www.packt.com.
Select the Support tab.
Click on Code Downloads.
Enter the name of the book in the Search box and follow the on-screen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition. If the code is updated, the changes will be made in the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781838826994_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

def grads_func(proc_name, net, device, train_queue):
    envs = [make_env() for _ in range(NUM_ENVS)]
    agent = ptan.agent.PolicyAgent(
        lambda x: net(x)[0], device=device, apply_softmax=True)
    exp_source = ptan.experience.ExperienceSourceFirstLast(
        envs, agent, gamma=GAMMA, steps_count=REWARD_STEPS)
    batch = []
    frame_idx = 0
    writer = SummaryWriter(comment=proc_name)

Any command-line input or output is written as follows:

rl_book_samples/Chapter11$ ./02_a3c_grad.py --cuda -n final

Bold: Indicates a new term, an important word, or words that you see on the screen. For example, words in menus or dialog boxes appear in the text like this: “Select System info from the Administration panel.”

Warnings or important notes appear like this.

Tips and tricks appear like this.