Deep Reinforcement Learning Hands-On - Second Edition

By : Maxim Lapan

Overview of this book

Deep Reinforcement Learning Hands-On, Second Edition is an updated and expanded version of the bestselling guide to the very latest reinforcement learning (RL) tools and techniques. It provides you with an introduction to the fundamentals of RL, along with the hands-on ability to code intelligent learning agents to perform a range of practical tasks. With six new chapters devoted to a variety of up-to-the-minute developments in RL, including discrete optimization (solving the Rubik's Cube), multi-agent methods, Microsoft's TextWorld environment, advanced exploration techniques, and more, you will come away from this book with a deep understanding of the latest innovations in this emerging field. In addition, you will gain actionable insights into such topic areas as deep Q-networks, policy gradient methods, continuous control problems, and highly scalable, non-gradient methods. You will also discover how to build a real hardware robot trained with RL for less than $100 and solve the Pong environment in just 30 minutes of training using step-by-step code optimization. In short, Deep Reinforcement Learning Hands-On, Second Edition, is your companion to navigating the exciting complexities of RL as it helps you attain experience and knowledge through real-world examples.

Preface

Why I wrote this book

The approach

Who this book is for

What this book covers

To get the most out of this book

Get in touch

What Is Reinforcement Learning?

Supervised learning

Unsupervised learning

Reinforcement learning

RL's complications

RL formalisms

The theoretical foundations of RL

Summary

OpenAI Gym

The anatomy of the agent

Hardware and software requirements

The OpenAI Gym API

The random CartPole agent

Extra Gym functionality – wrappers and monitors

Summary

Deep Learning with PyTorch

Tensors

Gradients

NN building blocks

Custom layers

The final glue – loss functions and optimizers

Monitoring with TensorBoard

Example – GAN on Atari images

PyTorch Ignite

Summary

The Cross-Entropy Method

The taxonomy of RL methods

The cross-entropy method in practice

The cross-entropy method on CartPole

The cross-entropy method on FrozenLake

The theoretical background of the cross-entropy method

Summary

Tabular Learning and the Bellman Equation

Value, state, and optimality

The Bellman equation of optimality

The value of the action

The value iteration method

Value iteration in practice

Q-learning for FrozenLake

Summary

Deep Q-Networks

Real-life value iteration

Summary

Higher-Level RL Libraries

Why RL libraries?

The PTAN library

The PTAN CartPole solver

Other RL libraries

Summary

DQN Extensions

Prioritized replay buffer

Summary

Ways to Speed up RL

The computation graph in PyTorch

Several environments

Play and train in separate processes

Summary

Stocks Trading Using RL

Trading

Data

Problem statements and key decisions

The trading environment

Models

Training code

Results

Things to try

Summary

Policy Gradients – an Alternative

Values and policy

The REINFORCE method

REINFORCE issues

Policy gradient methods on CartPole

Policy gradient methods on Pong

Summary

The Actor-Critic Method

Tuning hyperparameters

Summary

Asynchronous Advantage Actor-Critic

Correlation and sample efficiency

Adding an extra A to A2C

Multiprocessing in Python

A3C with data parallelism

A3C with gradients parallelism

Summary

Training Chatbots with RL

An overview of chatbots

Training: cross-entropy

Training: SCST

Models tested on data

Telegram bot

Summary

The TextWorld Environment

Interactive fiction

The environment

Baseline DQN

The command generation model

Summary

Web Navigation

Web navigation

OpenAI Universe

The simple clicking approach

Human demonstrations

Adding text descriptions

Things to try

Summary

Continuous Action Space

Why a continuous space?

The A2C method

Deterministic policy gradients

Distributional policy gradients

Things to try

Summary

RL in Robotics

Robots and robotics

The first training objective

The emulator and the model

DDPG training and results

Controlling the hardware

Policy experiments

Summary

Trust Regions – PPO, TRPO, ACKTR, and SAC

Roboschool

The A2C baseline

PPO

TRPO

ACKTR

SAC

Summary

Black-Box Optimization in RL

Summary

Advanced Exploration

Why exploration is important

What's wrong with ε-greedy?

Alternative ways of exploration

MountainCar experiments

Atari experiments

Summary

References

Beyond Model-Free – Imagination

Model-based methods

The imagination-augmented agent

I2A on Atari Breakout

Experiment results

Summary

References

AlphaGo Zero

Board games

The AlphaGo Zero method

The Connect 4 bot

Connect 4 results

Summary

References

RL in Discrete Optimization

RL's reputation

The Rubik's Cube and combinatorial optimization

Optimality and God's number

Approaches to cube solving

The training process

The model application

The paper's results

The code outline

The experiment results

Further improvements and experiments

Summary

Multi-agent RL

Multi-agent RL explained

The MAgent environment

Deep Q-network for tigers

Collaboration by the tigers

Training both tigers and deer

The battle between equal actors

Summary

Other Books You May Enjoy

Index

The anatomy of the agent

As you learned in the previous chapter, there are several entities in RL's view of the world:

The agent: A thing, or person, that takes an active role. In practice, the agent is some piece of code that implements some policy. Basically, this policy decides what action is needed at every time step, given our observations.
The environment: Some model of the world that is external to the agent and has the responsibility of providing observations and giving rewards. The environment changes its state based on the agent's actions.

Let's explore how both can be implemented in Python for a simple situation. We will define an environment that will give the agent random rewards for a limited number of steps, regardless of the agent's actions. This scenario is not very useful, but it will allow us to focus on specific methods in both the environment and agent classes. Let's start with the environment:

import random
from typing import List

class Environment:
    def __init__(self):
        self.steps_left = 10

In the preceding code, we allow the environment to initialize its internal state. In our case, the state is just a counter that limits the number of time steps that the agent is allowed to take to interact with the environment.

    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]

The get_observation() method is supposed to return the environment's current observation to the agent. It is usually implemented as some function of the internal state of the environment. If you're curious about what is meant by -> List[float], that's an example of Python type annotations, which have been supported through the typing module since Python 3.5. You can find out more in the documentation at https://docs.python.org/3/library/typing.html. In our example, the observation vector is always zero, as the environment has no internal state beyond the step counter.

    def get_actions(self) -> List[int]:
        return [0, 1]

The get_actions() method allows the agent to query the set of actions it can execute. Normally, this set does not change over time, but some actions can become impossible in particular states (for example, not every move is possible in every position in a game of tic-tac-toe). In our simplistic example, there are only two actions that the agent can carry out, which are encoded with the integers 0 and 1.
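To make the tic-tac-toe remark concrete, here is a minimal sketch of a state-dependent action set. The legal_moves() helper and the flat nine-cell board encoding are assumptions made for illustration only; they are not part of the book's example.

```python
from typing import List


# Hypothetical helper, not part of the book's code: the board is a flat list
# of 9 cells holding "X", "O", or "" for an empty cell.
def legal_moves(board: List[str]) -> List[int]:
    """Return the indices of the empty cells, i.e. the moves still possible."""
    return [idx for idx, cell in enumerate(board) if cell == ""]


empty_board = [""] * 9
mid_game = ["X", "O", "", "", "X", "", "", "", "O"]

print(legal_moves(empty_board))  # all nine cells are available
print(legal_moves(mid_game))     # only the empty cells remain
```

In the toy environment of this section the action set never changes, so get_actions() can simply return a constant list.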

    def is_done(self) -> bool:
        return self.steps_left == 0

The preceding method signals the end of the episode to the agent. As you saw in Chapter 1, What Is Reinforcement Learning?, the series of environment-agent interactions is divided into sequences of steps called episodes. Episodes can be finite, like in a game of chess, or infinite, like the Voyager 2 mission (a famous space probe that was launched over 40 years ago and has traveled beyond our solar system). To cover both scenarios, the environment provides us with a way to detect when an episode is over and there is no way to communicate with it anymore.

    def action(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()

The action() method is the central piece of the environment's functionality. It does two things: it handles the agent's action and returns the reward for this action. In our example, the reward is random and the agent's action is discarded. Additionally, we update the count of steps and refuse to continue episodes that are over.
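The fragments above can be assembled into one class and exercised directly. The following sketch repeats the methods shown so far and then drives the environment by hand; the driver loop at the bottom and the names rewards and refused are illustrative additions, not part of the book's file.

```python
import random
from typing import List


class Environment:
    """The toy environment from this section, assembled in one place."""

    def __init__(self):
        self.steps_left = 10

    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()


# Illustrative driver: collect one reward per allowed step.
env = Environment()
rewards = []
while not env.is_done():
    rewards.append(env.action(0))  # the action value is ignored by this environment

print(len(rewards))  # one reward per allowed step

# Any further call is refused because the episode has ended.
try:
    env.action(0)
    refused = False
except Exception as exc:
    refused = True
    print("refused:", exc)
```

Each reward comes from random.random(), so it lies in the half-open interval [0.0, 1.0).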

The agent's part is much simpler and includes only two methods: the constructor and the method that performs one step in the environment:

class Agent:
    def __init__(self):
        self.total_reward = 0.0

In the constructor, we initialize the counter that will keep the total reward accumulated by the agent during the episode.

    def step(self, env: Environment):
        current_obs = env.get_observation()
        actions = env.get_actions()
        reward = env.action(random.choice(actions))
        self.total_reward += reward

The step function accepts the environment instance as an argument and allows the agent to perform the following actions:

Observe the environment
Make a decision about the action to take based on the observations
Submit the action to the environment
Get the reward for the current step

In our example, the agent is dull: it ignores the observations it obtains when deciding which action to take. Instead, every action is selected randomly. The final piece is the glue code, which creates both classes and runs one episode:

if __name__ == "__main__":
    env = Environment()
    agent = Agent()
    while not env.is_done():
        agent.step(env)
    print("Total reward got: %.4f" % agent.total_reward)

You can find the preceding code in this book's GitHub repository at https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition in the Chapter02/01_agent_anatomy.py file. It has no external dependencies and should work with any reasonably modern version of Python 3. By running it several times, you'll get different amounts of reward gathered by the agent.
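As a rough guide to what "different amounts of reward" means here: one episode sums ten draws from random.random(), so the total always falls in [0, 10) and averages about 5. The sketch below simulates many such episodes without the classes; the episode_total() helper, the seed, and the episode count are arbitrary choices made for illustration.

```python
import random

STEPS = 10  # matches steps_left in the Environment constructor


def episode_total(rng: random.Random) -> float:
    """Sum STEPS rewards, each drawn uniformly from [0.0, 1.0)."""
    return sum(rng.random() for _ in range(STEPS))


rng = random.Random(42)  # seed chosen arbitrarily, only for repeatability
totals = [episode_total(rng) for _ in range(1000)]

print("min:  %.4f" % min(totals))
print("max:  %.4f" % max(totals))
print("mean: %.4f" % (sum(totals) / len(totals)))
```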

The simplicity of the preceding code illustrates the important basic concepts that come from the RL model. The environment could be an extremely complicated physics model, and an agent could easily be a large neural network (NN) that implements the latest RL algorithm, but the basic pattern will stay the same – on every step, the agent will take some observations from the environment, do its calculations, and select the action to take. The result of this action will be a reward and a new observation.
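The same pattern can be sketched with an environment that does have meaningful state and an agent that actually uses its observation. Everything below is hypothetical and is not taken from the book: CorridorEnv, its reward rule, and the fixed "always move right" rule standing in for a learned policy. Here action() returns the new observation alongside the reward, which differs from the float-only action() shown earlier.

```python
from typing import List, Tuple


class CorridorEnv:
    """Hypothetical environment: the agent walks a corridor of cells 0..4.

    Reaching the last cell gives reward 1.0 and ends the episode.
    """

    def __init__(self):
        self.position = 0
        self.steps_left = 20

    def get_observation(self) -> List[float]:
        return [float(self.position)]

    def is_done(self) -> bool:
        return self.position == 4 or self.steps_left == 0

    def action(self, action: int) -> Tuple[float, List[float]]:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        # Action 1 moves right, action 0 moves left (never below cell 0).
        self.position = max(0, self.position + (1 if action == 1 else -1))
        reward = 1.0 if self.position == 4 else 0.0
        return reward, self.get_observation()


class Agent:
    def __init__(self):
        self.total_reward = 0.0

    def step(self, env: CorridorEnv, obs: List[float]) -> List[float]:
        # Fixed rule standing in for a learned policy: move right until the goal.
        action = 1 if obs[0] < 4 else 0
        reward, new_obs = env.action(action)
        self.total_reward += reward
        return new_obs


env = CorridorEnv()
agent = Agent()
obs = env.get_observation()
steps = 0
while not env.is_done():
    obs = agent.step(env, obs)
    steps += 1

print("steps:", steps, "total reward:", agent.total_reward)
```

The loop has the same shape as the glue code earlier in this section: observe, decide, act, then receive a reward and a new observation.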

You may ask, if the pattern is the same, why do we need to write it from scratch? What if it is already implemented by somebody and could be used as a library? Of course, such frameworks exist, but before we spend some time discussing them, let's prepare your development environment.
