Book Image

Deep Reinforcement Learning Hands-On - Second Edition

By : Maxim Lapan
Book Image

Deep Reinforcement Learning Hands-On - Second Edition

By: Maxim Lapan

Overview of this book

Deep Reinforcement Learning Hands-On, Second Edition is an updated and expanded version of the bestselling guide to the very latest reinforcement learning (RL) tools and techniques. It provides you with an introduction to the fundamentals of RL, along with the hands-on ability to code intelligent learning agents to perform a range of practical tasks. With six new chapters devoted to a variety of up-to-the-minute developments in RL, including discrete optimization (solving the Rubik's Cube), multi-agent methods, Microsoft's TextWorld environment, advanced exploration techniques, and more, you will come away from this book with a deep understanding of the latest innovations in this emerging field. In addition, you will gain actionable insights into such topic areas as deep Q-networks, policy gradient methods, continuous control problems, and highly scalable, non-gradient methods. You will also discover how to build a real hardware robot trained with RL for less than $100 and solve the Pong environment in just 30 minutes of training using step-by-step code optimization. In short, Deep Reinforcement Learning Hands-On, Second Edition, is your companion to navigating the exciting complexities of RL as it helps you attain experience and knowledge through real-world examples.
Table of Contents (28 chapters)
Other Books You May Enjoy

Extra Gym functionality – wrappers and monitors

What we have discussed so far covers two-thirds of the Gym core API and the essential functions required to start writing agents. The rest of the API you can live without, but it will make your life easier and your code cleaner. So, let's briefly cover the rest of the API.


Very frequently, you will want to extend the environment's functionality in some generic way. For example, imagine an environment gives you some observations, but you want to accumulate them in some buffer and provide to the agent the N last observations. This is a common scenario for dynamic computer games, when one single frame is just not enough to get the full information about the game state. Another example is when you want to be able to crop or preprocess an image's pixels to make it more convenient for the agent to digest, or if you want to normalize reward scores somehow. There are many such situations that have the same structure – you want to "wrap" the existing environment and add some extra logic for doing something. Gym provides a convenient framework for these situations – the Wrapper class.

The class structure is shown in the following diagram.

\\\All_Books\2018\Working_Titles\Books2018\9471_Deep Reinforcement Learning Hands-On\Current-Titles\Chapter02\Graphics\B09471_02_04.png

Figure 2.4: The hierarchy of Wrapper classes in Gym

The Wrapper class inherits the Env class. Its constructor accepts the only argument – the instance of the Env class to be "wrapped." To add extra functionality, you need to redefine the methods you want to extend, such as step() or reset(). The only requirement is to call the original method of the superclass.

To handle more specific requirements, such as a Wrapper class that wants to process only observations from the environment, or only actions, there are subclasses of Wrapper that allow the filtering of only a specific portion of information. They are as follows:

  • ObservationWrapper: You need to redefine the observation (obs) method of the parent. The obs argument is an observation from the wrapped environment, and this method should return the observation that will be given to the agent.
  • RewardWrapper: This exposes the reward (rew) method, which can modify the reward value given to the agent.
  • ActionWrapper: You need to override the action (act) method, which can tweak the action passed to the wrapped environment by the agent.

To make it slightly more practical, let's imagine a situation where we want to intervene in the stream of actions sent by the agent and, with a probability of 10%, replace the current action with a random one. It might look like an unwise thing to do, but this simple trick is one of the most practical and powerful methods for solving the exploration/exploitation problem that I mentioned briefly in Chapter 1, What Is Reinforcement Learning?. By issuing the random actions, we make our agent explore the environment and from time to time drift away from the beaten track of its policy. This is an easy thing to do using the ActionWrapper class (a full example is in Chapter02/

import gym
from typing import TypeVar
import random
Action = TypeVar('Action')
class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env, epsilon=0.1):
        super(RandomActionWrapper, self).__init__(env)
        self.epsilon = epsilon

Here, we initialized our wrapper by calling a parent's __init__ method and saving epsilon (the probability of a random action).

    def action(self, action: Action) -> Action:
        if random.random() < self.epsilon:
            return self.env.action_space.sample()
        return action

This is a method that we need to override from a parent's class to tweak the agent's actions. Every time we roll the die, and with the probability of epsilon, we sample a random action from the action space and return it instead of the action the agent has sent to us. Note that using action_space and wrapper abstractions, we were able to write abstract code, which will work with any environment from Gym. Additionally, we must print the message every time we replace the action, just to verify that our wrapper is working. In the production code, of course, this won't be necessary.

if __name__ == "__main__":
    env = RandomActionWrapper(gym.make("CartPole-v0"))

Now it's time to apply our wrapper. We will create a normal CartPole environment and pass it to our Wrapper constructor. From here on, we will use our wrapper as a normal Env instance, instead of the original CartPole. As the Wrapper class inherits the Env class and exposes the same interface, we can nest our wrappers in any combination we want. This is a powerful, elegant, and generic solution.

    obs = env.reset()
    total_reward = 0.0
    while True:
        obs, reward, done, _ = env.step(0)
        total_reward += reward
        if done:
    print("Reward got: %.2f" % total_reward)

Here is almost the same code, except that every time we issue the same action, 0, our agent is dull and does the same thing. By running the code, you should see that the wrapper is indeed working.

rl_book_samples/Chapter02$ python
Reward got: 12.00

If you want, you can play with the epsilon parameter on the wrapper's creation and verify that randomness improves the agent's score on average.

We should move on now and look at another interesting gem that is hidden inside Gym: Monitor.


Another class that you should be aware of is Monitor. It is implemented like Wrapper and can write information about your agent's performance in a file, with an optional video recording of your agent in action. Some time ago, it was possible to upload the result of the Monitor class' recording to the website and see your agent's position in comparison to other people's results (see the following screenshot), but, unfortunately, at the end of August 2017, OpenAI decided to shut down this upload functionality and froze all the results. There are several alternatives to the original website, but they are not ready yet. I hope this situation will be resolved soon, but at the time of writing, it is not possible to check your results against those of others.

Just to give you an idea of how the Gym web interface looked, here is the CartPole environment leaderboard:

\\\All_Books\2018\Working_Titles\Books2018\9471_Deep Reinforcement Learning Hands-On\Current-Titles\Chapter02\Graphics\B09471_02_05.jpg

Figure 2.5: The Gym web interface with CartPole submissions

Every submission in the web interface had details about training dynamics. For example, the following is my solution for one of Doom's mini-games:

\\\All_Books\2018\Working_Titles\Books2018\9471_Deep Reinforcement Learning Hands-On\Current-Titles\Chapter02\Graphics\B09471_02_06.jpg

Figure 2.6: Submission dynamics for the DoomDefendLine environment

Despite this, Monitor is still useful, as you can take a look at your agent's life inside the environment. So, here is how we add Monitor to our random CartPole agent, which is the only difference (the entire code is in Chapter02/

if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    env = gym.wrappers.Monitor(env, "recording")

The second argument that we pass to Monitor is the name of the directory that it will write the results to. This directory shouldn't exist, otherwise your program will fail with an exception (to overcome this, you could either remove the existing directory or pass the force=True argument to the Monitor class' constructor).

The Monitor class requires the FFmpeg utility to be present on the system, which is used to convert captured observations into an output video file. This utility must be available, otherwise Monitor will raise an exception. The easiest way to install FFmpeg is using your system's package manager, which is OS distribution-specific.

To start this example, one of these three extra prerequisites should be met:

  • The code should be run in an X11 session with the OpenGL extension (GLX)
  • The code should be started in an Xvfb virtual display
  • You can use X11 forwarding in an SSH connection

The reason for this is video recording, which is done by taking screenshots of the window drawn by the environment. Some of the environment uses OpenGL to draw its picture, so the graphical mode with OpenGL needs to be present. This could be a problem for a virtual machine in the cloud, which physically doesn't have a monitor and graphical interface running. To overcome this, there is a special "virtual" graphical display, called Xvfb (X11 virtual framebuffer), which basically starts a virtual graphical display on the server and forces the program to draw inside it. This would be enough to make Monitor happily create the desired videos.

To start your program in the Xvfb environment, you need to have it installed on your machine (this usually requires installing the xvfb package) and run the special script, xvfb-run:

$ xvfb-run -s "-screen 0 640x480x24" python [2017-09-22 12:22:23,446] Making new env: CartPole-v0
[2017-09-22 12:22:23,451] Creating monitor directory recording
[2017-09-22 12:22:23,570] Starting new video recorder writing to recording/
Episode done in 14 steps, total reward 14.00
[2017-09-22 12:22:26,290] Finished writing results. You can upload them to the scoreboard via gym.upload('recording')

As you may see from the preceding log, the video has been written successfully, so you can peek inside one of your agent's sections by playing it.

Another way to record your agent's actions is to use SSH X11 forwarding, which uses the SSH ability to tunnel X11 communications between the X11 client (Python code that wants to display some graphical information) and X11 server (software that knows how to display this information and has access to your physical display).

In X11 architecture, the client and the server are separated and can work on different machines. To use this approach, you need the following:

  1. An X11 server running on your local machine. Linux comes with an X11 server as a standard component (all desktop environments use X11). On a Windows machine, you can set up third-party X11 implementations, such as open source VcXsrv (available in
  2. The ability to log into your remote machine via SSH, passing the –X command-line option: ssh –X servername. This enables X11 tunneling and allows all processes started in this session to use your local display for graphics output.

Then, you can start a program that uses the Monitor class and it will display the agent's actions, capturing the images in a video file.