TensorFlow Reinforcement Learning Quick Start Guide

By: Kaushik Balakrishnan

Overview of this book

Advances in reinforcement learning algorithms have made it possible to use them for optimal control in several industrial applications. With this book, you will apply reinforcement learning to a range of problems, from computer games to autonomous driving. The book starts by introducing essential reinforcement learning concepts such as agents, environments, rewards, and advantage functions. You will learn the distinctions between on-policy and off-policy algorithms, as well as between model-free and model-based algorithms. You will then study several reinforcement learning algorithms, such as SARSA, Deep Q-Networks (DQN), Deep Deterministic Policy Gradient (DDPG), Asynchronous Advantage Actor-Critic (A3C), Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO). The book also shows you how to code these algorithms in TensorFlow and Python and apply them to solve computer games from OpenAI Gym. Finally, you will learn how to train a car to drive autonomously in the TORCS racing car simulator. By the end of the book, you will be able to design, build, train, and evaluate feed-forward and convolutional neural networks, code state-of-the-art algorithms, and train agents for various control problems.

The relationship between an agent and its environment

At a very basic level, RL involves an agent and an environment. An agent is an artificial intelligence entity that has certain goals, must remain vigilant about things that can get in the way of these goals, and must, at the same time, pursue the things that help it attain these goals. An environment is everything that the agent can interact with. Consider, for example, an industrial mobile robot navigating inside a factory: the robot is the agent, and the factory is the environment.

The robot has certain pre-defined goals, for example, to move goods from one side of the factory to the other without colliding with obstacles such as walls and/or other robots. The environment is the region available for the robot to navigate, and includes all the places the robot can go to, including the obstacles that the robot could crash into. So the primary task of the robot, or more precisely, the agent, is to explore the environment, understand how the actions it takes affect its rewards, be cognizant of the obstacles that can cause catastrophic crashes or failures, and then master the art of maximizing its rewards and improving its performance over time.

In this process, the agent inevitably interacts with the environment; some of these interactions help the agent toward its goals, while others hinder it. So, the agent must learn how the environment responds to the actions it takes. This is a trial-and-error learning approach, and only after numerous such trials can the agent learn how the environment will respond to its decisions.

Let's now come to understand what the state space of an agent is, and the actions that the agent performs to explore the environment.

Defining the states of the agent

In RL parlance, states represent the current situation of the agent. For example, in the previous industrial mobile robot agent case, the state at a given time instant is the location of the robot inside the factory – that is, where it is located, its orientation, or more precisely, the pose of the robot. For a robot that has joints and effectors, the state can also include the precise location of the joints and effectors in a three-dimensional space. For an autonomous car, its state can represent its speed, location on a map, distance to other obstacles, torques on its wheels, the rpm of the engine, and so on.

States are usually deduced from sensors in the real world; for instance, the measurement from odometers, LIDARs, radars, and cameras. States can be a one-dimensional vector of real numbers or integers, or two-dimensional camera images, or even higher-dimensional, for instance, three-dimensional voxels. There are really no precise limitations on states, and the state just represents the current situation of the agent.

In RL literature, states are typically represented as st, where the subscript t is used to denote the time instant corresponding to the state.
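To make this concrete, here is a minimal sketch of how a state st might be stored in Python, the language used throughout this book. The sensor values and array shapes below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical state vector for an autonomous car at time t:
# [speed (m/s), x position (m), y position (m), distance to nearest obstacle (m)]
s_t = np.array([12.5, 104.2, 88.7, 6.3], dtype=np.float32)

# A camera-based state could instead be a two-dimensional image,
# here an 84 x 84 RGB frame filled with zeros as a placeholder.
image_state = np.zeros((84, 84, 3), dtype=np.uint8)

print(s_t.shape)          # (4,)
print(image_state.shape)  # (84, 84, 3)
```

The same agent can combine both kinds of state, for instance by feeding the image through a convolutional network and concatenating the low-dimensional sensor vector.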

Defining the actions of the agent

The agent performs actions to explore the environment. Obtaining this action vector is the primary goal in RL. Ideally, you need to strive to obtain optimal actions.

An action is the decision an agent takes in a certain state, st. Typically, it is represented as at, where, as before, the subscript t denotes the time instant. The actions that are available to an agent depend on the problem. For instance, an agent in a maze can decide to take a step north, or south, or east, or west. These are called discrete actions, as there are a fixed number of possibilities. On the other hand, for an autonomous car, actions can be the steering angle, throttle value, brake value, and so on, which are called continuous actions as they can take real number values in a bounded range. For example, the steering angle can be 40 degrees from the north-south line, and the throttle can be 60% down, and so on.

Thus, actions at can be either discrete or continuous, depending on the problem at hand. Some RL approaches handle discrete actions, while others are suited for continuous actions.
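The distinction can be sketched in a few lines of Python. The action names and bounds below are made up for illustration, not taken from any particular simulator:

```python
import numpy as np

# Discrete actions: the maze agent picks one of a fixed set of moves.
discrete_actions = ["north", "south", "east", "west"]
a_t = np.random.choice(discrete_actions)

# Continuous actions: the car outputs bounded real values, e.g.
# steering angle in [-40, 40] degrees, throttle and brake in [0, 1].
low = np.array([-40.0, 0.0, 0.0])
high = np.array([40.0, 1.0, 1.0])
a_t_continuous = np.random.uniform(low, high)

print(a_t)             # one of the four discrete moves
print(a_t_continuous)  # three real numbers within the bounds
```

OpenAI Gym, used later in the book, encodes exactly this distinction through its Discrete and Box action spaces.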

A schematic of the agent and its interaction with the environment is shown in the following diagram:

Figure 1: Schematic showing the agent and its interaction with the environment
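The interaction loop in the diagram can be sketched in code. The environment below is a toy one-dimensional corridor invented purely for illustration; real environments (such as those in OpenAI Gym) follow the same reset/step pattern:

```python
import random

class CorridorEnv:
    """Toy environment: the agent starts at position 0 and must reach position 3."""
    def reset(self):
        self.pos = 0
        return self.pos                       # initial state s_0

    def step(self, action):                   # action is +1 (right) or -1 (left)
        self.pos = max(0, self.pos + action)
        done = self.pos >= 3                  # goal reached, episode over
        reward = 1.0 if done else 0.0
        return self.pos, reward, done         # s_{t+1}, r_{t+1}, done flag

env = CorridorEnv()
state = env.reset()
done = False
while not done:
    action = random.choice([-1, 1])           # a purely exploratory (random) policy
    state, reward, done = env.step(action)

print(state)  # 3: the agent stumbles onto the goal by trial and error
```

Even this random policy eventually succeeds; the point of RL is to learn from such trials so that the agent reaches its goal efficiently rather than by chance.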

Now that we know what an agent is, we will look at the policies that the agent learns, what value and advantage functions are, and how these quantities are used in RL.

Understanding policy, value, and advantage functions

A policy defines the guidelines for an agent's behavior at a given state. In mathematical terms, a policy is a mapping from a state of the agent to the action to be taken at that state. It is like a stimulus-response rule that the agent follows as it learns to explore the environment. In RL literature, it is usually denoted as π(at|st) – that is, it is a conditional probability distribution of taking an action at in a given state st. Policies can be deterministic, wherein the exact value of at is known at st, or can be stochastic where at is sampled from a distribution – typically this is a Gaussian distribution, but it can also be any other probability distribution.
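A minimal sketch of the deterministic versus stochastic distinction, using a made-up one-dimensional state and a Gaussian for the stochastic case:

```python
import numpy as np

def deterministic_policy(state):
    # The exact action is known at each state: here, steer against the error.
    return -0.5 * state[0]

def gaussian_policy(state, sigma=0.1):
    # The action is sampled from a Gaussian centred on the deterministic choice.
    mu = -0.5 * state[0]
    return np.random.normal(mu, sigma)

s_t = np.array([0.2])
a_det = deterministic_policy(s_t)  # always -0.1 for this state
a_sto = gaussian_policy(s_t)       # varies from sample to sample
print(a_det, a_sto)
```

In later chapters, the mean (and sometimes the standard deviation) of such a Gaussian is produced by a neural network rather than a fixed formula.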

In RL, value functions are used to define how good a state of an agent is. They are typically denoted by V(s) at state s and represent the expected long-term cumulative reward for being in that state. V(s) is given by the following expression, where E[.] is an expectation over samples, rt is the reward received at time step t, and γ ∈ [0, 1] is a discount factor that weighs immediate rewards more heavily than future ones:

V(s) = E[rt + γrt+1 + γ²rt+2 + ... | st = s]

Note that V(s) does not care about the optimum actions that an agent needs to take at the state s. Instead, it is a measure of how good a state is. So, how can an agent figure out the most optimum action at to take in a given state st at time instant t? For this, you can also define an action-value function, given by the following expression:

Q(s,a) = E[rt + γrt+1 + γ²rt+2 + ... | st = s, at = a]

Note that Q(s,a) is a measure of how good it is to take action a in state s and follow the same policy thereafter. So, it is different from V(s), which is a measure of how good a given state is. We will see in the following chapters how the value function is used to train the agent under the RL setting.

The advantage function is defined as the following:

A(s,a) = Q(s,a) - V(s)

This advantage function is known to reduce the variance of policy gradients, a topic that will be discussed in depth in a later chapter.
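A tabular toy example (the numbers are invented) shows how V(s) relates to Q(s,a) under a policy, and how the advantage follows:

```python
import numpy as np

# One state with three available actions; hypothetical values.
q_values = np.array([1.0, 4.0, 2.0])  # Q(s, a) for each action a
policy = np.array([0.2, 0.5, 0.3])    # pi(a | s), sums to 1

# V(s) is the expectation of Q(s, a) over the policy's action distribution.
v = np.dot(policy, q_values)          # 0.2*1 + 0.5*4 + 0.3*2 = 2.8

# A(s, a) = Q(s, a) - V(s): how much better each action is than average.
advantage = q_values - v

print(round(v, 6))  # 2.8
print(advantage)    # approximately [-1.8  1.2 -0.8]
```

Note that the advantage is positive only for actions better than the policy's average, and its expectation under the policy is zero; this centring is what reduces the variance of policy-gradient estimates.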

The classic RL textbook is Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, The MIT Press, 1998.

We will now define what an episode is and its significance in an RL context.