Reinforcement Learning with TensorFlow

By: Sayon Dutta

Overview of this book

Reinforcement learning (RL) allows you to develop smart, quick and self-learning systems in your business surroundings. It's an effective method for training learning agents and solving a variety of problems in Artificial Intelligence - from games, self-driving cars and robots, to enterprise applications such as data center energy saving (cooling data centers) and smart warehousing solutions. The book covers major advancements and successes achieved in deep reinforcement learning by synergizing deep neural network architectures with reinforcement learning. You'll also be introduced to the concept of reinforcement learning, its advantages and the reasons why it's gaining so much popularity. You'll explore MDPs, Monte Carlo tree searches, dynamic programming such as policy and value iteration, and temporal difference learning such as Q-learning and SARSA. You will use TensorFlow and OpenAI Gym to build simple neural network models that learn from their own actions. You will also see how reinforcement learning algorithms play a role in games, image processing and NLP. By the end of this book, you will have gained a firm understanding of what reinforcement learning is and understand how to put your knowledge to practical use by leveraging the power of TensorFlow and OpenAI Gym.

Reinforcement learning

Reinforcement learning is a branch of artificial intelligence in which an agent perceives its environment as a state, chooses from a set of available actions, and acts on the environment. Each action results in a new state and a reward, which the agent receives as feedback for that action. This received reward is assigned to the new state. Just as we had to minimize the cost function in order to train our neural network, here the reinforcement learning agent has to maximize the overall reward to find the optimal policy for solving a particular task.

How is this different from supervised and unsupervised learning?

In supervised learning, the training dataset has input features, X, and their corresponding output labels, Y. A model is trained on this dataset. At test time, the model is given test cases with input features, X', and predicts Y'.

In unsupervised learning, only the input features, X, of the training set are given for training. There are no associated Y values. The goal is to create a model that learns the underlying pattern in the data and uses it to segregate the data into different clusters. This model is then used on new input features, X', to predict which of the clusters they are most similar to.

Reinforcement learning is different from both supervised and unsupervised learning. It can guide an agent on how to act in the real world. The interface is broader than the training vectors used in supervised or unsupervised learning: here, it is the entire environment, which can be a real or a simulated world. Agents are also trained in a different way. Their objective is to reach a goal state, whereas in supervised learning the objective is to maximize the likelihood or minimize a cost.

Reinforcement learning agents receive feedback, that is, rewards, automatically from the environment, unlike in supervised learning, where labeling requires time-consuming human effort. One of the bigger advantages of reinforcement learning is that phrasing any task's objective in the form of a goal helps in solving a wide variety of problems. For example, the goal of a video game agent would be to win the game by achieving the highest score. This also helps in discovering new approaches to achieving the goal. For example, when AlphaGo defeated the world champion in Go, it found new, unique ways of winning.

A reinforcement learning agent is like a human. Humans evolved very slowly; an agent also improves through reinforcement, but it can do so very fast. As far as sensing the environment is concerned, neither humans nor artificial intelligence agents can sense the entire world at once. The perceived environment forms a state in which the agent performs an action and lands in a new state, that is, a newly perceived environment different from the earlier one. This creates a state space that can be finite or infinite.

One of the sectors most interested in this technology is defense. Can reinforcement learning agents replace soldiers, so that they not only walk but also fight and make important decisions?

Basic terminologies and conventions

The following are the basic terminologies associated with reinforcement learning:

Agent: The program we create so that it is able to sense the environment, perform actions, receive feedback, and try to maximize rewards.
Environment: The world where the agent resides. It can be real or simulated.
State: The perception or configuration of the environment that the agent senses. State spaces can be finite or infinite.
Rewards: Feedback the agent receives after any action it has taken. The goal of the agent is to maximize the overall reward, that is, the immediate and the future reward. Rewards are defined in advance. Therefore, they must be designed carefully to achieve the goal efficiently.
Actions: Anything that the agent is capable of doing in the given environment. Action spaces can be finite or infinite.
SAR triple: (state, action, reward) is referred to as the SAR triple, represented as (s, a, r).
Episode: Represents one complete run of the whole task.

Let's deduce the convention shown in the following diagram:

Every task is a sequence of SAR triples. We start from state S(t), perform action A(t), receive a reward R(t+1), and land in a new state S(t+1). In other words, the current state and action pair determines the reward for the next step. Since S(t) and A(t) result in S(t+1), we also have a triple of (current state, action, new state), that is, [S(t),A(t),S(t+1)] or (s,a,s').
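
The interaction loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four-cell corridor environment, the `env_step` and `run_episode` helpers, and the random agent are all hypothetical and are not taken from the book or from OpenAI Gym. The sketch records both the (s, a, r) triple and the (s, a, s') triple at every step of one episode.

```python
import random

GOAL = 3  # hypothetical terminal state of a 4-cell corridor (states 0..3)


def env_step(state, action):
    """Toy environment (an assumption for illustration): action is -1 or +1."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return reward, next_state


def run_episode(seed=0):
    """Run one episode with a random agent and record every step."""
    rng = random.Random(seed)
    state = 0
    sar_triples = []   # (s, a, r) for each step
    transitions = []   # (s, a, s') for each step
    while state != GOAL:
        action = rng.choice((-1, 1))                   # A(t)
        reward, next_state = env_step(state, action)   # R(t+1), S(t+1)
        sar_triples.append((state, action, reward))
        transitions.append((state, action, next_state))
        state = next_state
    return sar_triples, transitions


sar_triples, transitions = run_episode()
```

Each entry of `transitions` ends in the state that the next entry starts from, which is exactly the chaining of S(t), A(t), and S(t+1) described in the text.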

Optimality criteria

The optimality criteria are a measure of the goodness of fit of the model created over the data. For example, in supervised classification learning algorithms, we have maximum likelihood as the optimality criterion. Thus, the optimality criteria differ depending on the problem statement and the objective. In reinforcement learning, our major goal is to maximize the future rewards. Therefore, we have two different optimality criteria, which are:

Value function: To quantify a state on the basis of future probable rewards
Policy: To guide an agent on what action to take in a given state

We will discuss both of them in detail in the coming topics.

The value function for optimality

Agents should be able to think about both immediate and future rewards. Therefore, a value that also reflects this future information is assigned to each encountered state. This is called the value function. This is where the concept of delayed rewards comes in: actions taken now can lead to potential rewards in the future.

V(s), that is, the value of a state, is defined as the expected value of the rewards to be received in the future for all the actions taken from this state through subsequent states until the agent reaches the goal state. Basically, the value function tells us how good it is to be in this state. The higher the value, the better the state.

The reward assigned to each (s,a,s') triple is fixed. This is not the case with the value of a state; it is subject to change with every action in the episode and across different episodes too.

One alternative comes to mind: instead of the value function, why don't we store the knowledge of every possible state?

The answer is simple: it's time-consuming and expensive, and this cost grows exponentially. Therefore, it's better to store the knowledge of the current state, that is, V(s):

V(s) = E[all future rewards discounted | S(t)=s]
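
As a rough illustration of this definition, the sketch below computes the discounted sum of rewards for a few episodes and averages them to approximate the expectation. The discount factor `gamma`, the helper names, and the sample reward sequences are assumptions made for this example. Averaging sampled returns is only one simple way to estimate V(s) and is not necessarily the method the book uses.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


def estimate_value(episodes_from_s, gamma):
    """Average discounted return over sampled episodes that all start in s."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes_from_s]
    return sum(returns) / len(returns)


# Hypothetical reward sequences observed in three episodes starting from the same state s.
episodes_from_s = [[0.0, 0.0, 1.0], [0.0, 1.0], [1.0]]
v_s = estimate_value(episodes_from_s, gamma=0.5)
```

With `gamma=0.5`, the three episodes have discounted returns of 0.25, 0.5, and 1.0, so the estimate of V(s) is their mean. Episodes that reach the reward sooner contribute more, which is the effect of discounting.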

More details on the value function will be covered in Chapter 3, The Markov Decision Process and Partially Observable MDP.

The policy model for optimality

A policy is defined as the model that guides the agent's action selection in different states. The policy is denoted as π. π is basically the probability of a certain action given a particular state:

π(a|s) = P[A(t)=a | S(t)=s]

Thus, a policy map provides the set of probabilities of the different actions given a particular state. The policy and the value function together form a solution that lets the agent navigate according to the policy and the calculated value of each state.
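
A policy map of this kind can be pictured as a lookup table from states to action probabilities. The following sketch assumes a hypothetical two-state, two-action problem. The state names, action names, probabilities, and helper functions are all invented for illustration.

```python
import random

# Hypothetical policy map: for each state, a probability for every action.
policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
}


def action_probability(policy, state, action):
    """Probability of taking `action` in `state` under the policy."""
    return policy[state][action]


def sample_action(policy, state, rng):
    """Draw an action according to the probabilities the policy gives for `state`."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
chosen = sample_action(policy, "s0", rng)
```

The probabilities for each state sum to 1, and the agent acts by sampling from the row that belongs to its current state.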

The Q-learning approach to reinforcement learning

Q-learning is an attempt to learn the value Q(s,a) of taking a specific action in a particular state. Consider a table in which the number of rows represents the number of states and the number of columns represents the number of actions. This is called a Q-table. We have to learn its values to find which action is the best for the agent in a given state.

Steps involved in Q-learning:

Initialize the table of Q(s,a) with uniform values (say, all zeros).
Observe the current state, s.
Choose an action, a, by epsilon-greedy or any other action selection policy, and take the action.
As a result, a reward, r, is received and a new state, s', is perceived.
Update the Q-value of the (s,a) pair in the table by using the following Bellman equation, where γ is the discounting factor:

Q(s,a) = r + γ max_a' Q(s',a')

Then, set the new state as the current state and repeat the process until the episode is complete, that is, until the terminal state is reached.
Run multiple episodes to train the agent.

To simplify, we can say that the Q-value for a given state, s, and action, a, is updated by the sum of the current reward, r, and the maximum Q-value among all actions of the new state, discounted by the factor γ. The discount factor reduces the weight of future rewards compared to present rewards. For example, a reward of 100 today is worth more than a reward of 100 in the future; equivalently, a reward of 100 in the future is worth less than 100 today. Therefore, we discount the future rewards. Repeating this update process continuously results in the Q-table values converging to accurate measures of the expected future reward for a given action in a given state.
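
The steps above can be sketched as a small tabular Q-learning loop. This is a minimal example under stated assumptions: the five-state corridor environment, the constants, the random tie-breaking, and the function names are hypothetical and chosen only for illustration. The update follows the description in the text, that is, the reward plus the discounted maximum Q-value of the new state, without a learning rate. That simplification is enough here because the toy environment is deterministic.

```python
import random

N_STATES = 5       # states 0..4 laid out in a row; state 4 is terminal
ACTIONS = (0, 1)   # 0 = move left, 1 = move right
GAMMA = 0.9        # discount factor
EPSILON = 0.2      # exploration rate for epsilon-greedy


def step(state, action):
    """Hypothetical deterministic environment: returns (reward, next_state)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return reward, next_state


def choose_action(q_table, state, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q_table[state])
    return rng.choice([a for a in ACTIONS if q_table[state][a] == best])


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    # Initialize Q(s,a) with zeros: one row per state, one column per action.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = choose_action(q_table, state, rng)
            reward, next_state = step(state, action)
            # Update: reward plus discounted best Q-value of the new state.
            q_table[state][action] = reward + GAMMA * max(q_table[next_state])
            state = next_state
    return q_table


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After training, `greedy_policy` holds the highest-valued action for each non-terminal state. In this corridor the Q-values for moving right shrink by a factor of `GAMMA` for every extra step between the state and the goal, which shows how discounting propagates the final reward backward through the table.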

As the state and action spaces grow, maintaining a Q-table becomes difficult. In many real-world problems, the state space is extremely large or even infinite. Thus, we need another approach that can produce Q(s,a) without a Q-table. One solution is to replace the Q-table with a function. This function takes the state as input in the form of a vector and outputs the vector of Q-values for all the actions in the given state. This function approximator can be represented by a neural network that predicts the Q-values. We can add more layers and fit a deep neural network for better prediction of Q-values when the state and action spaces become large, which is impractical with a Q-table. This gives rise to the Q-network, and if a deeper neural network, such as a convolutional neural network, is used, the result is a deep Q-network (DQN).
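
To make the idea of a function in place of a table concrete, here is a small forward pass written in plain Python. It is a sketch under assumptions, not the book's TensorFlow implementation: the layer sizes, the random untrained weights, and the helper names are all hypothetical. It shows only the shape of the mapping, with a state vector going in and one Q-value per action coming out.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, relu=False):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in outputs] if relu else outputs


def q_values(state_vector, hidden_layer, output_layer):
    """Map a state vector to one Q-value per action."""
    return dense(dense(state_vector, hidden_layer, relu=True), output_layer)


rng = random.Random(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 8, 2   # illustrative sizes, not from the book
hidden_layer = init_layer(STATE_DIM, HIDDEN, rng)
output_layer = init_layer(HIDDEN, N_ACTIONS, rng)

q = q_values([0.1, -0.2, 0.05, 0.3], hidden_layer, output_layer)
best_action = max(range(N_ACTIONS), key=lambda a: q[a])
```

The weights here are untrained, so the Q-values are meaningless until the network is fitted. In practice the weights would be adjusted so that the outputs approach the targets given by the Q-learning update.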

More details on Q-learning and deep Q-networks will be covered in Chapter 5, Q-Learning and Deep Q-Networks.

Asynchronous advantage actor-critic

The A3C algorithm was published in June 2016 by a combined team from Google DeepMind and MILA. It is simpler and has a lighter framework that uses asynchronous gradient descent to optimize the deep neural network. It is faster and was able to show good results on a multi-core CPU instead of a GPU. One of A3C's big advantages is that it can work on continuous as well as discrete action spaces. As a result, it has opened the gateway to many new challenging problems that have complex state and action spaces.

We will discuss it at a high level here and dig deeper in Chapter 6, Asynchronous Methods. Let's start with the name, that is, the asynchronous advantage actor-critic (A3C) algorithm, and unpack it to get a basic overview of the algorithm:

Asynchronous: Recall that in DQN we used a neural network with our agent to predict actions. This means there is one agent interacting with one environment. A3C instead creates multiple copies of the agent-environment pair so that the agent learns more efficiently. A3C has a global network and multiple worker agents. Each worker has its own set of network parameters and interacts with its own copy of the environment simultaneously, without interacting with another agent's environment. This works better than a single agent because the experience of each agent is independent of the experience of the other agents. Thus, the overall experience from all the worker agents results in diverse training.
Actor-critic: Actor-critic combines the benefits of both value iteration and policy iteration. Thus, the network will estimate both a value function, V(s), and a policy, π(s), for a given state, s. There will be two separate fully-connected layers at the top of the function approximator neural network that will output the value and the policy of the state, respectively. The agent uses the value, which acts as a critic, to update the policy, that is, the intelligent actor.
Advantage: Policy gradients used discounted returns to tell the agent whether an action was good or bad. Replacing them with the advantage not only quantifies how good or bad the action was but also helps in encouraging and discouraging actions better (we will discuss this in Chapter 4, Policy Gradients).
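
The advantage idea can be illustrated with a short calculation. The sketch below assumes the common estimate in which the advantage at each step is the discounted return minus the critic's value estimate for that state. The text above does not spell out the formula, so treat this as an assumption, along with the helper names and the sample numbers.

```python
def discounted_returns(rewards, gamma):
    """Discounted return from each time step to the end of the episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, value_estimates, gamma):
    """Advantage per step: discounted return minus the critic's value estimate."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, value_estimates)]


# Hypothetical three-step episode and critic outputs V(s) for the visited states.
rewards = [0.0, 0.0, 1.0]
value_estimates = [0.5, 0.5, 0.5]
adv = advantages(rewards, value_estimates, gamma=0.5)
```

A positive advantage means the action turned out better than the critic expected, so it is encouraged. A negative advantage means it turned out worse and is discouraged. The size of the value says by how much.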
