Reinforcement Learning with TensorFlow

By : Sayon Dutta

Overview of this book

Reinforcement learning (RL) allows you to develop smart, quick and self-learning systems in your business surroundings. It's an effective method for training learning agents and solving a variety of problems in Artificial Intelligence - from games, self-driving cars and robots, to enterprise applications such as data center energy saving (cooling data centers) and smart warehousing solutions. The book covers major advancements and successes achieved in deep reinforcement learning by synergizing deep neural network architectures with reinforcement learning. You'll also be introduced to the concept of reinforcement learning, its advantages and the reasons why it's gaining so much popularity. You'll explore MDPs, Monte Carlo tree searches, dynamic programming such as policy and value iteration, and temporal difference learning such as Q-learning and SARSA. You will use TensorFlow and OpenAI Gym to build simple neural network models that learn from their own actions. You will also see how reinforcement learning algorithms play a role in games, image processing and NLP. By the end of this book, you will have gained a firm understanding of what reinforcement learning is and understand how to put your knowledge to practical use by leveraging the power of TensorFlow and OpenAI Gym.

Continuous action space algorithms

Deep reinforcement learning offers many algorithms for continuous action spaces. The ones we covered earlier in Chapter 4, Policy Gradients, were mainly stochastic policy gradients and stochastic actor-critic algorithms. Stochastic policy gradients suffer from several problems, such as the difficulty of choosing a step size: the data is non-stationary because the observation and reward distributions change continuously as the policy changes, so a single bad step can adversely affect the learning of the policy network parameters. Therefore, there was a need for an approach that restricts the policy search space and avoids bad steps while training the policy network parameters.

Here, we will try to cover some of the advanced continuous action space algorithms:

Trust region policy optimization
Deterministic policy gradients

Trust region policy optimization

Trust region policy optimization (TRPO) is an iterative approach for optimizing policies, including large nonlinear policies such as neural networks. TRPO restricts the policy search space by applying a constraint on the output policy distributions. To do this, the KL divergence ($D_{KL}$) between the old and the new policy distributions is used to penalize large changes to the policy network parameters. This KL divergence constraint between the new and the old policy is called the trust region constraint. As a result of this constraint, large-scale changes don't occur in the policy distribution, which results in earlier convergence of the policy network.
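To make the trust region idea concrete, here is a minimal sketch in plain Python. It is not the TRPO algorithm from the paper, which also computes its search direction with conjugate gradients and checks a surrogate objective; it only illustrates the KL check. The categorical policy over three actions, the function names, the proposed update direction, and the threshold max_kl=0.01 are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def trust_region_step(old_logits, direction, max_kl=0.01,
                      step=1.0, shrink=0.5, max_backtracks=20):
    """Shrink the step until KL(old policy || new policy) <= max_kl.

    Hypothetical helper: it keeps only the KL part of a backtracking
    line search and ignores the surrogate-objective check.
    """
    old_probs = softmax(old_logits)
    for _ in range(max_backtracks):
        new_logits = [l + step * d for l, d in zip(old_logits, direction)]
        kl = kl_divergence(old_probs, softmax(new_logits))
        if kl <= max_kl:
            return new_logits, kl, step
        step *= shrink
    # No acceptable step found: keep the old policy unchanged.
    return list(old_logits), 0.0, 0.0


# Assumed example: a uniform policy and an aggressive proposed update.
old_logits = [0.0, 0.0, 0.0]
direction = [5.0, -5.0, 0.0]
new_logits, kl, step = trust_region_step(old_logits, direction)
print("accepted step:", step, "KL:", kl)
```

The full step would move the policy far away from the old one, so the search keeps halving it until the new distribution stays inside the trust region.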

TRPO was published by Schulman et al. in 2015 in the research publication Trust Region Policy Optimization (https://arxiv.org/pdf/1502.05477.pdf). The authors report experiments demonstrating the robust performance of TRPO on different tasks, such as learning simulated robotic swimming, playing Atari games, and many more. To study TRPO in detail, please refer to the publication.

Deterministic policy gradients

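Deterministic policy gradients were proposed by Silver et al. in the publication Deterministic Policy Gradient Algorithms (http://proceedings.mlr.press/v32/silver14.pdf). In continuous action spaces, policy improvement with a greedy approach becomes difficult because it requires a global optimization over actions at every step. It is therefore simpler and more tractable to update the policy network parameters in the direction of the gradient of the Q function, as follows:

$$\theta^{k+1} = \theta^{k} + \alpha \, \mathbb{E}_{s \sim \rho^{\mu^{k}}}\left[\nabla_{\theta} Q^{\mu^{k}}\left(s, \mu_{\theta}(s)\right)\right]$$

where $\mu_{\theta}(s)$ is the deterministic policy, $\rho^{\mu^{k}}$ is the state distribution under the current policy $\mu^{k}$, α is the learning rate, and θ represents the policy network parameters. By applying the chain rule, the policy improvement can be shown as follows:

$$\theta^{k+1} = \theta^{k} + \alpha \, \mathbb{E}_{s \sim \rho^{\mu^{k}}}\left[\nabla_{\theta} \mu_{\theta}(s) \, \nabla_{a} Q^{\mu^{k}}(s, a)\big|_{a = \mu_{\theta}(s)}\right]$$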
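The chain-rule update described above can be illustrated with a toy sketch in plain Python. Everything concrete here is an assumption made for illustration: a scalar state, a linear deterministic policy a = θ·s, and a known Q function Q(s, a) = -(a - 2s)² standing in for a critic. Since this Q is maximized at a = 2s, θ should approach 2.

```python
import random


def mu(theta, s):
    """Deterministic linear policy (assumed form): a = theta * s."""
    return theta * s


def dq_da(s, a):
    """Gradient with respect to a of the assumed Q(s, a) = -(a - 2*s)**2."""
    return -2.0 * (a - 2.0 * s)


def dpg_update(theta, states, alpha):
    """One gradient ascent step on Q, averaged over a batch of states.

    Chain rule: grad_theta Q(s, mu(s)) = grad_theta mu(s) * grad_a Q(s, a),
    with grad_a Q evaluated at a = mu(s).
    """
    grad = 0.0
    for s in states:
        a = mu(theta, s)
        dmu_dtheta = s  # derivative of theta * s with respect to theta
        grad += dmu_dtheta * dq_da(s, a)
    return theta + alpha * grad / len(states)


rng = random.Random(0)
theta = 0.0
for _ in range(200):
    batch = [rng.uniform(-1.0, 1.0) for _ in range(32)]
    theta = dpg_update(theta, batch, alpha=0.1)
print("learned theta:", theta)
```

Note that no maximization over actions is ever performed; the policy parameter simply follows the gradient of Q through the action.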

The preceding update rule can be incorporated into a policy network whose parameters are updated using stochastic gradient ascent. This can be realized as a deterministic actor-critic method, where the critic estimates the action-value function while the actor derives its gradients from the critic to update its parameters. In their experiments, Silver et al. (Deterministic Policy Gradient Algorithms, http://proceedings.mlr.press/v32/silver14.pdf) found that deterministic policy gradients are more efficient than their stochastic counterparts, and that the deterministic actor-critic outperformed its stochastic counterpart by a significant margin. A detailed explanation of this topic is beyond the scope of this book; please refer to the research publication linked previously.
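As a rough illustration of the actor-critic arrangement described above, the following plain Python sketch learns both parts on a toy problem. It is not the algorithm from the paper: the one-step environment with reward -(a - 2s)², the quadratic critic features, the uniform exploration noise, the critic-only warm-up phase, and all learning rates are assumptions chosen to keep the example small. Because every episode lasts one step, the critic regresses directly on the reward and no bootstrapping is involved.

```python
import random


def features(s, a):
    """Assumed critic features, so that Q_w(s, a) is quadratic in (s, a)."""
    return [a * a, a * s, s * s]


def q_value(w, s, a):
    """Critic estimate Q_w(s, a) = w . features(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))


def dq_da(w, s, a):
    """Derivative of w0*a^2 + w1*a*s + w2*s^2 with respect to a."""
    return 2.0 * w[0] * a + w[1] * s


def reward(s, a):
    """Hypothetical one-step environment: best action is a = 2 * s."""
    return -(a - 2.0 * s) ** 2


def train(steps=20000, warmup=5000, critic_lr=0.01, actor_lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = 0.0           # actor parameter, policy a = theta * s
    w = [0.0, 0.0, 0.0]   # critic weights
    for t in range(steps):
        s = rng.uniform(-1.0, 1.0)
        # Behaviour action: deterministic policy plus exploration noise.
        a = theta * s + rng.uniform(-1.0, 1.0)
        r = reward(s, a)
        # Critic: one SGD step on the squared error (r - Q_w(s, a))^2.
        error = r - q_value(w, s, a)
        phi = features(s, a)
        w = [wi + critic_lr * error * fi for wi, fi in zip(w, phi)]
        # Actor: after the warm-up, follow the critic's action gradient.
        if t >= warmup:
            a_det = theta * s
            theta += actor_lr * s * dq_da(w, s, a_det)
    return theta, w


theta, w = train()
print("actor parameter:", theta)
print("critic weights:", w)
```

The warm-up phase is a design choice of this sketch: it lets the critic become roughly accurate before the actor starts trusting its gradients, which keeps early actor updates from following a poor estimate.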
