Reinforcement Learning with TensorFlow

By: Sayon Dutta

Overview of this book

Reinforcement learning (RL) allows you to develop smart, quick and self-learning systems in your business surroundings. It's an effective method for training learning agents and solving a variety of problems in Artificial Intelligence - from games, self-driving cars and robots, to enterprise applications such as data center energy saving (cooling data centers) and smart warehousing solutions. The book covers major advancements and successes achieved in deep reinforcement learning by synergizing deep neural network architectures with reinforcement learning. You'll also be introduced to the concept of reinforcement learning, its advantages and the reasons why it's gaining so much popularity. You'll explore MDPs, Monte Carlo tree searches, dynamic programming such as policy and value iteration, and temporal difference learning such as Q-learning and SARSA. You will use TensorFlow and OpenAI Gym to build simple neural network models that learn from their own actions. You will also see how reinforcement learning algorithms play a role in games, image processing and NLP. By the end of this book, you will have gained a firm understanding of what reinforcement learning is and understand how to put your knowledge to practical use by leveraging the power of TensorFlow and OpenAI Gym.

Scoring mechanism in sequential models in NLP

Two scoring mechanisms were used to evaluate the approaches mentioned in Chapter 14, Deep Reinforcement Learning in NLP, as follows:

BLEU

One of the biggest challenges for sequential models in NLP, which are used in machine translation, text summarization, image captioning, and more, is finding an adequate evaluation metric.

Suppose your use case is machine translation; you have a German phrase and there are multiple English translations of it. All of them look equally good. So, how do you evaluate a machine translation system if there are multiple equally good answers? This is unlike image recognition, where the target has only one right answer and not multiple, equally good right answers.

For example:

German sentence: Die Katze ist auf der Matte

Multiple human-generated reference translations of the preceding German sentence are as follows:

The cat is on the mat
There is a cat on the mat

If the target has just one right answer, measuring accuracy is easy, but if there are multiple equally correct possibilities, how is accuracy measured? In this section, we will study the BLEU score, which is an evaluation metric for measuring accuracy when there are multiple equally correct answers.

What is BLEU score and what does it do?

The BLEU score was published by Papineni et al. (2002) in their research publication named BLEU: a Method for Automatic Evaluation of Machine Translation (https://www.aclweb.org/anthology/P02-1040.pdf). BLEU stands for Bilingual Evaluation Understudy. For a given machine-generated output (say, a translation in the case of machine translation or a summary in the case of text summarization), the score measures the goodness of the output, that is, how close the machine-generated output is to any of the possible human-generated references (possible actual outputs). Thus, the closer the output text is to any human-generated reference, the higher the BLEU score.

The motivation behind the BLEU score was to devise a metric that can evaluate machine-generated text against human-generated references much as human evaluators would. The intuition behind the BLEU score is that it takes the words of the machine-generated output and checks whether they appear in at least one of the human-generated references.

Let's consider the following example:

Input German text: Der Hund ist unter der Decke

Say we have two human-generated references which are as follows:

Reference 1: The dog is under the blanket
Reference 2: There is a dog under the blanket

And say our machine translation system generated a terrible output: "the the the the the the".

The standard precision is given by the following formula:

$$\text{precision} = \frac{\text{number of words in the machine-generated output that appear in any reference}}{\text{total number of words in the machine-generated output}}$$

As such, the following applies:

$$\text{precision} = \frac{6}{6} = 1.0$$

Since "the" appears six times in the output and each "the" appears in at least one of the reference texts, precision is 1.0. The issue arises because of the basic definition of precision, which is the fraction of the predicted output that appears in the actual output (reference). The word "the" is the only text in the predicted output, and since it appears in the references, the resulting precision is 1.0.

Therefore, the definition of precision is modified by introducing a clip count. Here, the clip count of a word is the maximum number of times that word appears in any single reference, and the word's count in the machine-generated output is clipped to that maximum. Thus, modified precision is defined as the sum of the clipped counts of the words in the machine-generated output divided by the total number of words in the machine-generated output.

For the preceding example, "the" appears at most twice in a single reference (Reference 1), so the modified precision would be given as:

$$\text{modified precision} = \frac{\text{Count}_{clip}(\text{the})}{\text{Count}(\text{the})} = \frac{2}{6} \approx 0.33$$
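The difference between plain and clipped precision can be checked with a minimal Python sketch. The function names `plain_precision` and `modified_unigram_precision` are hypothetical, and the tokenizer (lowercasing and splitting on whitespace) is a simplifying assumption rather than what a production BLEU implementation does.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()


def plain_precision(output, references):
    """Fraction of output words that occur in at least one reference."""
    out_tokens = tokenize(output)
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(tokenize(ref))
    hits = sum(1 for word in out_tokens if word in ref_vocab)
    return hits / len(out_tokens)


def modified_unigram_precision(output, references):
    """Clip each word's count to its maximum count in any single reference."""
    out_counts = Counter(tokenize(output))
    ref_counts = [Counter(tokenize(ref)) for ref in references]
    clipped = 0
    for word, count in out_counts.items():
        max_in_any_ref = max(rc[word] for rc in ref_counts)
        clipped += min(count, max_in_any_ref)
    return clipped / sum(out_counts.values())


references = ["The dog is under the blanket", "There is a dog under the blanket"]
output = "the the the the the the"

print(f"{plain_precision(output, references):.2f}")             # 1.00
print(f"{modified_unigram_precision(output, references):.2f}")  # 0.33
```

The clipping step is what keeps a degenerate output from scoring perfectly: "the" is credited only twice, because no single reference contains it more than twice.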

So far, we have considered each word in isolation, that is, as a unigram. In the BLEU score, you also want to look at words in pairs and not just in isolation. Let's calculate the modified precision with the bi-gram approach, where a bi-gram is a pair of words appearing next to each other.

Let's consider the following example:

Input German text: Der Hund ist unter der Decke

Say we have two human-generated references, which are as follows:

Reference 1: The dog is under the blanket
Reference 2: There is a dog under the blanket

Machine-generated output: The dog the dog the dog under the blanket

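Bi-gram in the machine-generated output	Count	Count_clip (count clipped to the maximum occurrences of the bi-gram in any one reference)
the dog	3	1
dog the	2	0
dog under	1	1
under the	1	1
the blanket	1	1

Here, "the dog" is clipped from 3 to 1 because it occurs only once in Reference 1, while "dog under" keeps its count of 1 because it occurs in Reference 2. Therefore, the modified bi-gram precision is the ratio of the sum of the bi-gram Count_clip values to the sum of the bi-gram counts, that is:

$$p_2 = \frac{1 + 0 + 1 + 1 + 1}{3 + 2 + 1 + 1 + 1} = \frac{4}{8} = 0.5$$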
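The bi-gram counts can be reproduced with a short Python sketch. The helper names `ngrams` and `modified_ngram_precision` are hypothetical, and tokenization is again assumed to be lowercasing plus a whitespace split. Running it on the example shows that "dog under" occurs once in Reference 2, so its clipped count comes out as 1.

```python
from collections import Counter


def ngrams(tokens, n):
    # All contiguous n-token windows, as tuples so they can be dictionary keys.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_ngram_precision(output, references, n):
    """Return (precision, clipped counts per n-gram) for the given n."""
    out_counts = Counter(ngrams(output.lower().split(), n))
    ref_counts = [Counter(ngrams(ref.lower().split(), n)) for ref in references]
    clipped = {
        gram: min(count, max(rc[gram] for rc in ref_counts))
        for gram, count in out_counts.items()
    }
    total = sum(out_counts.values())
    precision = sum(clipped.values()) / total if total else 0.0
    return precision, clipped


references = ["The dog is under the blanket", "There is a dog under the blanket"]
output = "The dog the dog the dog under the blanket"

precision, clipped = modified_ngram_precision(output, references, 2)
for gram, count in clipped.items():
    print(" ".join(gram), count)
print(precision)  # 0.5
```

Passing `n=1` to the same function gives the modified uni-gram precision, so one helper covers every n-gram order.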

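Thus, we can write the modified precision formulae for uni-grams, bi-grams, and n-grams as follows, where MO denotes the machine-generated output:

p₁ = precision for uni-grams, where:

$$p_1 = \frac{\sum_{\text{unigram} \in MO} \text{Count}_{clip}(\text{unigram})}{\sum_{\text{unigram} \in MO} \text{Count}(\text{unigram})}$$

p₂ = precision for bi-grams, where:

$$p_2 = \frac{\sum_{\text{bigram} \in MO} \text{Count}_{clip}(\text{bigram})}{\sum_{\text{bigram} \in MO} \text{Count}(\text{bigram})}$$

p_n = precision for n-grams, where:

$$p_n = \frac{\sum_{\text{n-gram} \in MO} \text{Count}_{clip}(\text{n-gram})}{\sum_{\text{n-gram} \in MO} \text{Count}(\text{n-gram})}$$

The modified precisions calculated on uni-grams, bi-grams, or any n-grams allow you to measure the degree to which the machine-generated output text is similar to the human-generated references. If the machine-generated text is identical to any one of the human-generated references, then:

$$p_1 = p_2 = \dots = p_n = 1.0$$

Let's put all the p_i scores together to calculate the final BLEU score for the machine-generated output. Since p_n is the modified precision on n-grams only, the combined BLEU score, where n_max = N, is given by the following (using the uniform weights 1/N from Papineni et al.):

$$\text{BLEU} = BP \cdot \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)$$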

BP is called the brevity penalty. This penalty comes into the picture if the machine-generated output is very short, because most of the words in a short output sequence have a very high chance of appearing in the human-generated references, which inflates the modified precisions. Thus, the brevity penalty acts as an adjustment factor that penalizes the machine-generated text when it is not longer than the shortest human-generated reference for that input.

Brevity penalty (BP) is given by the following formula:

$$BP = \begin{cases} 1 & \text{if } \text{len}(MO) > \text{s\_len}(REF) \\ \exp\left(1 - \dfrac{\text{s\_len}(REF)}{\text{len}(MO)}\right) & \text{otherwise} \end{cases}$$

where:

len(MO) = length of the machine-generated output

s_len(REF) = length of the shortest human-generated reference output
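Putting the pieces together, the following is a minimal Python sketch of a sentence-level BLEU score. It makes several assumptions: the modified precisions are combined as a geometric mean with uniform weights, as in Papineni et al.; the brevity penalty uses the shortest reference length, as defined in this section (the paper itself uses the reference length closest to the output length); and the score is set to 0 when any p_n is 0, since the logarithm is undefined there. All function names are hypothetical, and the sketch is not a substitute for an established implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(out_tokens, refs_tokens, n):
    # Clipped n-gram matches divided by the total n-grams in the output.
    out_counts = Counter(ngrams(out_tokens, n))
    ref_counts = [Counter(ngrams(ref, n)) for ref in refs_tokens]
    total = sum(out_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(
        min(count, max(rc[gram] for rc in ref_counts))
        for gram, count in out_counts.items()
    )
    return clipped / total


def brevity_penalty(out_len, shortest_ref_len):
    # BP = 1 if len(MO) > s_len(REF), else exp(1 - s_len(REF) / len(MO)).
    if out_len > shortest_ref_len:
        return 1.0
    if out_len == 0:
        return 0.0
    return math.exp(1 - shortest_ref_len / out_len)


def bleu(output, references, max_n=4):
    out_tokens = output.lower().split()
    refs_tokens = [ref.lower().split() for ref in references]
    precisions = [
        modified_precision(out_tokens, refs_tokens, n)
        for n in range(1, max_n + 1)
    ]
    if min(precisions) == 0:
        return 0.0  # Assumption: no smoothing, so any zero p_n gives BLEU = 0.
    log_mean = sum(math.log(p) for p in precisions) / max_n
    bp = brevity_penalty(len(out_tokens), min(len(r) for r in refs_tokens))
    return bp * math.exp(log_mean)


references = ["The dog is under the blanket", "There is a dog under the blanket"]

print(f"{bleu('The dog is under the blanket', references):.2f}")  # 1.00
print(f"{bleu('the the the the the the', references):.2f}")       # 0.00
print(f"{bleu('The dog the dog the dog under the blanket', references, max_n=2):.2f}")  # 0.53
print(f"{brevity_penalty(3, 6):.2f}")                              # 0.37
```

An output identical to one of the references scores 1.0, while the degenerate "the the the the the the" scores 0 because none of its bi-grams occur in any reference.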

For more details, please check the publication on the BLEU score by Papineni et al. (2002) (https://www.aclweb.org/anthology/P02-1040.pdf).

ROUGE

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is also a metric for evaluating sequential models in NLP, especially automatic text summarization and machine translation. ROUGE was proposed by Chin-Yew Lin in 2004 in the research publication named ROUGE: A Package for Automatic Evaluation of Summaries (http://www.aclweb.org/anthology/W04-1013).

ROUGE also works by comparing the machine-generated output (automatic summaries or translations) against a set of human-generated references.

Let's consider the following example:

Machine-generated output: the dog was found under the bed
Human-generated reference: the dog was under the bed

Therefore, recall in the context of ROUGE is computed as follows:

$$\text{recall} = \frac{\text{number of overlapping words}}{\text{total number of words in the human-generated reference}}$$

Thus, recall = 6/6 = 1.0.

If recall is 1.0, it means that all the words in the human-generated reference are captured by the machine-generated output. However, the machine-generated output might be extremely long, and a long output has a high chance of covering most of the human-generated reference words. This is where precision comes to the rescue, which is computed as follows:

$$\text{precision} = \frac{\text{number of overlapping words}}{\text{total number of words in the machine-generated output}}$$

Thus, precision (for the preceding example) = 6/7 = 0.86.

Now, if the machine-generated output had been "the big black dog was found under the big round bed", then:

$$\text{precision} = \frac{6}{11} \approx 0.55$$

This shows that the machine-generated output isn't appropriate, since it contains a good number of unnecessary words. Recall alone is therefore not sufficient, and both recall and precision should be used together for evaluation. Thus, the F1-score, which is calculated as the harmonic mean of recall and precision as follows, is a good evaluation metric in such cases:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
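The word-overlap recall, precision, and F1-score for both example outputs can be computed with a small Python sketch. The function name `rouge_unigram` is hypothetical. The overlap is counted with a multiset intersection so that a repeated word such as "the" is matched at most as many times as it occurs in both texts, and tokenization is assumed to be lowercasing plus a whitespace split.

```python
from collections import Counter


def rouge_unigram(output, reference):
    """Return (recall, precision, f1) based on overlapping words."""
    out_counts = Counter(output.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter & Counter keeps the minimum count of each shared word.
    overlap = sum((out_counts & ref_counts).values())
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(out_counts.values())
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


reference = "the dog was under the bed"

short_output = "the dog was found under the bed"
long_output = "the big black dog was found under the big round bed"

for candidate in (short_output, long_output):
    recall, precision, f1 = rouge_unigram(candidate, reference)
    print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
# recall=1.00 precision=0.86 f1=0.92
# recall=1.00 precision=0.55 f1=0.71
```

Both outputs reach a recall of 1.0, but the F1-score drops for the longer one, which is the behavior the harmonic mean is meant to capture.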

ROUGE-1 refers to the overlap of unigrams between the machine-generated output and human-generated references
ROUGE-2 refers to the overlap of bi-grams between the machine-generated output and human-generated references

Let's understand more about ROUGE-2 with the following example:

Machine-generated output: the dog was found under the bed
Human-generated reference: the dog was under the bed

Bigrams of the machine-generated output, that is, "the dog was found under the bed":

"the dog"

"dog was"

"was found"

"found under"

"under the"

"the bed"

Bigrams of the human-generated reference, that is, "the dog was under the bed":

"the dog"

"dog was"

"was under"

"under the"

"the bed"

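Therefore, since four bigrams ("the dog", "dog was", "under the", and "the bed") overlap:

$$\text{ROUGE-2}_{Recall} = \frac{\text{number of overlapping bigrams}}{\text{number of bigrams in the reference}} = \frac{4}{5} = 0.8$$

$$\text{ROUGE-2}_{Precision} = \frac{\text{number of overlapping bigrams}}{\text{number of bigrams in the machine-generated output}} = \frac{4}{6} \approx 0.67$$

Thus, ROUGE-2_Precision shows that 67% of the bi-grams generated by the machine overlap with the human-generated reference.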
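The ROUGE-2 figures for this example can be verified with a brief Python sketch that generalizes the overlap count to n-grams. The function name `rouge_n` is hypothetical, a single reference is assumed, and tokenization is lowercasing plus a whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(output, reference, n):
    """Return (recall, precision, overlapping n-grams) for one reference."""
    out_counts = Counter(ngrams(output.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    overlap = out_counts & ref_counts  # multiset intersection
    matched = sum(overlap.values())
    recall = matched / sum(ref_counts.values())
    precision = matched / sum(out_counts.values())
    return recall, precision, sorted(" ".join(gram) for gram in overlap)


output = "the dog was found under the bed"
reference = "the dog was under the bed"

recall, precision, shared = rouge_n(output, reference, 2)
print(shared)  # ['dog was', 'the bed', 'the dog', 'under the']
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.80 precision=0.67
```

Calling the same function with `n=1` gives the ROUGE-1 values for this pair.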

This appendix covered a basic overview of ROUGE scoring for sequential models in NLP. For further details on ROUGE-N, ROUGE-L, and ROUGE-S, please go through the research publication ROUGE: A Package for Automatic Evaluation of Summaries (http://www.aclweb.org/anthology/W04-1013) by Chin-Yew Lin.
