Reinforcement Learning with TensorFlow

By: Sayon Dutta

Overview of this book

Reinforcement learning (RL) allows you to develop smart, quick, self-learning systems in your business environment. It's an effective method for training learning agents and solving a variety of problems in Artificial Intelligence - from games, self-driving cars, and robots to enterprise applications such as data center energy saving (cooling data centers) and smart warehousing solutions. The book covers major advancements and successes achieved in deep reinforcement learning by synergizing deep neural network architectures with reinforcement learning. You'll also be introduced to the concept of reinforcement learning, its advantages, and the reasons why it's gaining so much popularity. You'll explore MDPs, Monte Carlo tree search, dynamic programming methods such as policy and value iteration, and temporal difference learning methods such as Q-learning and SARSA. You will use TensorFlow and OpenAI Gym to build simple neural network models that learn from their own actions. You will also see how reinforcement learning algorithms play a role in games, image processing, and NLP. By the end of this book, you will have gained a firm understanding of what reinforcement learning is and how to put your knowledge to practical use by leveraging the power of TensorFlow and OpenAI Gym.
Table of Contents (21 chapters)

Policy objective functions


Let's now discuss how to optimize a policy. In policy-based methods, our main objective is to find, for a given policy $\pi_\theta$ with parameter vector $\theta$, the best values of that parameter vector. To decide which values are best, we measure the quality $J(\theta)$ of the policy $\pi_\theta$ for different values of $\theta$.
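As a minimal illustration (not taken from the book), a policy with parameter vector $\theta$ can be as simple as a linear softmax over state features: changing $\theta$ changes the action probabilities, and therefore the quality of the policy. The feature vector and parameter values below are made-up numbers:

```python
import numpy as np

def softmax_policy(theta, state_features):
    """Action probabilities pi_theta(a|s) from a linear softmax.

    theta: (n_features, n_actions) parameter matrix
    state_features: (n_features,) feature vector for state s
    """
    logits = state_features @ theta        # one logit per action
    logits = logits - logits.max()         # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Two candidate parameter settings for the same 3-action policy
phi = np.array([1.0, 0.5])                 # hypothetical features of a state s
theta_a = np.zeros((2, 3))                 # all-zero parameters: uniform policy
theta_b = np.array([[2.0, 0.0, -2.0],
                    [1.0, 0.0, -1.0]])     # parameters that favor action 0

print(softmax_policy(theta_a, phi))        # roughly uniform, each prob ~1/3
print(softmax_policy(theta_b, phi))        # probability mass shifted to action 0
```

Each choice of $\theta$ yields a different policy, and the next step is to score each of them with a quality measure $J(\theta)$.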

Before discussing the optimization methods, let's first figure out the different ways to measure the quality of a policy $\pi_\theta$:

  • If it's an episodic environment, $J(\theta)$ can be the value function of the start state $s_1$; that is, if the agent starts from state $s_1$, the value function is the expected sum of rewards from that state onwards. Therefore,

    $J_1(\theta) = V^{\pi_\theta}(s_1)$

  • If it's a continuing environment, $J(\theta)$ can be the average value function of the states. If the environment goes on and on forever, then the measure of the quality of the policy can be the summation, over states, of the probability of being in any state $s$ (given by the stationary distribution $d^{\pi_\theta}(s)$) times the value of that state, that is, the expected reward from that state onward. Therefore,

    $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) \, V^{\pi_\theta}(s)$
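Both measures can be computed exactly for a small problem. Below is a hedged sketch, assuming a hypothetical two-state Markov reward process induced by some fixed policy $\pi_\theta$; the transition matrix, rewards, and discount factor are made-up numbers for illustration:

```python
import numpy as np

# Once a policy pi_theta is fixed, the MDP reduces to a Markov reward
# process with transition matrix P_pi and expected reward vector r_pi
# (hypothetical values, not from the book).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])    # expected immediate reward in each state
gamma = 0.9                    # discount factor

# Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # V ~ [7.57, 4.86]

# Episodic measure J_1(theta): value of the start state s_1 (index 0)
J_start = V[0]

# Continuing measure J_avV(theta): stationary distribution d_pi times V.
# d_pi is the left eigenvector of P_pi associated with eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmax(np.real(evals))])
d_pi = d_pi / d_pi.sum()       # normalize so probabilities sum to 1

J_avg = d_pi @ V
print(J_start, J_avg)
```

With these definitions in hand, optimizing the policy amounts to searching for the parameter vector $\theta$ that maximizes the chosen measure $J(\theta)$.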