# The Q value

An important question is: if the RL problem is to find $\pi^*$, how does the agent learn by interacting with the environment? *Equation 9.1.3* does not explicitly indicate the action to try and the succeeding state to compute the return. In RL, we find that it is easier to learn by using the *Q* value:

$$\pi^* = \underset{a}{\mathrm{argmax}}\ Q(s, a) \quad \text{(Equation 9.2.1)}$$

Where:

$$V^*(s) = \max_{a} Q(s, a) \quad \text{(Equation 9.2.2)}$$

In other words, instead of finding the policy that maximizes the value for all states, *Equation 9.2.1* looks for the action that maximizes the quality (*Q*) value for all states. After finding the *Q* value function, $V^*$ and hence $\pi^*$ are determined by *Equation 9.2.2* and *Equation 9.1.3* respectively.
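As a minimal sketch of this relationship, assuming the *Q* value function has already been learned and is stored in a hypothetical NumPy array `q_table` indexed by state and action, the greedy action and the optimal state value follow directly from *Equation 9.2.1* and *Equation 9.2.2*:

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
# The values are assumed to have been learned already.
q_table = np.array([[0.0, 1.2, 0.3],    # Q(s=0, a) for actions 0, 1, 2
                    [0.5, 0.1, 2.0]])   # Q(s=1, a)

state = 1
best_action = np.argmax(q_table[state])  # Equation 9.2.1: pi*(s) = argmax_a Q(s, a)
state_value = np.max(q_table[state])     # Equation 9.2.2: V*(s) = max_a Q(s, a)
print(best_action, state_value)          # 2 2.0
```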

If, for every action, the reward and the next state can be observed, we can formulate the following iterative, trial-and-error algorithm to learn the *Q* value:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a') \quad \text{(Equation 9.2.3)}$$

For notational simplicity, *s'* and *a'* are the next state and action respectively. *Equation 9.2.3* is known as the **Bellman Equation**, which is the core...
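As a rough sketch of how *Equation 9.2.3* can drive learning, the following assumes a hypothetical deterministic environment with a `step(state, action)` function that returns the observed reward and next state; after every interaction, the Q-table entry is simply replaced by the right-hand side of the Bellman Equation:

```python
import numpy as np

def bellman_update(q_table, state, action, reward, next_state, gamma=0.9):
    """One trial-and-error update of Equation 9.2.3:
    Q(s, a) = r + gamma * max_a' Q(s', a')"""
    q_table[state, action] = reward + gamma * np.max(q_table[next_state])

# Hypothetical 2-state, 2-action deterministic environment:
# every action leads to state 1; action 1 taken in state 1 earns reward 1.
def step(state, action):
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    next_state = 1
    return reward, next_state

q_table = np.zeros((2, 2))
state = 0
for _ in range(100):
    action = np.random.randint(2)  # explore by trying random actions
    reward, next_state = step(state, action)
    bellman_update(q_table, state, action, reward, next_state)
    state = next_state

print(q_table)  # Q(1, 1) converges toward 1 / (1 - gamma) = 10
```

Because this toy environment is deterministic, each update can overwrite the table entry with the Bellman target outright; a stochastic environment would instead call for averaging the targets, for example via a learning rate.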