# Temporal-difference learning

Q-Learning is a special case of the more general **Temporal-Difference Learning**, or **TD-Learning**. More specifically, it's a special case of one-step TD-Learning, *TD*(0):

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \quad \text{(Equation 9.5.1)}$$

In the equation, $\alpha$ is the learning rate. We should note that when $\alpha = 1$, *Equation 9.5.1* is similar to the Bellman equation. For simplicity, we'll refer to *Equation 9.5.1* as Q-Learning or generalized Q-Learning.
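The update can be sketched as a tabular function. This is a minimal illustration, not code from the text; the table shape, $\alpha$, $\gamma$, and the example numbers are assumptions chosen for demonstration:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step TD(0) Q-Learning update (Equation 9.5.1)."""
    # Off-policy: bootstrap from the greedy value max_a' Q(s', a'),
    # regardless of which action the behavior policy takes next.
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

# Illustrative tabular example (states x actions):
Q = np.zeros((3, 2))
Q[2, 1] = 1.0                       # greedy action in s' = 2 is a' = 1
q_learning_update(Q, s=0, a=0, r=1.0, s_next=2)
print(Q[0, 0])                      # alpha * (r + gamma * 1.0), close to 0.19
```

Note that with $\alpha = 1$ the update overwrites $Q(s, a)$ with the TD target $r + \gamma \max_{a'} Q(s', a')$, which is the Bellman-equation form mentioned above.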

Previously, we referred to Q-Learning as an off-policy RL algorithm, since it learns the Q value function without directly using the policy that it is trying to optimize. An example of an *on-policy* one-step TD-Learning algorithm is SARSA, which is similar to *Equation 9.5.1*:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right] \quad \text{(Equation 9.5.2)}$$

The main difference is the use of the policy that is being optimized to determine *a'*. The terms *s*, *a*, *r*, *s'*, and *a'* (thus the name SARSA) must be known to update the *Q* value function at every iteration. Both Q-Learning and SARSA use existing estimates...
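The contrast with Q-Learning can be made concrete in the same tabular sketch. Again this is an illustration under assumed values, not code from the text; the key point is that the bootstrap term uses the action *a'* actually sampled by the policy, not the greedy maximum:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One-step on-policy SARSA update (Equation 9.5.2).

    Unlike Q-Learning, the bootstrap term is Q[s', a'] for the
    action a' selected by the policy, not max over all a'.
    """
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Illustrative values: the policy sampled a' = 0 in s' = 2, even
# though a' = 1 has the higher Q value there.
Q = np.zeros((3, 2))
Q[2, 1] = 1.0
sarsa_update(Q, s=0, a=0, r=1.0, s_next=2, a_next=0)
print(Q[0, 0])   # 0.1: bootstraps from Q[2, 0] = 0, not max Q[2, :]
```

With the same inputs, the Q-Learning update would instead bootstrap from $\max_{a'} Q(s', a') = 1.0$ and produce a larger increment, which is exactly the off-policy vs on-policy distinction.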