In the previous section, we learned why choosing an action based on the distribution of the return is more beneficial than choosing it based on the Q value, which gives only the expected return. In this section, we will learn how to compute the distribution of the return using an algorithm called categorical DQN.
The distribution of the return is often called the value distribution or the return distribution. Let Z be the random variable denoting the return, and let Z(s, a) denote the value distribution of a state s and an action a. We know that the Q function is represented by Q(s, a) and it gives the value (expected return) of a state-action pair. Similarly, Z(s, a) gives the value distribution (return distribution) of the state-action pair.
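To make the relationship between Z(s, a) and Q(s, a) concrete, here is a minimal sketch using made-up numbers. It assumes a discrete value distribution where the return can take only a few fixed values (categorical DQN calls these "atoms"), each with some probability; the atom values and probabilities below are purely illustrative.

```python
import numpy as np

# A hypothetical discrete value distribution Z(s, a): the return takes
# one of a few fixed values ("atoms"), each with some probability.
atoms = np.array([0.0, 5.0, 10.0])   # possible return values (illustrative)
probs = np.array([0.2, 0.5, 0.3])    # P(Z(s, a) = atom), sums to 1

# The Q value is just the expected return, i.e., the mean of Z(s, a).
q_value = np.sum(atoms * probs)
print(q_value)  # 5.5
```

Notice that the single number Q(s, a) = 5.5 hides everything about the shape of the distribution; Z(s, a) keeps the full probabilities over possible returns.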
Okay, how can we compute Z(s, a)? First, let's recollect how we compute Q(s, a).
In DQN, we learned that we use a neural network to approximate the Q function, Q(s, a). Since we use a neural network to approximate the Q function, we can represent...