We started the chapter by understanding the Bellman equations of the value and Q functions. We learned that, according to the Bellman equation, the value of a state is the sum of the immediate reward and the discounted value of the next state, and the value of a state-action pair is the sum of the immediate reward and the discounted value of the next state-action pair. Then we learned about the Bellman optimality equations for the value function and the Q function, which give the maximum value.
Moving forward, we learned about the relation between the value and Q functions. We learned that the value function can be extracted from the Q function as V*(s) = max_a Q*(s, a), and then we learned that the Q function can be extracted from the value function as Q*(s, a) = Σ_s' P(s'|s, a)[R(s, a, s') + γV*(s')].
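The two extractions above are easy to see in code. The following is a minimal sketch on a tiny hypothetical two-state, two-action MDP (all transition probabilities, rewards, and the starting value function are illustrative numbers, not from the text): one Bellman backup turns a value function into a Q function, and a max over actions turns the Q function back into a value function.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] is the transition probability, R[s, a] the immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

V = np.array([10.0, 5.0])   # some value function (arbitrary for illustration)

# Q from V: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
Q = R + gamma * P @ V

# V from Q: V(s) = max_a Q(s, a)
V_new = Q.max(axis=1)

print(Q)
print(V_new)
```

Note that `P @ V` exploits NumPy's stacked-matrix rules: for each state-action pair it sums `P(s'|s, a) * V(s')` over next states in a single expression.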
Later we learned about two interesting methods called value iteration and policy iteration, which use dynamic programming to find the optimal policy.
In the value iteration method, first, we compute the optimal value function by...
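The value iteration loop can be sketched as follows. This is a minimal illustration on a tiny hypothetical two-state, two-action MDP (all numbers are made up for the example): we repeatedly apply the Bellman optimality backup until the value function stops changing, then extract the optimal policy greedily from the resulting Q values.

```python
import numpy as np

# Tiny hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] is the transition probability, R[s, a] the immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
# V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V                # one-step lookahead Q values
    V_next = Q.max(axis=1)               # greedy backup
    if np.max(np.abs(V_next - V)) < 1e-10:
        V = V_next
        break                            # converged to the optimal values
    V = V_next

# Extract the optimal policy greedily from the optimal Q values
policy = (R + gamma * P @ V).argmax(axis=1)
print(V, policy)
```

Because the backup is a contraction for gamma < 1, the loop converges to the optimal value function regardless of the starting values.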