The complete example is in the Chapter05/02_frozenlake_q_learning.py file, and the differences from the previous version are minor. The most obvious change is to our value table. In the previous example, we kept the value of the state, so the key in the dictionary was just a state. Now we need to store values of the Q-function, which has two parameters, state and action, so the key in the value table is now a composite of the two.
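To make the change of key concrete, here is a minimal sketch of the two kinds of table. It is only an illustration, not the code from the example file: the names state_values, q_values, and best_action are hypothetical, and the use of a defaultdict with a (state, action) tuple as the key is an assumption about how such a table could be laid out.

```python
import collections

# Hypothetical state-value table: the key is just a state.
state_values = collections.defaultdict(float)
state_values[0] = 0.5

# Hypothetical Q-value table: the key is a (state, action) tuple.
q_values = collections.defaultdict(float)
q_values[(0, 1)] = 0.5


def best_action(q_values, state, n_actions):
    """Return the action with the largest stored Q-value in `state`."""
    return max(range(n_actions), key=lambda a: q_values[(state, a)])
```

With action values stored this way, choosing an action becomes a lookup over the actions available in a state, with no extra calculation needed.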
The second difference concerns our calc_action_value function: we no longer need it, as our action values are now stored directly in the value table. Finally, the most important change in the code is in the agent's value_iteration method. Before, it was just a wrapper around the calc_action_value call, which did the job of the Bellman approximation. Now that this function is gone, replaced by the value table, we need to perform this approximation in the value_iteration method itself.
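As a rough sketch of what performing the Bellman approximation over a Q-value table can look like, consider the standalone function below. It is written under stated assumptions and is not the method from the example file: the function name q_iteration_sweep, the transits and rewards tables (transition counts and observed rewards), and the GAMMA value of 0.9 are all hypothetical choices made for this illustration.

```python
import collections

GAMMA = 0.9  # assumed discount factor for this sketch


def q_iteration_sweep(values, transits, rewards, n_states, n_actions,
                      gamma=GAMMA):
    """One sweep of the Bellman update over every (state, action) pair."""
    for state in range(n_states):
        for action in range(n_actions):
            target_counts = transits[(state, action)]
            total = sum(target_counts.values())
            if total == 0:
                continue  # no experience for this pair yet
            action_value = 0.0
            for tgt_state, count in target_counts.items():
                reward = rewards[(state, action, tgt_state)]
                # Value of the best action available in the target state
                best_next = max(values[(tgt_state, a)]
                                for a in range(n_actions))
                # Weight by the estimated transition probability
                action_value += (count / total) * (reward + gamma * best_next)
            values[(state, action)] = action_value


# Tiny hypothetical MDP: action 0 in state 0 always leads to state 1
# and yields a reward of 1.0.
transits = collections.defaultdict(collections.Counter)
rewards = collections.defaultdict(float)
values = collections.defaultdict(float)
transits[(0, 0)][1] = 10
rewards[(0, 0, 1)] = 1.0
q_iteration_sweep(values, transits, rewards, n_states=2, n_actions=2)
```

The inner loop takes the place of the separate action-value calculation: each Q-value is rebuilt from the immediate reward plus the discounted value of the best action in the target state, weighted by how often that transition was observed.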
Let's look at the code. As it's almost the same, I'll jump directly to the most interesting value_iteration...