The complete example is in Chapter05/01_frozenlake_v_learning.py. The central data structures in this example are as follows:
Reward table: A dictionary with the composite key "source state" + "action" + "target state". The value is the immediate reward obtained for that transition.
Transitions table: A dictionary keeping counters of the experienced transitions. The key is the composite "state" + "action" and the value is another dictionary that maps the target state into a count of times that we've seen it. For example, suppose that in state 0 we execute action 1 ten times; three of those times it leads us to state 4 and seven times to state 5. The entry with the key (0, 1) in this table will then be the dict {4: 3, 5: 7}. We use this table to estimate the probabilities of our transitions.
Value table: A dictionary that maps a state into the calculated value of this state.
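The three tables described above can be sketched as follows. This is a minimal illustration rather than the code from the example file: the names rewards, transits, values, record, and transition_probs are assumptions chosen for this sketch, and the reward of 0.0 is a placeholder. It reproduces the (0, 1) entry from the text and shows one way the counters could be turned into probability estimates.

```python
import collections

# Reward table: (source state, action, target state) -> immediate reward
rewards = collections.defaultdict(float)
# Transitions table: (state, action) -> Counter mapping target state to count
transits = collections.defaultdict(collections.Counter)
# Value table: state -> calculated value of the state
values = collections.defaultdict(float)


def record(state, action, reward, new_state):
    """Store one experienced transition in the reward and transitions tables."""
    rewards[(state, action, new_state)] = reward
    transits[(state, action)][new_state] += 1


def transition_probs(state, action):
    """Estimate transition probabilities as count / total count."""
    counts = transits[(state, action)]
    total = sum(counts.values())
    if total == 0:
        return {}  # this (state, action) pair has never been seen
    return {target: count / total for target, count in counts.items()}


# Action 1 in state 0, executed ten times: three times we land in
# state 4 and seven times in state 5 (reward 0.0 is a placeholder).
for _ in range(3):
    record(0, 1, 0.0, 4)
for _ in range(7):
    record(0, 1, 0.0, 5)

entry = dict(transits[(0, 1)])   # {4: 3, 5: 7}
probs = transition_probs(0, 1)   # {4: 0.3, 5: 0.7}
```

Using a Counter as the inner dictionary means that a target state seen for the first time starts at zero, so the update is a single increment with no existence check.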
The overall logic of our code is simple: in the loop, we play 100 random steps from the environment, populating the reward...