In the simple example we just saw, we calculated the values of states and actions by exploiting the structure of the environment: there were no loops in the transitions, so we could start from the terminal states, calculate their values, and then work back to the central state. However, just one loop in the environment creates an obstacle for this approach. Let's consider such an environment with two states:

We start from state s₁, and the only action we can take leads us to state s₂. We get reward **r=1**, and the only transition from s₂ is an action that brings us back to s₁. So, the life of our agent is an infinite sequence of states [s₁, s₂, s₁, s₂, ...]. To deal with this infinite loop, we can use a discount factor γ. Now, the question is, what are the values of both states?

The answer is not very complicated, though. Every transition from s₁ to s₂ gives us a reward of 1, and every transition back gives us 2. So, our sequence of...
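
As a rough illustration of how the discount factor tames the infinite loop, the sketch below computes the value of each state in two ways: by summing the discounted rewards over many steps, and by solving the pair of equations V(s₁) = 1 + γV(s₂) and V(s₂) = 2 + γV(s₁) directly. The state labels, the function names, and the value `GAMMA = 0.9` are assumptions made for this example only; any discount factor below 1 would work the same way.

```python
# Assumed discount factor for illustration; any value in [0, 1) converges.
GAMMA = 0.9


def discounted_return(rewards, gamma, steps):
    """Sum gamma**t * reward_t over `steps` steps, cycling through `rewards`."""
    total = 0.0
    discount = 1.0
    for t in range(steps):
        total += discount * rewards[t % len(rewards)]
        discount *= gamma
    return total


def closed_form_values(r_forward, r_back, gamma):
    """Solve V1 = r_forward + gamma * V2 and V2 = r_back + gamma * V1."""
    denom = 1.0 - gamma ** 2
    v1 = (r_forward + gamma * r_back) / denom
    v2 = (r_back + gamma * r_forward) / denom
    return v1, v2


# Starting in the first state, the rewards alternate 1, 2, 1, 2, ...
v1_approx = discounted_return([1.0, 2.0], GAMMA, steps=500)
# Starting in the second state, they alternate 2, 1, 2, 1, ...
v2_approx = discounted_return([2.0, 1.0], GAMMA, steps=500)

v1_exact, v2_exact = closed_form_values(1.0, 2.0, GAMMA)

print(f"V(s1): truncated sum {v1_approx:.4f}, closed form {v1_exact:.4f}")
print(f"V(s2): truncated sum {v2_approx:.4f}, closed form {v2_exact:.4f}")
```

With the assumed γ of 0.9, both methods agree on roughly 14.74 for the first state and 15.26 for the second. The truncated sum converges because each extra step is weighted by a further factor of γ, so distant rewards contribute almost nothing.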