# Policy iteration

In this section, we are going to analyze a strategy to find an optimal policy based on complete knowledge of the environment (in terms of transition probabilities and expected returns). The first step is to define a method that can be employed to build a greedy policy. Let's suppose we're working with a finite MDP and a generic policy π; we can define the intrinsic value of a state *s*_{t} as the expected discounted return obtained by the agent starting from *s*_{t} and following the stochastic policy π:

$$V^{\pi}(s_t) = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\Big|\; s_t\right]$$

In this case, we are assuming that, as the agent will follow π, the state *s*_{a} is more useful than *s*_{b} if the expected return starting from *s*_{a} is greater than the one obtained starting from *s*_{b}. Unfortunately, trying to directly find the value of each state using the previous definition is almost impossible in practice, because the expectation involves an infinite sum over all the trajectories generated by the stochastic policy and the transition dynamics. However, this is a problem that can be solved using dynamic programming (for further information, please refer to R. A. Howard, *Dynamic Programming and Markov Processes*, MIT Press, 1960).
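To make the procedure concrete, here is a minimal Python sketch of tabular policy iteration under the assumptions above (full knowledge of the dynamics). The transition tensor `P[s, a, s']`, reward tensor `R[s, a, s']`, and the function names are illustrative choices, not code from the text: policy evaluation repeatedly applies the Bellman expectation backup until the values converge, and policy improvement then acts greedily with respect to those values.

```python
import numpy as np

# Hypothetical tabular MDP with nS states and nA actions (illustrative only).
# P[s, a, s'] = transition probability, R[s, a, s'] = expected immediate reward.
nS, nA, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)  # each P[s, a, :] is a valid distribution
R = rng.random((nS, nA, nS))

def evaluate_policy(policy, tol=1e-8):
    """Compute V^pi for a deterministic policy by iterating the Bellman backup."""
    V = np.zeros(nS)
    while True:
        # V(s) <- sum_{s'} P(s'|s, pi(s)) * (R(s, pi(s), s') + gamma * V(s'))
        V_new = np.array([P[s, policy[s]] @ (R[s, policy[s]] + gamma * V)
                          for s in range(nS)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(nS, dtype=int)
    while True:
        V = evaluate_policy(policy)
        # One-step lookahead: Q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma*V(s'))
        Q = np.einsum('sat,sat->sa', P, R + gamma * V)
        new_policy = np.argmax(Q, axis=1)  # greedy policy with respect to V
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

policy, V = policy_iteration()
print("Optimal policy:", policy)
print("State values:", V)
```

In a finite MDP the number of deterministic policies is finite and each improvement step can only increase the values, so this alternation is guaranteed to terminate with a stable (optimal) policy.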