# Direct policy search through policy gradient

The last method we're going to discuss doesn't employ a proxy to find the optimal policy, but rather searches for it directly. In this context, we always assume a stochastic environment where the transitions are not fully deterministic. For example, a robot can assign a high probability to a transition but, to increase robustness, it must also include a noise term that can lead the transition to a different state. Therefore, we need to include a transition probability *p*(*s*_{t+1} | *s*_{t}, *a*_{t}).
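As a minimal sketch of such a stochastic transition (all names and the four-state layout are illustrative assumptions, not part of the text), the intended next state can receive most of the probability mass while a noise term spreads the remainder over the other states:

```python
import numpy as np

# Hypothetical 4-state environment: the intended transition gets high
# probability; a noise term spreads the residual mass over the others.
n_states = 4

def transition_probs(intended_state, noise=0.1):
    """Return p(s' | s, a) as a vector: 1 - noise on the intended
    next state, with noise split uniformly among the other states."""
    p = np.full(n_states, noise / (n_states - 1))
    p[intended_state] = 1.0 - noise
    return p

p = transition_probs(intended_state=2)
next_state = np.random.choice(n_states, p=p)  # sampled next state
```

Even though the robot "intends" to reach state 2, each sampled transition can land elsewhere with small probability, which is exactly why the sequence probability below must be treated as stochastic.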

Given this element, we can evaluate the overall probability of an entire sequence (which normally ends in a final state): *S*_{k} = (*s*_{1}, *s*_{2}, …, *s*_{k}). To do so, we need to define a parameterized policy *π*(*a* | *s*; *θ*); therefore, the probability of *S*_{k} can be expressed as a conditional one: *p*(*S*_{k}; *θ*). The full expression requires us to explicitly separate the state-transition component from the policy and to introduce the actions (which were implicit...