## Reinforcement learning solution methods

In this section, we will discuss in detail some of the methods used to solve Reinforcement Learning problems: specifically, dynamic programming (DP), Monte Carlo methods, and temporal-difference (TD) learning. These methods also address the problem of delayed rewards.

### Dynamic Programming (DP)

DP is a set of algorithms used to compute optimal policies given a model of the environment, such as a Markov Decision Process (MDP). DP methods are computationally expensive and assume a perfect model of the environment; hence, they see limited direct use in practice. Conceptually, however, DP is the basis for many of the algorithms and methods used in the following sections. Its main building blocks are:

**Evaluating the policy**: A policy can be assessed by computing its value function iteratively. Computing the value function for a policy helps in finding better policies.

**Improving the policy**: Policy improvement is the process of computing a revised policy using the value function of the current policy.

**Value iteration...**
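To make policy evaluation and policy improvement concrete, here is a minimal sketch in plain Python. The three-state chain MDP, the transition table `P`, the discount `gamma`, the threshold `theta`, and all function names are hypothetical and chosen only for illustration. State 2 is treated as an absorbing terminal state, and moving right from state 1 yields a reward of 1.

```python
# Hypothetical toy MDP (an assumption for illustration only).
# P[state][action] is a list of (probability, next_state, reward) tuples.
# Actions: 0 = left, 1 = right. State 2 is absorbing with zero reward.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def action_value(V, P, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_evaluation(policy, P, gamma=0.9, theta=1e-8):
    """Iteratively compute the value function of a deterministic policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = action_value(V, P, s, policy[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update within the sweep
        if delta < theta:  # stop once a full sweep barely changes V
            return V


def policy_improvement(V, P, gamma=0.9):
    """Return the policy that acts greedily with respect to V."""
    return {
        s: max(P[s], key=lambda a: action_value(V, P, s, a, gamma))
        for s in P
    }


# Start from an arbitrary policy (always move left) and alternate
# evaluation and improvement until the policy stops changing.
policy = {s: 0 for s in P}
while True:
    V = policy_evaluation(policy, P)
    improved = policy_improvement(V, P)
    if improved == policy:
        break
    policy = improved

print(policy)
print(V)
```

Alternating the two steps until the policy no longer changes is usually called policy iteration. On this toy problem, the loop settles on moving right in states 0 and 1, since that is the only way to reach the reward.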