As per the policy gradient theorem, for the previously specified policy objective functions and any differentiable policy $\pi_\theta$, the policy gradient is as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$$
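This expectation can be checked numerically on a tiny example. The sketch below uses a hypothetical one-parameter Bernoulli policy (probability of action 1 given by a sigmoid of $\theta$) and hypothetical per-action rewards, and compares the score-function (policy gradient) Monte Carlo estimate against the analytically computed gradient:

```python
import math
import random

random.seed(0)

theta = 0.3
p = 1.0 / (1.0 + math.exp(-theta))   # π_θ(a=1) = sigmoid(θ); hypothetical 1-parameter policy
f = {0: 1.0, 1: 3.0}                 # hypothetical reward for each action

# Analytic gradient of E[f(a)] w.r.t. θ:
# d/dθ [p f(1) + (1-p) f(0)] = p (1-p) (f(1) - f(0))
exact = p * (1.0 - p) * (f[1] - f[0])

# Policy gradient (score-function) estimate: E[f(a) ∇_θ log π_θ(a)],
# where ∇_θ log π_θ(a) = a - p for this sigmoid parameterization.
n = 200_000
total = 0.0
for _ in range(n):
    a = 1 if random.random() < p else 0
    total += f[a] * (a - p)
estimate = total / n
```

With enough samples the Monte Carlo estimate agrees with the analytic gradient, illustrating why sampled trajectories suffice to follow the gradient without differentiating through the environment.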
The steps to update the parameters using the Monte Carlo policy gradient approach are shown in the following section.
In the Monte Carlo policy gradient approach, we update the parameters by the stochastic gradient ascent method, using the update as per the policy gradient theorem and the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$:

$$\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$$

Here, $v_t$ is the cumulative reward from that time-step onward.
The Monte Carlo policy gradient approach is as follows:
    Initialize $\theta$ arbitrarily
    for each episode $\{s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta$ do
        for step t = 1 to T-1 do
            $\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$
        end for
    end for
    Output: final $\theta$
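The algorithm above can be sketched end to end on a toy problem. The 3-state corridor MDP, the tabular softmax policy, and all hyperparameters (learning rate, episode count) below are hypothetical choices for illustration, not part of the original text; the update line inside `reinforce` is the Monte Carlo policy gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state corridor MDP used only for illustration:
# states 0, 1, 2; action 1 moves right, action 0 moves left (state 0 is a wall);
# every step costs -1 until the goal state 2 is reached.
N_STATES, N_ACTIONS, GOAL = 3, 2, 2

def policy(theta, s):
    """Softmax policy π_θ(a | s) from a table of preferences theta[s, a]."""
    prefs = theta[s] - theta[s].max()          # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def run_episode(theta, max_steps=50):
    """Sample one episode of (s_t, a_t, r_{t+1}) tuples under the current policy."""
    s, traj = 0, []
    for _ in range(max_steps):
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
        r = 0.0 if s_next == GOAL else -1.0
        traj.append((s, a, r))
        s = s_next
        if s == GOAL:
            break
    return traj

def reinforce(theta, episodes=2000, alpha=0.1):
    """Monte Carlo policy gradient: θ ← θ + α ∇_θ log π_θ(s_t, a_t) v_t."""
    for _ in range(episodes):
        traj = run_episode(theta)
        rewards = [r for _, _, r in traj]
        for t, (s, a, _) in enumerate(traj):
            v_t = sum(rewards[t:])             # cumulative reward from step t onward
            grad_log = -policy(theta, s)       # ∇_θ log softmax = one_hot(a) - π(·|s)
            grad_log[a] += 1.0
            theta[s] += alpha * v_t * grad_log
    return theta

theta = reinforce(np.zeros((N_STATES, N_ACTIONS)))
```

After training, the learned policy should assign high probability to moving right in both non-terminal states, since shorter paths to the goal accumulate less negative return and are therefore reinforced.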