PPO-Penalty
The algorithm for the PPO-penalty method is given as follows:
- Initialize the policy network parameter $\theta$ and the value network parameter $\phi$, and initialize the penalty coefficient $\beta$ and the target KL divergence $\delta$
- For each iteration $i = 1, 2, \ldots$:
- Collect N trajectories by following the policy $\pi_\theta$
- Compute the return (reward-to-go) $R_t$
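- Compute the penalized objective $L(\theta) = \mathbb{E}_t\left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} A_t - \beta \, \mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right] \right]$, where $A_t$ is the advantage estimate and $\beta$ is the penalty coefficient
- Compute the gradient of the objective function, $\nabla_\theta L(\theta)$
- Update the policy network parameter using gradient ascent with learning rate $\alpha$: $\theta = \theta + \alpha \nabla_\theta L(\theta)$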
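- Adapt the penalty coefficient. Let $d = \mathbb{E}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]$ be the measured KL divergence between the old and the updated policy, and let $\delta$ be the target KL divergence. If $d$ is greater than or equal to $1.5\delta$, then we set $\beta_{i+1} = 2\beta_i$; if $d$ is less than or equal to $\delta / 1.5$, then we set $\beta_{i+1} = \beta_i / 2$; otherwise $\beta$ is left unchanged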
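- Compute the mean squared error of the value network: $J(\phi) = \mathbb{E}_t\left[(R_t - V_\phi(s_t))^2\right]$
- Compute the gradient of the value network loss, $\nabla_\phi J(\phi)$
- Update the value network parameter using gradient descent with learning rate $\alpha$: $\phi = \phi - \alpha \nabla_\phi J(\phi)$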
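
The non-gradient parts of the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`reward_to_go`, `categorical_kl`, `penalty_objective`, `update_beta`, `value_mse`) are hypothetical, the policy is assumed to output a discrete probability list per state, the discount factor `gamma` is an added parameter, and the factors 1.5 and 2 in `update_beta` are the commonly used heuristic values. The gradient ascent and descent steps are omitted; in practice an automatic differentiation library would handle them.

```python
import math


def reward_to_go(rewards, gamma=1.0):
    """Return R_t = sum_{k >= t} gamma^(k - t) * r_k for one trajectory.

    gamma is an assumption of this sketch; gamma=1.0 gives the undiscounted sum.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def categorical_kl(p, q):
    """KL[p, q] for two discrete distributions given as probability lists.

    Assumes q has non-zero probability wherever p does.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalty_objective(old_probs, new_probs, actions, advantages, beta):
    """Sample estimate of E_t[ratio_t * A_t - beta * KL_t].

    old_probs / new_probs: per-step action distributions under the old and
    the current policy. Returns (objective, mean_kl); mean_kl plays the role of d.
    """
    n = len(actions)
    surrogate = 0.0
    mean_kl = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        ratio = p_new[a] / p_old[a]  # pi_theta(a|s) / pi_theta_old(a|s)
        surrogate += ratio * adv
        mean_kl += categorical_kl(p_old, p_new)
    surrogate /= n
    mean_kl /= n
    return surrogate - beta * mean_kl, mean_kl


def update_beta(beta, d, delta):
    """Adapt the penalty coefficient from the measured KL d and the target delta."""
    if d >= 1.5 * delta:
        return 2.0 * beta  # policy moved too far: penalize more
    if d <= delta / 1.5:
        return beta / 2.0  # policy barely moved: penalize less
    return beta


def value_mse(returns, values):
    """Mean squared error between returns R_t and value predictions V(s_t)."""
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / len(returns)
```

A typical iteration would call `penalty_objective` on the collected batch, take a gradient step on the policy, then pass the returned mean KL to `update_beta` so that the next iteration uses the adjusted coefficient. Keeping the coefficient update as a separate pure function makes the doubling and halving rule easy to test in isolation.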