Definition
Although TRPO achieves monotonic improvement, it is complicated to implement and expensive to compute because of its KL constraint. PPO removes the KL constraint from TRPO's optimization problem by introducing a clipped surrogate objective, which removes the incentive for policy updates that move too far from the old policy.

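PPO applies clipping to the policy ratio $r_t(\theta)$. The clipped surrogate objective is defined as

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the policy ratio, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is a hyperparameter that determines the clipping range.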
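$L^{CLIP}$ satisfies $L^{CLIP}(\theta) \le L^{CPI}(\theta)$, where $L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t\right]$ is the unclipped surrogate, so we can still use the MM algorithm.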
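
A minimal sketch of the clipped surrogate in plain Python may make the clipping concrete. It assumes the policy ratios and advantage estimates have already been computed; the function names, the sample batch, and the default `epsilon=0.2` are illustrative assumptions rather than part of the definition. The unclipped average is included only for comparison.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        unclipped = r * adv
        clipped = clip(r, 1.0 - epsilon, 1.0 + epsilon) * adv
        # Taking the minimum keeps the more pessimistic of the two values.
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)


def unclipped_surrogate(ratios, advantages):
    """Mean over samples of r * A (the surrogate without clipping)."""
    return sum(r * adv for r, adv in zip(ratios, advantages)) / len(ratios)


# Hypothetical batch: ratios pi_theta / pi_theta_old and advantage estimates.
ratios = [0.5, 1.0, 1.5, 1.5]
advantages = [1.0, 2.0, 1.0, -1.0]

l_clip = clipped_surrogate(ratios, advantages)
l_cpi = unclipped_surrogate(ratios, advantages)
```

On this batch the clipped value comes out at about 0.55 and the unclipped one at 0.625. Only the third sample is actually clipped: its ratio of 1.5 with a positive advantage is capped at 1.2, so pushing the ratio further earns nothing. The fourth sample has the same ratio but a negative advantage, and the minimum keeps the unclipped, more pessimistic term.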
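To minimize the value function loss at the same time and to encourage exploration, PPO adds two more terms to the objective function:

$$L^{CLIP+VF+S}_t(\theta) = \hat{\mathbb{E}}_t\left[L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_\theta](s_t)\right]$$

where
- $L^{CLIP}_t(\theta)$ is the clipped surrogate objective
- $L^{VF}_t(\theta) = \left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2$ is the value function loss
- $S[\pi_\theta](s_t)$ is the entropy bonus term
- $c_1$ and $c_2$ are hyperparameters that control the weights of the value function loss and the entropy bonus.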
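
The combined objective can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the clipped surrogate is passed in as a precomputed number, the policy is assumed to be categorical so that entropy is a finite sum, and the helper names and the default weights `c1=0.5` and `c2=0.01` are hypothetical choices, not values fixed by the text.

```python
import math


def value_loss(values, returns):
    """Mean squared error between value predictions and their targets."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)


def entropy(probs):
    """Shannon entropy of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_objective(l_clip, values, returns, action_probs, c1=0.5, c2=0.01):
    """Clipped surrogate minus weighted value loss plus weighted entropy.

    The result is a quantity to maximize; negate it for a minimizer.
    """
    l_vf = value_loss(values, returns)
    mean_entropy = sum(entropy(p) for p in action_probs) / len(action_probs)
    return l_clip - c1 * l_vf + c2 * mean_entropy


# Hypothetical two-step batch with two discrete actions per state.
objective = ppo_objective(
    l_clip=0.55,
    values=[1.0, 0.0],
    returns=[2.0, 0.0],
    action_probs=[[0.5, 0.5], [1.0, 0.0]],
)
```

The signs follow the roles of the terms: the value function loss is subtracted because it should shrink while the objective grows, and the entropy is added so that more uniform (more exploratory) action distributions score higher.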