Definition

Although TRPO achieves monotonic improvement, it is complicated to implement and expensive to compute because of its KL constraint. PPO removes the KL constraint from TRPO's optimization problem by introducing a clipped surrogate objective, which penalizes policy updates that move too far from the old policy.

PPO applies clipping to the policy ratio $r(θ) = \frac{π_{θ}(a ∣ s)}{π_{θ_{old}}(a ∣ s)}$. The clipped surrogate objective is defined as $L^{CLIP}(θ) = E[\min(r(θ) A_{θ_{old}}(s, a), \text{clip}(r(θ), 1 - ϵ, 1 + ϵ) A_{θ_{old}}(s, a))]$, where $A_{θ_{old}}(s, a)$ is the advantage estimated under the old policy and $ϵ$ is a hyperparameter that determines the clipping range.
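The clipped objective can be sketched in a few lines of NumPy. This is a minimal illustration, not a full PPO implementation: the function name `clipped_surrogate`, the default `eps=0.2`, and the use of a batch mean in place of the expectation are all assumptions made for the example.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Batch estimate of L^CLIP: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- pi_theta(a|s) / pi_theta_old(a|s) for each sample
    advantage -- advantage estimate under the old policy for each sample
    eps       -- clipping range (0.2 is an assumed default, not from the note)
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The elementwise minimum keeps the more pessimistic of the two terms.
    return np.minimum(unclipped, clipped).mean()

# A ratio of 1.5 with positive advantage is capped at 1.2;
# a ratio of 0.5 with negative advantage is evaluated at 0.8.
print(clipped_surrogate([1.5, 0.5], [1.0, -1.0], eps=0.2))
```

In both samples the policy has already moved outside $[1 - ϵ, 1 + ϵ]$ in the direction the advantage favors, so the clipped term is selected and moving the ratio further yields no additional gain.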

Because $L^{CLIP}(θ) \leq L^{TRPO}(θ)$, the clipped objective is a lower bound on the TRPO surrogate, so the MM (minorize-maximization) algorithm can still be used.
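The inequality can be checked pointwise. The short derivation below assumes that $L^{TRPO}(θ)$ denotes the unclipped surrogate $E[r(θ) A_{θ_{old}}(s, a)]$, which this note does not spell out:

$$
\begin{aligned}
\min\big(r(θ) A_{θ_{old}}(s, a),\ \text{clip}(r(θ), 1 - ϵ, 1 + ϵ) A_{θ_{old}}(s, a)\big) &\leq r(θ) A_{θ_{old}}(s, a) \quad \text{for every } (s, a), \\
L^{CLIP}(θ) = E\big[\min(\cdot,\ \cdot)\big] &\leq E\big[r(θ) A_{θ_{old}}(s, a)\big] = L^{TRPO}(θ).
\end{aligned}
$$

The first line holds because a minimum never exceeds either of its arguments; the second follows by taking expectations on both sides.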

To minimize the value function loss at the same time and to encourage exploration, PPO adds two terms to the objective function: $L^{CLIP+VF+S}(θ) = L^{CLIP}(θ) - c_{1} L^{VF}(θ) + c_{2} S[π_{θ}](s)$, where

$L^{CLIP}(θ)$ is the clipped surrogate objective,
$L^{VF}(θ) = (V_{target} - V_{θ}(s))^{2}$ is the value function loss,
$S[π_{θ}](s) = -\sum_{a} π_{θ}(a ∣ s) \ln π_{θ}(a ∣ s)$ is the entropy bonus,
$c_{1}, c_{2}$ are hyperparameters that control the weights of the value loss and the entropy bonus.
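The combined objective can be sketched the same way. This is an illustrative example under stated assumptions: the function name `ppo_objective`, the defaults `eps=0.2`, `c1=0.5` and `c2=0.01`, and the use of batch means for the expectation are hypothetical choices, and `probs` is assumed to hold a discrete action distribution per state.

```python
import numpy as np

def ppo_objective(ratio, advantage, value_pred, value_target, probs,
                  eps=0.2, c1=0.5, c2=0.01):
    """Batch estimate of L^{CLIP+VF+S} = L^CLIP - c1 * L^VF + c2 * S.

    probs has shape (batch, n_actions) and holds pi_theta(a|s) per state.
    The defaults for eps, c1 and c2 are assumed values, not from the note.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    value_pred = np.asarray(value_pred, dtype=float)
    value_target = np.asarray(value_target, dtype=float)
    probs = np.asarray(probs, dtype=float)

    # Clipped surrogate: mean of min(r * A, clip(r) * A).
    l_clip = np.minimum(
        ratio * advantage,
        np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage,
    ).mean()

    # Value function loss: squared error between target and prediction.
    l_vf = ((value_target - value_pred) ** 2).mean()

    # Entropy bonus: -sum_a pi(a|s) ln pi(a|s), averaged over states.
    # Probabilities are clipped inside the log to avoid log(0).
    log_probs = np.log(np.clip(probs, 1e-12, 1.0))
    entropy = -(probs * log_probs).sum(axis=-1).mean()

    # This quantity is maximized; a minimizer would be given its negative.
    return l_clip - c1 * l_vf + c2 * entropy

objective = ppo_objective(
    ratio=[1.0], advantage=[1.0],
    value_pred=[0.0], value_target=[1.0],
    probs=[[0.5, 0.5]],
)
print(objective)
```

Because the value loss enters with a minus sign and the entropy with a plus sign, maximizing this objective reduces the value error and rewards more uniform action distributions at the same time.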
