Definition

The performance of DDPG does not improve monotonically. TRPO addresses this problem by adopting two concepts: the Minorize-Maximization Algorithm and a trust region. TRPO updates the policy parameter within a trust region in policy space, which guarantees monotonic improvement of the objective function (the expected return). Although TRPO achieves high performance, its implementation is complex, which limits its practicality.

TRPO starts from the maximization problem of the expected return

$$\max_\theta \; J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right].$$

The original problem is approximated several times to obtain a surrogate function that can be solved with the MM Algorithm. A surrogate satisfying the trust region constraint is found using the conservative policy iteration update

$$L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) = J(\pi_{\theta_{\text{old}}}) + \sum_s \rho_{\pi_{\theta_{\text{old}}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\pi_{\theta_{\text{old}}}}(s, a),$$

where $L_{\pi_{\theta_{\text{old}}}}$ is a local approximation of $J(\pi_\theta)$, and $\rho_{\pi_{\theta_{\text{old}}}}$ is the State Visitation Frequency.
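
In practice the surrogate is estimated by averaging importance-weighted advantages over collected state-action pairs. The following is a minimal sketch of that estimate, assuming per-sample log-probabilities under the new and old policies and already-estimated advantages; the function name, array names, and toy numbers are illustrative, not from the source:

```python
import numpy as np

def surrogate_objective(new_logp, old_logp, advantages):
    """Sample-based estimate of the surrogate objective.

    new_logp, old_logp: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t)
    for the actions actually taken; advantages: estimated A(s_t, a_t).
    """
    # Importance-sampling ratio pi_theta / pi_theta_old for each sample.
    ratio = np.exp(new_logp - old_logp)
    # Averaging the ratio-weighted advantages approximates the surrogate L.
    return np.mean(ratio * advantages)

# Toy usage with random numbers standing in for real rollout data.
rng = np.random.default_rng(0)
old_logp = rng.normal(-1.0, 0.1, size=128)
new_logp = old_logp + rng.normal(0.0, 0.01, size=128)
adv = rng.normal(0.0, 1.0, size=128)
print(surrogate_objective(new_logp, old_logp, adv))
```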

The KL-penalized optimization problem is transformed into a constrained optimization problem by Duality.
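
For reference, using the TRPO paper's notation (which this section does not define explicitly), the two forms are roughly

$$\max_\theta \; \big[L_{\theta_{\text{old}}}(\theta) - C\, D_{\mathrm{KL}}(\theta_{\text{old}}, \theta)\big] \quad\longleftrightarrow\quad \max_\theta \; L_{\theta_{\text{old}}}(\theta) \;\;\text{s.t.}\;\; D_{\mathrm{KL}}(\theta_{\text{old}}, \theta) \le \delta,$$

where $C$ is the penalty coefficient and $\delta$ is the trust-region radius.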

After these approximations, the final optimization problem is

$$\max_\theta \; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A_{\theta_{\text{old}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\right] \le \delta.$$

The optimization problem is solved with the natural policy gradient, and the monotonic improvement is checked with Backtracking Line Search:

$$\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\, F^{-1} g,$$

where $g$ is the policy gradient, $F$ is the Fisher Information Metric, $\delta$ is the constraint threshold, $\alpha \in (0, 1)$ is the backtracking coefficient, and $j$ is the smallest nonnegative integer for which the constraint is satisfied.
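
The quantity $F^{-1} g$ is typically obtained by solving $Fx = g$ with the conjugate gradient method using Fisher-vector products, rather than by inverting $F$; this is standard practice, not something stated in this section. A minimal sketch, where the Fisher-vector-product function, the toy matrix, and the numbers are all assumptions for illustration:

```python
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at zero)
    p = g.copy()          # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# Toy example: the Fisher matrix is known, so F v is a plain matrix product.
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
delta = 0.01  # KL constraint threshold

x = conjugate_gradient(lambda v: F @ v, g)      # x ~= F^{-1} g
step = np.sqrt(2 * delta / (x @ F @ x)) * x     # proposed natural-gradient step
print(step)
```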

TRPO is performed by repeating 3 steps in each iteration

  1. Collect trajectories on the current policy $\pi_{\theta_k}$, and update the $Q$-values.
  2. By averaging over samples, construct the estimated objective and constraint (a sketch of the constraint estimate follows this list).
  3. Approximately solve the constrained optimization problem with the natural policy gradient and Backtracking Line Search to update the policy parameter $\theta$.
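
The constraint estimate in step 2 is just the KL divergence averaged over the sampled states. A minimal sketch for a discrete-action policy, assuming the old and new action probabilities at each sampled state are available as arrays (the function name and the toy data are illustrative):

```python
import numpy as np

def mean_kl_categorical(old_probs, new_probs, eps=1e-8):
    """Average KL( pi_old(.|s) || pi_new(.|s) ) over a batch of visited states.

    old_probs, new_probs: arrays of shape (batch, n_actions) holding the
    action distributions of the old and new policy at each sampled state.
    """
    log_ratio = np.log(old_probs + eps) - np.log(new_probs + eps)
    kl_per_state = np.sum(old_probs * log_ratio, axis=1)
    return np.mean(kl_per_state)

# Toy usage: two slightly different action distributions over 3 actions.
old_p = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])
new_p = np.array([[0.25, 0.45, 0.3], [0.55, 0.35, 0.1]])
print(mean_kl_categorical(old_p, new_p))
```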

Algorithm

  1. Initialize policy parameters $\theta_0$.
  2. For $k = 0, 1, 2, \ldots$:
    1. Collect a set of trajectories under the current policy $\pi_{\theta_k}$.
    2. Compute advantages using an advantage estimation algorithm.
    3. Compute the policy gradient $\hat{g}_k$ and Fisher Information Metric $\hat{F}_k$.
    4. Compute the natural gradient $\hat{x}_k = \hat{F}_k^{-1} \hat{g}_k$.
    5. Estimate the proposed step $\Delta_k = \sqrt{\dfrac{2\delta}{\hat{x}_k^\top \hat{F}_k \hat{x}_k}}\, \hat{x}_k$.
    6. Perform Backtracking Line Search to obtain the final update: find the minimum $j \in \{0, 1, 2, \ldots\}$ such that $\theta_k + \alpha^j \Delta_k$ satisfies the KL constraint (and improves the surrogate objective).
    7. Update the policy: $\theta_{k+1} = \theta_k + \alpha^j \Delta_k$.
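
A minimal sketch of steps 6-7, assuming a helper `surrogate_and_kl(theta)` that returns the batch estimates of the surrogate objective and the mean KL to the old policy; the helper, parameter names, and toy numbers are hypothetical stand-ins, not part of the original text:

```python
import numpy as np

def line_search(theta, full_step, surrogate_and_kl, delta,
                alpha=0.8, max_backtracks=10):
    """Backtracking line search for the policy update.

    surrogate_and_kl(theta) must return (surrogate value, mean KL to the
    old policy) estimated on the collected batch.
    """
    old_surrogate, _ = surrogate_and_kl(theta)
    for j in range(max_backtracks):
        candidate = theta + (alpha ** j) * full_step
        surrogate, kl = surrogate_and_kl(candidate)
        # Accept the smallest j whose step keeps the KL within the trust
        # region and does not decrease the surrogate objective.
        if kl <= delta and surrogate >= old_surrogate:
            return candidate
    return theta  # no acceptable step found; keep the old parameters

# Toy usage with a quadratic stand-in for the surrogate and KL estimates.
theta = np.zeros(2)
full_step = np.array([0.5, -0.3])
toy = lambda th: (-np.sum((th - 1.0) ** 2), 0.5 * np.sum(th ** 2))
print(line_search(theta, full_step, toy, delta=0.1))
```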