Definition

The performance of DDPG does not improve monotonically. TRPO addresses this problem by adopting two concepts: the Minorize-Maximization Algorithm and a trust region. TRPO updates the policy parameter within a trust region in policy space, which guarantees monotonic improvement of the objective function (the expected return). Although TRPO achieves high performance, its implementation is complex, which limits its practicality.

TRPO starts from the maximization problem of the expected return

$$\max_\theta \; J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right].$$

The original problem is approximated several times to obtain a surrogate function that can be solved with the MM Algorithm. A surrogate satisfying the trust region constraint is found using the conservative policy iteration update

$$L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) = J(\pi_{\theta_{\text{old}}}) + \sum_s \rho_{\pi_{\theta_{\text{old}}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\pi_{\theta_{\text{old}}}}(s, a),$$

where $L_{\pi_{\theta_{\text{old}}}}$ is a local approximation of $J(\pi_\theta)$, and $\rho_{\pi_{\theta_{\text{old}}}}$ is the State Visitation Frequency.
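
In practice the surrogate is estimated by averaging importance-weighted advantages over collected state-action pairs. The following is a minimal sketch of that estimate, assuming per-sample log-probabilities under the new and old policies and already-estimated advantages; the function name, array names, and toy numbers are illustrative, not from the source:

```python
import numpy as np

def surrogate_objective(new_logp, old_logp, advantages):
    """Sample-based estimate of the surrogate objective.

    new_logp, old_logp: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t)
    for the actions actually taken; advantages: estimated A(s_t, a_t).
    """
    # Importance-sampling ratio pi_theta / pi_theta_old for each sample.
    ratio = np.exp(new_logp - old_logp)
    # Averaging the ratio-weighted advantages approximates the surrogate L.
    return np.mean(ratio * advantages)

# Toy usage with random numbers standing in for real rollout data.
rng = np.random.default_rng(0)
old_logp = rng.normal(-1.0, 0.1, size=128)
new_logp = old_logp + rng.normal(0.0, 0.01, size=128)
adv = rng.normal(0.0, 1.0, size=128)
print(surrogate_objective(new_logp, old_logp, adv))
```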

The KL-penalized optimization problem is transformed into a constrained optimization problem by Duality.
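
For reference, using the TRPO paper's notation (which this section does not define explicitly), the two forms are roughly

$$\max_\theta \; \big[L_{\theta_{\text{old}}}(\theta) - C\, D_{\mathrm{KL}}(\theta_{\text{old}}, \theta)\big] \quad\longleftrightarrow\quad \max_\theta \; L_{\theta_{\text{old}}}(\theta) \;\;\text{s.t.}\;\; D_{\mathrm{KL}}(\theta_{\text{old}}, \theta) \le \delta,$$

where $C$ is the penalty coefficient and $\delta$ is the trust-region radius.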

After these approximations, the final optimization problem is

$$\max_\theta \; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A_{\theta_{\text{old}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\right] \le \delta.$$

The optimization problem is solved with the natural policy gradient, and the monotonic improvement is checked with Backtracking Line Search:

$$\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\, F^{-1} g,$$

where $g$ is the policy gradient, $F$ is the Fisher Information Metric, $\delta$ is the constraint threshold, $\alpha \in (0, 1)$ is the backtracking coefficient, and $j$ is the smallest nonnegative integer for which the constraint is satisfied.
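
The quantity $F^{-1} g$ is typically obtained by solving $Fx = g$ with the conjugate gradient method using Fisher-vector products, rather than by inverting $F$; this is standard practice, not something stated in this section. A minimal sketch, where the Fisher-vector-product function, the toy matrix, and the numbers are all assumptions for illustration:

```python
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at zero)
    p = g.copy()          # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# Toy example: the Fisher matrix is known, so F v is a plain matrix product.
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
delta = 0.01  # KL constraint threshold

x = conjugate_gradient(lambda v: F @ v, g)      # x ~= F^{-1} g
step = np.sqrt(2 * delta / (x @ F @ x)) * x     # proposed natural-gradient step
print(step)
```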

TRPO is performed by repeating 3 steps in each iteration

  1. Collect trajectories on the current policy $\pi_{\theta_k}$, and update the $Q$-values.
  2. By averaging over samples, construct the estimated objective and constraint (a sketch of the constraint estimate follows this list).
  3. Approximately solve the constrained optimization problem with the natural policy gradient and Backtracking Line Search to update the policy parameter $\theta$.
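
The constraint estimate in step 2 is just the KL divergence averaged over the sampled states. A minimal sketch for a discrete-action policy, assuming the old and new action probabilities at each sampled state are available as arrays (the function name and the toy data are illustrative):

```python
import numpy as np

def mean_kl_categorical(old_probs, new_probs, eps=1e-8):
    """Average KL( pi_old(.|s) || pi_new(.|s) ) over a batch of visited states.

    old_probs, new_probs: arrays of shape (batch, n_actions) holding the
    action distributions of the old and new policy at each sampled state.
    """
    log_ratio = np.log(old_probs + eps) - np.log(new_probs + eps)
    kl_per_state = np.sum(old_probs * log_ratio, axis=1)
    return np.mean(kl_per_state)

# Toy usage: two slightly different action distributions over 3 actions.
old_p = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])
new_p = np.array([[0.25, 0.45, 0.3], [0.55, 0.35, 0.1]])
print(mean_kl_categorical(old_p, new_p))
```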

Algorithm

  1. Initialize policy parameters $\theta_0$.
  2. For $k = 0, 1, 2, \ldots$:
    1. Collect a set of trajectories under the current policy $\pi_{\theta_k}$.
    2. Compute advantages using an advantage estimation algorithm.
    3. Compute the policy gradient $\hat{g}_k$ and Fisher Information Metric $\hat{F}_k$.
    4. Compute the natural gradient $\hat{x}_k = \hat{F}_k^{-1} \hat{g}_k$.
    5. Estimate the proposed step $\Delta_k = \sqrt{\dfrac{2\delta}{\hat{x}_k^\top \hat{F}_k \hat{x}_k}}\, \hat{x}_k$.
    6. Perform Backtracking Line Search to obtain the final update: find the minimum $j \in \{0, 1, 2, \ldots\}$ such that $\theta_k + \alpha^j \Delta_k$ satisfies the KL constraint (and improves the surrogate objective).
    7. Update the policy: $\theta_{k+1} = \theta_k + \alpha^j \Delta_k$.
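
A minimal sketch of steps 6-7, assuming a helper `surrogate_and_kl(theta)` that returns the batch estimates of the surrogate objective and the mean KL to the old policy; the helper, parameter names, and toy numbers are hypothetical stand-ins, not part of the original text:

```python
import numpy as np

def line_search(theta, full_step, surrogate_and_kl, delta,
                alpha=0.8, max_backtracks=10):
    """Backtracking line search for the policy update.

    surrogate_and_kl(theta) must return (surrogate value, mean KL to the
    old policy) estimated on the collected batch.
    """
    old_surrogate, _ = surrogate_and_kl(theta)
    for j in range(max_backtracks):
        candidate = theta + (alpha ** j) * full_step
        surrogate, kl = surrogate_and_kl(candidate)
        # Accept the smallest j whose step keeps the KL within the trust
        # region and does not decrease the surrogate objective.
        if kl <= delta and surrogate >= old_surrogate:
            return candidate
    return theta  # no acceptable step found; keep the old parameters

# Toy usage with a quadratic stand-in for the surrogate and KL estimates.
theta = np.zeros(2)
full_step = np.array([0.5, -0.3])
toy = lambda th: (-np.sum((th - 1.0) ** 2), 0.5 * np.sum(th ** 2))
print(line_search(theta, full_step, toy, delta=0.1))
```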