Definition

A3C (Asynchronous Advantage Actor-Critic Method) is an Actor-Critic Method that utilizes multiple networks which are a global network and multiple worker agents working in parallel across multiple instances of the environment, and agents update asynchronously the global network parameter. The parallelism reduce each agent’s temporal correlation. The return estimates the Advantage Function using the n-Step Return instead of Action-Value Function .

Actor (Policy Gradient Update)

where is an Entropy that encourages exploration

Critic (Value Network Update)

Algorithm

For each worker agent

  1. Let be global network’s shared parameters, and let be a worker agent’s parameters.
  2. Set the hyperparameters: step-sizes , discount factor , regularization factor , maximum steps per update .
  3. Repeat:
    1. Reset gradients and .
    2. Synchronize parameters and .
    3. Set and get state .
    4. For :
      1. Select action according to policy .
      2. Take the action and observe a reward and a next state .
    5. (or for terminal )
    6. For :
      1. .
      2. Accumulate gradients with respect to
      3. Accumulate gradients with respect to
    7. Update asynchronously and .