Definition

A3C (Asynchronous Advantage Actor-Critic Method) is an Actor-Critic Method that uses a global network together with multiple worker agents, each running in parallel on its own instance of the environment. Each worker computes gradients locally and applies them to the global network's parameters asynchronously. Because the workers are in different states at any given time, the parallelism reduces the temporal correlation that a single agent's consecutive samples would have. The Advantage Function $Q (s_{t}, a_{t}) - V (s_{t})$ is estimated by $G_{t : t + n} - V (s_{t})$ , which uses the n-Step Return $G_{t : t + n} = r_{t + 1} + γ r_{t + 2} + \dots + γ^{n - 1} r_{t + n} + γ^{n} V (s_{t + n})$ in place of the Action-Value Function $Q (s_{t}, a_{t})$ .
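As a small worked example of the advantage estimate above, the following sketch computes an n-Step Return by folding the rewards backwards from a bootstrap value. The function name `n_step_return` and all numbers are made up for illustration.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Return r_{t+1} + gamma*r_{t+2} + ... + gamma^n * V(s_{t+n}).

    `rewards` holds r_{t+1}, ..., r_{t+n}; `bootstrap_value` is V(s_{t+n}).
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative numbers only (n = 3).
rewards = [1.0, 0.0, 2.0]
v_t = 1.5          # V(s_t), the critic's estimate at the start state
v_t_plus_n = 4.0   # V(s_{t+3}), the bootstrap value
gamma = 0.5

g = n_step_return(rewards, v_t_plus_n, gamma)  # 1 + 0.5*0 + 0.25*2 + 0.125*4 = 2.0
advantage = g - v_t                            # 2.0 - 1.5 = 0.5
```

A positive advantage means the sampled trajectory did better than the critic expected from $s_{t}$ , so the action taken there is made more likely.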

Actor (Policy Gradient Update)

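$Δ θ = α [(G_{t : t + n} - V (s_{t}; ϕ)) \nabla_{θ} \ln π_{θ} (a_{t} ∣ s_{t}) + λ \nabla_{θ} H (π_{θ} (\cdot ∣ s_{t}))]$ where $H (p) = - \int p (x) \ln p (x) d x$ (a sum for discrete actions) is the entropy of the policy distribution. The entropy term encourages exploration by penalizing policies that become deterministic too early.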
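A minimal sketch of one actor step, assuming a softmax policy over discrete actions whose parameters $θ$ are the logits for a single state. This parameterization is an assumption chosen so the gradients have closed forms; it is not prescribed by the note. The helper names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_step(logits, action, advantage, alpha, lam):
    """One update: logits + alpha * (advantage * grad ln pi + lam * grad H)."""
    probs = softmax(logits)
    h = entropy(probs)
    new_logits = []
    for k, p in enumerate(probs):
        # For softmax logits: d ln pi(a) / d z_k = 1[k == a] - pi_k
        grad_log_pi = (1.0 if k == action else 0.0) - p
        # For softmax logits: d H / d z_k = -pi_k * (ln pi_k + H)
        grad_entropy = -p * (math.log(p) + h) if p > 0.0 else 0.0
        new_logits.append(
            logits[k] + alpha * (advantage * grad_log_pi + lam * grad_entropy)
        )
    return new_logits


# A positive advantage raises the probability of the chosen action.
updated = actor_step([0.0, 0.0], action=0, advantage=1.0, alpha=0.1, lam=0.01)
# With zero advantage, the entropy term alone pulls the logits closer together.
flattened = actor_step([1.0, 0.0], action=0, advantage=0.0, alpha=0.1, lam=1.0)
```

At a uniform policy the entropy gradient vanishes, since entropy is already at its maximum there, so the first call moves the logits only through the advantage term.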

Critic (Value Network Update)

$Δ ϕ = β (G_{t : t + n} - V (s_{t}; ϕ)) \nabla_{ϕ} V (s_{t}; ϕ)$
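A minimal sketch of the critic step, assuming a linear value function $V (s; ϕ) = ϕ^{\top} x (s)$ so that $\nabla_{ϕ} V (s; ϕ) = x (s)$ . The linear form and the feature vector are assumptions for illustration only.

```python
def critic_step(phi, features, target, beta):
    """Move V(s; phi) = phi . features toward the n-step return `target`."""
    value = sum(w * x for w, x in zip(phi, features))
    error = target - value  # G_{t:t+n} - V(s_t; phi)
    # The gradient of a linear V with respect to phi is the feature vector.
    return [w + beta * error * x for w, x in zip(phi, features)]


phi = [0.0, 0.0]
features = [1.0, 2.0]
new_phi = critic_step(phi, features, target=4.0, beta=0.125)  # -> [0.5, 1.0]
```

After the step the predicted value for these features is 2.5, closer to the target of 4.0 than the initial prediction of 0.0.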

Algorithm

For each worker agent

Let $θ, ϕ$ be global network’s shared parameters, and let $θ^{'}, ϕ^{'}$ be a worker agent’s parameters.
Set the hyperparameters: step sizes $α, β > 0$ , discount factor $0 < γ \leq 1$ , entropy regularization coefficient $λ \geq 0$ , and maximum number of steps per update $t_{max}$ .
Repeat:
1. Reset gradients $d θ \leftarrow 0$ and $d ϕ \leftarrow 0$ .
2. Synchronize the worker's parameters with the global ones: $θ^{'} \leftarrow θ$ and $ϕ^{'} \leftarrow ϕ$ .
3. Set $t_{start} \leftarrow t$ and get state $s_{t}$ .
4. While $s_{t}$ is not terminal and $t - t_{start} < t_{max}$ :
  1. Select action $a_{t}$ according to policy $π (\cdot ∣ s_{t}; θ^{'})$ .
  2. Take the action $a_{t}$ and observe a reward $r_{t + 1}$ and a next state $s_{t + 1}$ .
  3. $t \leftarrow t + 1$
5. $R = V (s_{t}; ϕ^{'})$ (or $0$ for terminal $s_{t}$ )
6. For $i = t - 1, t - 2, \dots, t_{start}$ :
  1. $R \leftarrow r_{i + 1} + γ R$ .
  2. Accumulate gradients with respect to $θ^{'}$ : $d θ \leftarrow d θ + (R - V (s_{i}; ϕ^{'})) \nabla_{θ^{'}} \ln π_{θ^{'}} (a_{i} ∣ s_{i}) + λ \nabla_{θ^{'}} H (π_{θ^{'}} (\cdot ∣ s_{i}))$
  3. Accumulate gradients with respect to $ϕ^{'}$ (the gradient of the squared error $\frac{1}{2} (R - V (s_{i}; ϕ^{'}))^{2}$ ): $d ϕ \leftarrow d ϕ - (R - V (s_{i}; ϕ^{'})) \nabla_{ϕ^{'}} V (s_{i}; ϕ^{'})$
7. Update asynchronously $θ \leftarrow θ + α d θ$ and $ϕ \leftarrow ϕ - β d ϕ$ .
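The steps above can be sketched with Python threads on a toy problem. Everything environment-specific here is an assumption for illustration: a hypothetical four-state chain where action 1 moves right and reaching the last state yields reward 1, tabular parameters (one logit per state-action pair, one value per state), and a short lock around reads and writes of the shared lists in place of fully lock-free updates.

```python
import math
import random
import threading

N_STATES, N_ACTIONS, GOAL = 4, 2, 3  # hypothetical toy chain environment


def env_step(state, action):
    """Action 1 moves right, action 0 moves left; reaching GOAL ends the episode."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def worker(theta, phi, lock, seed, updates,
           alpha=0.1, beta=0.1, gamma=0.9, lam=0.01, t_max=5):
    rng = random.Random(seed)
    state = 0
    for _ in range(updates):
        # Steps 1-2: reset gradients and copy the global parameters.
        d_theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
        d_phi = [0.0] * N_STATES
        with lock:
            local_theta = [row[:] for row in theta]
            local_phi = phi[:]
        # Steps 3-4: act for at most t_max steps or until a terminal state.
        rollout, done = [], False
        while not done and len(rollout) < t_max:
            probs = softmax(local_theta[state])
            action = rng.choices(range(N_ACTIONS), weights=probs)[0]
            nxt, reward, done = env_step(state, action)
            rollout.append((state, action, reward))
            state = nxt
        # Step 5: bootstrap from the critic unless the episode ended.
        ret = 0.0 if done else local_phi[state]
        # Step 6: walk the rollout backwards, accumulating gradients.
        for s, a, r in reversed(rollout):
            ret = r + gamma * ret
            adv = ret - local_phi[s]
            probs = softmax(local_theta[s])
            ent = -sum(p * math.log(p) for p in probs if p > 0.0)
            for k, p in enumerate(probs):
                grad_log_pi = (1.0 if k == a else 0.0) - p
                grad_ent = -p * (math.log(p) + ent) if p > 0.0 else 0.0
                d_theta[s][k] += adv * grad_log_pi + lam * grad_ent
            d_phi[s] -= adv  # tabular critic: dV(s)/dphi[s] = 1
        # Step 7: apply the accumulated gradients to the global parameters.
        with lock:
            for s in range(N_STATES):
                phi[s] -= beta * d_phi[s]
                for k in range(N_ACTIONS):
                    theta[s][k] += alpha * d_theta[s][k]
        if done:
            state = 0  # start a new episode


theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # global policy logits
phi = [0.0] * N_STATES                                # global state values
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(theta, phi, lock, seed, 500))
           for seed in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Workers never wait for one another between updates, so each one may compute gradients from parameters that are already slightly stale by the time it writes them back. Because thread scheduling varies between runs, the learned numbers are not reproducible exactly, but the policy in the state next to the goal should come to prefer moving right.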
