Definition
Actor (Policy Gradient Update)
$$\Delta\theta = \alpha \left( r_{t+1} + \gamma V(s_{t+1};\phi) - V(s_t;\phi) \right) \nabla_\theta \ln \pi_\theta(a_t \mid s_t)$$
Critic (Value Network Update)
$$\Delta\phi = \beta \left( r_{t+1} + \gamma V(s_{t+1};\phi) - V(s_t;\phi) \right) \nabla_\phi V(s_t;\phi)$$
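As a minimal sketch of these two updates, assuming linear function approximation $V(s;\phi) = \phi^\top x(s)$ for some feature vector $x(s)$ (the function names and the feature map below are illustrative, not from the text):

```python
import numpy as np

# Sketch of the two updates above under linear function approximation:
# V(s; phi) = phi @ x_s, where x_s is a feature vector for state s.
# All names here are illustrative assumptions.

def td_error(r, x_s, x_s_next, phi, gamma, terminal=False):
    """delta = r + gamma * V(s'; phi) - V(s; phi), with V(terminal) = 0."""
    v_next = 0.0 if terminal else phi @ x_s_next
    return r + gamma * v_next - phi @ x_s

def critic_step(phi, beta, delta, x_s):
    """phi <- phi + beta * delta * grad_phi V(s; phi); the gradient is x_s."""
    return phi + beta * delta * x_s

def actor_step(theta, alpha, delta, grad_log_pi):
    """theta <- theta + alpha * delta * grad_theta ln pi_theta(a | s)."""
    return theta + alpha * delta * grad_log_pi
```

Both updates are driven by the same TD error $\delta$: the critic moves $V(s;\phi)$ toward the bootstrapped target, and the actor increases the log-probability of the taken action in proportion to $\delta$.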
Algorithm
- Initialize the critic $V(s;\phi)$ and the actor $\pi(a \mid s;\theta)$ networks with random weights.
- Set the hyperparameters: step sizes $\alpha, \beta > 0$ and discount factor $0 < \gamma \le 1$.
- Repeat for each episode, starting from an initial state $s$ and acting under the policy $\pi_\theta$ (a runnable sketch of this loop follows the list):
- $I \leftarrow 1$
- Repeat for each step $t = 0, 1, \ldots, T-1$ of the episode, until the terminal state is reached:
- Select an action $a$ according to the policy $\pi(\cdot \mid s;\theta)$.
- Take action $a$ and observe the reward $r$ and the next state $s'$.
- $\delta \leftarrow r + \gamma V(s';\phi) - V(s;\phi)$ (taking $V(s';\phi) = 0$ if $s'$ is terminal)
- $\phi \leftarrow \phi + \beta \delta \nabla_\phi V(s;\phi)$
- $\theta \leftarrow \theta + \alpha I \delta \nabla_\theta \ln \pi(a \mid s;\theta)$
- $I \leftarrow \gamma I$
- $s \leftarrow s'$
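Below is a runnable sketch of this loop in Python, using a toy five-state chain MDP invented here for illustration (move left or right, reward 1 at the rightmost state) and a tabular parameterization: `theta[s, a]` holds softmax action preferences and `phi[s]` holds the state values. None of these names or the environment come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
alpha, beta, gamma = 0.1, 0.2, 0.99

theta = np.zeros((N_STATES, N_ACTIONS))  # actor: softmax preferences
phi = np.zeros(N_STATES)                 # critic: one value per state

def policy(s):
    """Softmax policy pi(. | s; theta) over the actions."""
    prefs = theta[s] - theta[s].max()    # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Move left (a=0) or right (a=1); reward 1 on reaching the goal state."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s_next == GOAL
    return s_next, float(done), done

for episode in range(500):
    s, I, done = 0, 1.0, False
    while not done:
        p = policy(s)
        a = rng.choice(N_ACTIONS, p=p)            # select a ~ pi(. | s; theta)
        s_next, r, done = step(s, a)              # take a, observe r and s'
        v_next = 0.0 if done else phi[s_next]     # V(terminal) = 0
        delta = r + gamma * v_next - phi[s]       # TD error
        phi[s] += beta * delta                    # critic update (tabular gradient is 1)
        grad_log_pi = -p
        grad_log_pi[a] += 1.0                     # grad_theta ln pi for a softmax policy
        theta[s] += alpha * I * delta * grad_log_pi
        I *= gamma                                # I <- gamma * I
        s = s_next
```

With a tabular critic, $\nabla_\phi V(s;\phi)$ reduces to the indicator of the visited state, and for a softmax actor $\nabla_\theta \ln \pi(a \mid s;\theta)$ is the one-hot vector for $a$ minus the action probabilities, which is exactly what the two update lines compute.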