Definition
Actor (Policy Gradient Update)
$$\Delta\theta = \alpha \left( r_{t+1} + \gamma V(s_{t+1};\phi) - V(s_t;\phi) \right) \nabla_\theta \ln \pi_\theta(a_t \mid s_t)$$
Critic (Value Network Update)
$$\Delta\phi = \beta \left( r_{t+1} + \gamma V(s_{t+1};\phi) - V(s_t;\phi) \right) \nabla_\phi V(s_t;\phi)$$
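As a minimal sketch of these two updates, assuming linear function approximation $V(s;\phi) = \phi^\top x(s)$ for some feature vector $x(s)$ (the function names and the feature map below are illustrative, not from the text):

```python
import numpy as np

# Sketch of the two updates above under linear function approximation:
# V(s; phi) = phi @ x_s, where x_s is a feature vector for state s.
# All names here are illustrative assumptions.

def td_error(r, x_s, x_s_next, phi, gamma, terminal=False):
    """delta = r + gamma * V(s'; phi) - V(s; phi), with V(terminal) = 0."""
    v_next = 0.0 if terminal else phi @ x_s_next
    return r + gamma * v_next - phi @ x_s

def critic_step(phi, beta, delta, x_s):
    """phi <- phi + beta * delta * grad_phi V(s; phi); the gradient is x_s."""
    return phi + beta * delta * x_s

def actor_step(theta, alpha, delta, grad_log_pi):
    """theta <- theta + alpha * delta * grad_theta ln pi_theta(a | s)."""
    return theta + alpha * delta * grad_log_pi
```

Both updates are driven by the same TD error $\delta$: the critic moves $V(s;\phi)$ toward the bootstrapped target, and the actor increases the log-probability of the taken action in proportion to $\delta$.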
Algorithm
- Initialize the critic $V(s;\phi)$ and the actor $\pi(a \mid s;\theta)$ networks with random weights.
- Set the hyperparameters: step sizes $\alpha, \beta > 0$ and discount factor $0 < \gamma \le 1$.
- Repeat for each episode, starting from an initial state $s$ and acting under the policy $\pi_\theta$ (a runnable sketch of this loop follows the list):
- $I \leftarrow 1$
- Repeat for each step $t = 0, 1, \ldots, T-1$ of the episode, until the terminal state is reached:
- Select an action $a$ according to the policy $\pi(\cdot \mid s;\theta)$.
- Take action $a$ and observe the reward $r$ and the next state $s'$.
- $\delta \leftarrow r + \gamma V(s';\phi) - V(s;\phi)$ (taking $V(s';\phi) = 0$ if $s'$ is terminal)
- $\phi \leftarrow \phi + \beta \delta \nabla_\phi V(s;\phi)$
- $\theta \leftarrow \theta + \alpha I \delta \nabla_\theta \ln \pi(a \mid s;\theta)$
- $I \leftarrow \gamma I$
- $s \leftarrow s'$
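Below is a runnable sketch of this loop in Python, using a toy five-state chain MDP invented here for illustration (move left or right, reward 1 at the rightmost state) and a tabular parameterization: `theta[s, a]` holds softmax action preferences and `phi[s]` holds the state values. None of these names or the environment come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
alpha, beta, gamma = 0.1, 0.2, 0.99

theta = np.zeros((N_STATES, N_ACTIONS))  # actor: softmax preferences
phi = np.zeros(N_STATES)                 # critic: one value per state

def policy(s):
    """Softmax policy pi(. | s; theta) over the actions."""
    prefs = theta[s] - theta[s].max()    # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Move left (a=0) or right (a=1); reward 1 on reaching the goal state."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s_next == GOAL
    return s_next, float(done), done

for episode in range(500):
    s, I, done = 0, 1.0, False
    while not done:
        p = policy(s)
        a = rng.choice(N_ACTIONS, p=p)            # select a ~ pi(. | s; theta)
        s_next, r, done = step(s, a)              # take a, observe r and s'
        v_next = 0.0 if done else phi[s_next]     # V(terminal) = 0
        delta = r + gamma * v_next - phi[s]       # TD error
        phi[s] += beta * delta                    # critic update (tabular gradient is 1)
        grad_log_pi = -p
        grad_log_pi[a] += 1.0                     # grad_theta ln pi for a softmax policy
        theta[s] += alpha * I * delta * grad_log_pi
        I *= gamma                                # I <- gamma * I
        s = s_next
```

With a tabular critic, $\nabla_\phi V(s;\phi)$ reduces to the indicator of the visited state, and for a softmax actor $\nabla_\theta \ln \pi(a \mid s;\theta)$ is the one-hot vector for $a$ minus the action probabilities, which is exactly what the two update lines compute.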