Definition

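REINFORCE with Baseline is a variant of the REINFORCE algorithm that reduces the variance of Policy Gradient methods by borrowing the structure of the Actor-Critic Method: a policy network (actor) is trained alongside a learned value network (critic). Unlike a full actor-critic method, the critic is used only as a baseline and not for bootstrapping; both updates still use the Monte Carlo return $G_{t}$. The algorithm modifies the REINFORCE gradient estimate by subtracting a baseline from the return. Because the baseline does not depend on the action, $E_{a \sim π_{θ}} [b (s_{t}) \nabla_{θ} \ln π_{θ} (a ∣ s_{t})] = b (s_{t}) \nabla_{θ} \sum_{a} π_{θ} (a ∣ s_{t}) = 0$, so the subtraction reduces the variance of the gradient without introducing bias.

$\nabla_{θ} J (θ) = \nabla_{θ} E_{π_{θ}} [r (τ)] = E_{π_{θ}} [\sum_{t = 0}^{T - 1} (G_{t} - b (s_{t})) \nabla_{θ} \ln π_{θ} (a_{t} ∣ s_{t})]$

where $b (s_{t})$ is a baseline function that does not depend on $a_{t}$ (commonly chosen as the State-Value Function $V (s) = E_{π_{θ}} [G_{t} ∣ S_{t} = s]$).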
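The zero-bias claim can be checked numerically. The sketch below assumes a single state with three actions and a softmax policy over raw logits; the logits and per-action returns are made-up illustrative values, and `expected_gradient` is a hypothetical helper that computes the exact expectation over actions rather than sampling.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def grad_log_softmax(logits, action):
    # Gradient of ln pi(action) w.r.t. the logits: one_hot(action) - pi
    g = -softmax(logits)
    g[action] += 1.0
    return g

def expected_gradient(logits, returns, baseline):
    # Exact expectation over actions of (G(a) - b) * grad ln pi(a)
    pi = softmax(logits)
    total = np.zeros_like(logits)
    for a in range(len(logits)):
        total += pi[a] * (returns[a] - baseline) * grad_log_softmax(logits, a)
    return total

logits = np.array([0.2, -0.5, 1.0])   # illustrative policy parameters
returns = np.array([1.0, 3.0, 2.0])   # illustrative return for each action

value = float(softmax(logits) @ returns)          # V(s) under this policy
g_no_baseline = expected_gradient(logits, returns, 0.0)
g_with_baseline = expected_gradient(logits, returns, value)

# The two expected gradients agree: the baseline term averages to zero.
print(np.allclose(g_no_baseline, g_with_baseline))
```

Any action-independent constant gives the same expected gradient; only the spread of the individual samples around that expectation changes.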

Actor (Policy Gradient Update)

$Δ θ = α (G_{t} - V (s_{t}; ϕ)) \nabla_{θ} \ln π_{θ} (a_{t} ∣ s_{t})$
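As a minimal sketch of this update, assume a linear softmax policy with logits `theta @ x`, where `x` is a feature vector for $s_{t}$. The parameterization, the function name `actor_update`, and the numbers in the usage lines are assumptions for illustration, not part of the note.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def actor_update(theta, x, action, G, v, alpha):
    """One policy step: theta + alpha * (G - v) * grad ln pi(action | x).

    theta: (n_actions, n_features) weights of a linear softmax policy
    x:     (n_features,) features of the state s_t
    G, v:  Monte Carlo return G_t and critic estimate V(s_t; phi)
    """
    pi = softmax(theta @ x)
    one_hot = np.zeros_like(pi)
    one_hot[action] = 1.0
    # For linear softmax, grad ln pi(a|s) = (one_hot(a) - pi) outer x
    grad_log_pi = np.outer(one_hot - pi, x)
    return theta + alpha * (G - v) * grad_log_pi

theta0 = np.zeros((2, 3))
x = np.array([1.0, 0.0, 0.0])
# Return above the baseline: the sampled action becomes more likely.
better = actor_update(theta0, x, action=0, G=2.0, v=1.0, alpha=0.1)
# Return below the baseline: the sampled action becomes less likely.
worse = actor_update(theta0, x, action=0, G=0.0, v=1.0, alpha=0.1)
```

The sign of $G_{t} - V (s_{t}; ϕ)$ decides whether the sampled action is reinforced or suppressed, which is what the baseline buys over plain REINFORCE when all returns share the same sign.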

Critic (Value Network Update)

$Δ ϕ = β (G_{t} - V (s_{t}; ϕ)) \nabla_{ϕ} V (s_{t}; ϕ)$
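A minimal sketch of the critic step, assuming a linear value function $V (s; ϕ) = ϕ^{\top} x$ so that $\nabla_{ϕ} V = x$. The linear form and the name `critic_update` are assumptions made for illustration.

```python
import numpy as np

def critic_update(phi, x, G, beta):
    """One value step: phi + beta * (G - V(s; phi)) * grad V(s; phi)."""
    delta = G - phi @ x          # Monte Carlo error G_t - V(s_t; phi)
    return phi + beta * delta * x  # grad of the linear V w.r.t. phi is x

phi0 = np.zeros(3)
x = np.array([1.0, 0.0, 0.0])
phi1 = critic_update(phi0, x, G=2.0, beta=0.5)

# The prediction for this state moves toward the observed return.
err_before = abs(2.0 - phi0 @ x)
err_after = abs(2.0 - phi1 @ x)
```

This is a stochastic gradient step on the squared error between the return and the prediction, so the error for the visited state shrinks whenever $β$ is small enough.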

Algorithm

Initialize state-value $V (s; ϕ)$ and policy $π (a ∣ s; θ)$ networks randomly.
Set the hyperparameters: step sizes $α, β > 0$ and discount factor $0 < γ \leq 1$.
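Repeat for each episode:
1. Generate an episode $s_{0}, a_{0}, r_{1}, \dots, s_{T - 1}, a_{T - 1}, r_{T}$ by following the policy $π_{θ}$.
2. Repeat for each step of the episode, $t = 0, 1, \dots, T - 1$ :
  1. $G_{t} \leftarrow \sum_{k = t + 1}^{T} γ^{k - t - 1} r_{k}$
  2. $δ \leftarrow G_{t} - V (s_{t}; ϕ)$
  3. $ϕ \leftarrow ϕ + β δ \nabla_{ϕ} V (s_{t}; ϕ)$ (a stochastic gradient step that minimizes $L (ϕ) = \frac{1}{2} E_{π} [\sum_{t = 0}^{T - 1} (G_{t} - V (s_{t}; ϕ))^{2}]$ )
  4. $θ \leftarrow θ + α γ^{t} δ \nabla_{θ} \ln π (a_{t} ∣ s_{t}; θ)$ (a stochastic gradient step that maximizes $J (θ)$ with the Policy Gradient $\nabla_{θ} J (θ) = E_{π_{θ}} [\sum_{t = 0}^{T - 1} (G_{t} - V (s_{t}; ϕ)) \nabla_{θ} \ln π_{θ} (a_{t} ∣ s_{t})]$ ; the factor $γ^{t}$ comes from the discounted objective and equals 1 when $γ = 1$, which recovers the Actor update above)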
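The loop above can be sketched end to end on a toy problem. Everything concrete here is an assumption for illustration: a four-state corridor with reward -1 per step and the goal at the right end, tabular parameters in place of networks, zero initialization in place of random initialization, a step cap that truncates long episodes, and arbitrary hyperparameter values.

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL, MAX_STEPS = 4, 2, 3, 100  # hypothetical toy corridor

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def run_episode(theta, rng):
    """Roll out one episode from state 0; action 1 moves right, 0 moves left."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(MAX_STEPS):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        states.append(s)
        actions.append(a)
        rewards.append(-1.0)                      # -1 per step until the goal
        s = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
        if s == GOAL:
            break
    return states, actions, rewards

def train(episodes=2000, alpha=0.05, beta=0.1, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))       # tabular policy logits
    phi = np.zeros(N_STATES)                      # tabular state values
    for _ in range(episodes):
        states, actions, rewards = run_episode(theta, rng)
        # Compute the returns G_t by accumulating rewards backwards.
        G, returns = 0.0, [0.0] * len(rewards)
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            returns[t] = G
        for t, (s, a) in enumerate(zip(states, actions)):
            delta = returns[t] - phi[s]
            phi[s] += beta * delta                # tabular grad of V is 1 at s
            grad_log_pi = -softmax(theta[s])      # one_hot(a) - pi
            grad_log_pi[a] += 1.0
            theta[s] += alpha * gamma**t * delta * grad_log_pi
    return theta, phi

theta, phi = train()
p_right = [softmax(theta[s])[1] for s in range(GOAL)]
print("P(right) per state:", np.round(p_right, 3))
print("V per state:", np.round(phi[:GOAL], 2))
```

In this setup the learned policy should favor moving right in every non-goal state, and the critic's values should rise toward the goal. With function approximators, the two tabular increments would be replaced by gradient steps on the network parameters.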
