Definition

Policy gradient algorithms directly learn the optimal policy by a parametric probability distribution $π_{θ} (a ∣ s)$ . The policy stochastically selects an action $a$ in a state $s$ according to a parameter $θ$ . It typically proceeds by sampling a stochastic policy and adjusting the parameter $θ$ in the direction of maximizing the total reward.

The objective function of policy gradient algorithm is defined as: $J (θ) = E_{π_{θ}} [r (τ)] = \int p (τ; θ) r (τ) d τ$ where $τ$ is a trajectory, $r (τ)$ is the total reward of the trajectory $τ$ , and $p (τ; θ) = π_{θ} (τ)$ is the PDF of $τ$ .

My Knowledge Base

Explorer

Policy Gradient

Definition

Graph View

Backlinks