Definition

Policy gradient algorithms directly learn the optimal policy by a parametric probability distribution . The policy stochastically selects an action in a state according to a parameter . It typically proceeds by sampling a stochastic policy and adjusting the parameter in the direction of maximizing the total reward.

The objective function of policy gradient algorithm is defined as: where is a trajectory, is the total reward of the trajectory , and is the PDF of .