Definition
Policy gradient algorithms directly learn the optimal policy by a parametric probability distribution . The policy stochastically selects an action in a state according to a parameter . It typically proceeds by sampling a stochastic policy and adjusting the parameter in the direction of maximizing the total reward.
The objective function of policy gradient algorithm is defined as: where is a trajectory, is the total reward of the trajectory , and is the PDF of .