Definition

Actor (Policy Gradient Update)
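
A common one-step form of the actor update is sketched below. The symbols are assumptions chosen for illustration and are not taken from this text: \theta for the policy parameters, \alpha for the actor step size, \gamma for the discount factor, V_w for the critic's value estimate, and \delta_t for the temporal-difference error.

```latex
\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)
\qquad
\theta \leftarrow \theta + \alpha \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

The critic's error \delta_t stands in for the return in the policy gradient, so an action that turns out better than the critic expected becomes more likely.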

Critic (Value Network Update)
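
A common one-step form of the critic update is sketched below. The symbols are assumptions chosen for illustration and are not taken from this text: w for the critic parameters, \beta for the critic step size, \gamma for the discount factor, and \delta_t for the temporal-difference error.

```latex
\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)
\qquad
w \leftarrow w + \beta \, \delta_t \, \nabla_w V_w(s_t)
```

At a terminal next state the bootstrap term \gamma V_w(s_{t+1}) is usually taken to be zero.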

Algorithm

  1. Initialize critic and actor networks randomly.
  2. Set the hyperparameters: the step sizes for the actor and the critic, and the discount factor.
  3. Repeat for each episode (each starts from an initial state and runs under the current policy):
    1. Repeat for each step of the episode until a terminal state is reached:
      1. Select an action according to the current policy.
      2. Take the action and observe the reward and the next state.
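
The loop above can be sketched in Python as follows. This is a minimal tabular illustration under stated assumptions, not a definitive implementation: the corridor environment, the names `step`, `softmax`, `train`, `alpha`, `beta`, and `gamma`, and the one-step temporal-difference updates applied after each transition are all hypothetical choices made for the example. Tables stand in for the actor and critic networks.

```python
import math
import random

# Hypothetical toy environment: a corridor of 5 states. The agent starts at
# state 0, and reaching the rightmost state ends the episode with reward 1.
N_STATES = 5
ACTIONS = (-1, +1)  # index 0 moves left, index 1 moves right


def step(state, action_index):
    """Return (reward, next_state, done) for one move in the corridor."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return (1.0 if done else 0.0), next_state, done


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(episodes=300, alpha=0.1, beta=0.2, gamma=0.95, seed=0, max_steps=200):
    """Run the episode/step loop; alpha, beta, gamma are assumed names."""
    rng = random.Random(seed)
    # Step 1: initialize actor (preferences) and critic (values) randomly.
    theta = [[rng.uniform(-0.01, 0.01) for _ in ACTIONS] for _ in range(N_STATES)]
    values = [rng.uniform(-0.01, 0.01) for _ in range(N_STATES)]

    for _ in range(episodes):            # Step 3: repeat for each episode.
        state = 0
        for _ in range(max_steps):       # Step 3.1: repeat for each step.
            # Step 3.1.1: select an action according to the current policy.
            probs = softmax(theta[state])
            action = rng.choices(range(len(ACTIONS)), weights=probs)[0]
            # Step 3.1.2: take the action, observe reward and next state.
            reward, next_state, done = step(state, action)

            # Assumed one-step TD updates (not part of the list above).
            target = reward + (0.0 if done else gamma * values[next_state])
            delta = target - values[state]
            values[state] += beta * delta                      # critic
            for a in range(len(ACTIONS)):                      # actor
                grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
                theta[state][a] += alpha * delta * grad_log_pi

            state = next_state
            if done:
                break
    return theta, values
```

Calling `theta, values = train()` returns the learned preference table and value table; `softmax(theta[s])` then gives the policy's action probabilities in state `s`. The `max_steps` cap is a safeguard added for the sketch so that an episode cannot run forever.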