Definition

Sarsa learns the action-value function $Q$ by using the Bellman expectation equation. Both the behavior policy and the target policy of Sarsa are derived from $Q$, so it is an on-policy method.
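For reference, the Bellman expectation equation for the action-value function of a policy $\pi$ can be written as below. The notation ($q_\pi$, the discount factor $\gamma$, and the time-indexed $S_t$, $A_t$, $R_{t+1}$) follows common textbook usage and is an assumption here, since this note does not spell it out.

```latex
q_\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma \, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]
```

Sarsa replaces the expectation with a single sampled transition $(S, A, R, S', A')$, which is where the name comes from.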

Algorithm

  1. Initialize $Q(s, a)$ arbitrarily for all states $s$ and actions $a$, and $Q(\text{terminal}, \cdot) = 0$.
  2. Repeat for each episode:
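    1. Initialize $S$.
    2. Choose an action $A$ from the initial state $S$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy).
    3. Repeat for each step of the episode until $S$ is terminal:
      1. Take the action $A$ and observe a reward $R$ and a next state $S'$.
      2. Choose an action $A'$ from the next state $S'$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy).
      3. Update $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$, then set $S \leftarrow S'$ and $A \leftarrow A'$. Here $\alpha$ is the step size and $\gamma$ is the discount factor.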
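
The steps above can be sketched in Python as a minimal tabular example. Everything specific in it is an assumption made for illustration and does not come from this note: the toy corridor environment, the names `step`, `epsilon_greedy`, and `sarsa`, and the hyperparameter values `alpha`, `gamma`, and `epsilon`.

```python
import random
from collections import defaultdict

# Hypothetical toy environment (not from the original text): a corridor of
# states 0..4. Action 0 moves left, action 1 moves right. Reaching state 4
# ends the episode with reward 1; every other step gives reward 0.
N_STATES = 5
ACTIONS = (0, 1)
TERMINAL = N_STATES - 1


def step(state, action):
    """Return (reward, next_state) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == TERMINAL else 0.0
    return reward, next_state


def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    # Break ties randomly so the all-zero initial table does not bias the agent.
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: Q defaults to 0 everywhere, including the terminal state.
    Q = defaultdict(float)
    for _ in range(episodes):
        state = 0                                        # step 2.1
        action = epsilon_greedy(Q, state, epsilon, rng)  # step 2.2
        while state != TERMINAL:                         # step 2.3
            reward, next_state = step(state, action)     # step 2.3.1
            # Step 2.3.2: the next action comes from the same policy.
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            # Step 2.3.3: on-policy TD update toward R + gamma * Q(S', A').
            td_target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state, action = next_state, next_action
    return Q


if __name__ == "__main__":
    Q = sarsa()
    for s in range(N_STATES - 1):
        print(s, [round(Q[(s, a)], 3) for a in ACTIONS])
```

Because the action used in the update target is the one the agent actually takes next, the same $\epsilon$-greedy policy serves as both behavior and target policy, which is the on-policy property described in the definition.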