Definition

Sarsa learns the action-value function $Q$ by using the Bellman expectation equation. Both the behavior policy and the target policy of Sarsa are derived from $Q$, so it is an on-policy method.
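
For reference, the Bellman expectation equation for the action-value function can be written as below. The notation is an assumption taken from the standard textbook convention, since the text above does not define it: $q_\pi$ is the true action-value function of policy $\pi$, $\gamma$ is the discount factor, and $S_t$, $A_t$, $R_{t+1}$ are the state, action, and reward at time step $t$.

$$
q_\pi(s, a) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma \, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]
$$

Sarsa does not compute this expectation explicitly. It uses a single sampled transition $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ as an estimate of it, which is where the name comes from.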
Algorithm
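- Initialize $Q(s, a)$ arbitrarily for all states $s$ and actions $a$, and set $Q(\text{terminal state}, \cdot) = 0$.
- Repeat for each episode:
  - Initialize $S$.
  - Choose an action $A$ from the initial state $S$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy).
  - Repeat for each step of the episode until $S$ is terminal:
    - Take the action $A$ and observe a reward $R$ and a next state $S'$.
    - Choose an action $A'$ from the next state $S'$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy).
    - Update $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$, where $\alpha$ is the step size and $\gamma$ is the discount factor, then set $S \leftarrow S'$ and $A \leftarrow A'$.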
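
The steps above can be sketched in Python as a minimal tabular implementation. Everything about the environment is an assumption made for illustration: a hypothetical 5-state chain where the agent starts at state 0, the last state is terminal, and every move costs a reward of -1. The helper names (`step`, `epsilon_greedy`, `sarsa_update`, `sarsa`) and the default hyperparameters are likewise illustrative, not part of the original algorithm description.

```python
import random
from collections import defaultdict

N_STATES = 5                 # hypothetical chain: states 0..4
TERMINAL = N_STATES - 1      # the last state is terminal
LEFT, RIGHT = 0, 1
ACTIONS = (LEFT, RIGHT)


def step(state, action):
    """Toy environment (an assumption): each move yields reward -1."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    return -1.0, next_state


def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])


def sarsa(episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Arbitrary initialization (all zeros here); terminal entries are never
    # updated, so they stay at 0.
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0                                    # initialize S
        a = epsilon_greedy(Q, s, epsilon, rng)   # choose A from S
        while s != TERMINAL:
            r, s_next = step(s, a)               # take A, observe R and S'
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)  # choose A' from S'
            sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma)
            s, a = s_next, a_next                # S <- S', A <- A'
    return Q
```

Note that the same ε-greedy policy both selects the action that is executed and supplies `a_next` for the update target, which is what makes this sketch on-policy. Calling `sarsa()` returns the learned table as a dictionary keyed by `(state, action)` pairs.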