Definition

$Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [R_{t + 1} + γ Q (S_{t + 1}, A_{t + 1}) - Q (S_{t}, A_{t})]$

Sarsa learns $Q^{π}$ by using the Bellman expectation equation: the target $R_{t + 1} + γ Q (S_{t + 1}, A_{t + 1})$ is a sampled estimate of the equation's right-hand side. Both the behavior policy and the target policy of Sarsa are derived from $Q$, so it is an on-policy method.
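
As a quick check of the update rule, here is a single step worked by hand. All numbers are assumed purely for illustration: $α = 0.1$, $γ = 0.9$, $Q (S_{t}, A_{t}) = 0.5$, $R_{t + 1} = 1$ and $Q (S_{t + 1}, A_{t + 1}) = 2$.

$R_{t + 1} + γ Q (S_{t + 1}, A_{t + 1}) = 1 + 0.9 \cdot 2 = 2.8$

$R_{t + 1} + γ Q (S_{t + 1}, A_{t + 1}) - Q (S_{t}, A_{t}) = 2.8 - 0.5 = 2.3$

$Q (S_{t}, A_{t}) \leftarrow 0.5 + 0.1 \cdot 2.3 = 0.73$

The estimate moves a fraction $α$ of the way from its old value toward the sampled target.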

Algorithm

Initialize $Q (s, a)$ arbitrarily for all $s \in S, a \in A (s)$, and set $Q (terminal, \cdot) = 0$.
Repeat for each episode:
1. Initialize $S$.
2. Choose an action $A$ from the initial state $S$ using a policy derived from $Q$ (e.g. $ϵ$-greedy).
3. Repeat for each step of the episode until $S$ is terminal:
  1. Take the action $A$ and observe a reward $R$ and a next state $S^{'}$.
  2. Choose an action $A^{'}$ from the next state $S^{'}$ using a policy derived from $Q$ (e.g. $ϵ$-greedy).
  3. $Q (S, A) \leftarrow Q (S, A) + α [R + γ Q (S^{'}, A^{'}) - Q (S, A)]$
  4. Update $S \leftarrow S^{'}$ and $A \leftarrow A^{'}$.
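
The steps above can be sketched in Python as a minimal tabular implementation. The environment is an assumption made up for this example and is not part of the note: a five-state chain where state 4 is terminal, the two actions move left or right, and every step gives a reward of -1. The names `step`, `epsilon_greedy`, `sarsa` and all default hyperparameters are likewise hypothetical.

```python
import random

# Hypothetical toy environment: a chain of states 0..4, where state 4 is terminal.
N_STATES = 5
TERMINAL = N_STATES - 1
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics: reward -1 per step, with a wall at state 0."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return -1.0, next_state


def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])


def sarsa(n_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize Q(s, a); zeros are one arbitrary choice, and Q(terminal, .) = 0.
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(n_episodes):
        S = 0                                    # 1. initialize S
        A = epsilon_greedy(Q, S, epsilon, rng)   # 2. choose A from S
        while S != TERMINAL:                     # 3. loop until S is terminal
            R, S_next = step(S, A)               # 3.1 take A, observe R and S'
            A_next = epsilon_greedy(Q, S_next, epsilon, rng)  # 3.2 choose A'
            # 3.3 Sarsa update toward the target R + gamma * Q(S', A')
            Q[S][A] += alpha * (R + gamma * Q[S_next][A_next] - Q[S][A])
            S, A = S_next, A_next                # 3.4 shift S and A
    return Q
```

Calling `Q = sarsa()` returns a table indexed as `Q[state][action]`. Because the same ε-greedy policy both selects the actions and supplies $A^{'}$ in the target, the learned values describe that exploring policy rather than the purely greedy one, which is what makes this on-policy.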
