Definition

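$$Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [R_{t + 1} + γ \max_{a} Q (S_{t + 1}, a) - Q (S_{t}, A_{t})]$$

Q-learning learns $Q^{*}$ directly by using the Bellman optimality equation as its update target. The behavior policy is derived from $Q$ (e.g. $ϵ$-greedy), while the target policy is greedy with respect to $Q$, i.e. it always selects $\arg\max_{a} Q (s, a)$. Because the policy being followed differs from the policy being learned, Q-learning is an off-policy method.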
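A single application of the update rule can be sketched in Python as follows. This is only an illustration: the function name `q_update`, the list-of-lists table layout, and the numbers used are assumptions made for the example and do not come from the note.

```python
def q_update(q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update to q[s][a] in place and return the TD error."""
    # Greedy (target-policy) value of the next state: max_a Q(S_{t+1}, a).
    td_target = r + gamma * max(q[s_next])
    td_error = td_target - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


# Hypothetical table with two states and two actions.
q = [[0.0, 0.0],
     [1.0, 2.0]]

# Target is 1.0 + 0.9 * 2.0 = 2.8, so Q(0, 1) moves halfway from 0 toward 2.8.
td = q_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(td, q[0][1])  # roughly 2.8 and 1.4, up to floating-point rounding
```

Note that the maximum is taken over the next state's action values regardless of which action the behavior policy will actually choose there; that is what makes the update off-policy.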

Algorithm

Initialize $Q (s, a)$ arbitrarily for all $s \in S$, $a \in A (s)$, and set $Q (\text{terminal}, \cdot) = 0$.
Repeat for each episode:
1. Initialize $S$ .
2. Repeat for each step of an episode until $S$ is terminal:
  1. Choose an action $A$ in the current state $S$ using a policy derived from $Q$ (e.g. $ϵ$-greedy).
  2. Take the action $A$ and observe the reward $R$ and the next state $S^{'}$.
  3. $Q (S, A) \leftarrow Q (S, A) + α [R + γ \max_{a} Q (S^{'}, a) - Q (S, A)]$
  4. Update $S \leftarrow S^{'}$
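
The steps above can be sketched as a small tabular implementation. This is a minimal sketch under stated assumptions, not a reference implementation: the five-state corridor environment, the `step` and `epsilon_greedy` helpers, and all hyperparameter values are hypothetical choices made only so the example is self-contained and runnable.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Toy environment (assumption): reward 1 on reaching the terminal state, else 0."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return reward, next_state


def epsilon_greedy(q, state, epsilon, rng):
    """Behavior policy derived from Q: explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties at random so the agent does not get stuck on an all-zero table.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize Q(s, a); the terminal row stays at 0 because it is never updated.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0                                   # 1. initialize S
        while state != N_STATES - 1:                # 2. until S is terminal
            action = epsilon_greedy(q, state, epsilon, rng)        # 2.1
            reward, next_state = step(state, action)               # 2.2
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])  # 2.3
            state = next_state                                     # 2.4
    return q


q = q_learning()
# Greedy (target) policy read off the learned table for non-terminal states.
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

In this toy corridor the learned greedy policy should choose "right" in every non-terminal state, even though the agent behaved $ϵ$-greedily while learning.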
