Definition

Q-learning's update target takes the action with the highest estimated value, which tends to overestimate the value of actions (maximization bias) in the early stages of learning, slowing the convergence to the optimal action values.
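The bias can be seen without any reinforcement learning machinery: if every action has a true value of 0 but we only have noisy estimates, the maximum of those estimates is positive on average. The numbers below (10 actions, 5 samples each) are illustrative assumptions, not from the source.

```python
import random

# Illustration of maximization bias: all actions have true expected
# value 0, yet the max over noisy sample-mean estimates is positively
# biased.
random.seed(0)

def max_of_noisy_estimates(n_actions=10, n_samples=5):
    # each action's estimate is the mean of a few noisy reward samples
    estimates = [
        sum(random.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
        for _ in range(n_actions)
    ]
    return max(estimates)

bias = sum(max_of_noisy_estimates() for _ in range(10_000)) / 10_000
print(f"average max estimate: {bias:.2f}")  # well above the true value 0
```

Double Q-learning removes this bias because the estimator that picks the maximizing action is independent of the estimator that scores it.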

Double Q-learning uses two Q-value functions to overcome this problem: one for action selection and another for value evaluation.

Algorithm

  1. Initialize Q1(s, a) and Q2(s, a) arbitrarily for all s ∈ S, a ∈ A(s), and Q1(terminal, ·) = Q2(terminal, ·) = 0. Choose a step size α ∈ (0, 1] and a small ε > 0.
  2. Repeat for each episode:
    1. Initialize S.
    2. Repeat for each step of the episode until S is terminal:
      1. Choose an action A from state S using the policy derived from Q1 and Q2 (e.g. ε-greedy in Q1 + Q2).
      2. Take the action A and observe a reward R and a next state S'.
      3. With probability 0.5:
           Q1(S, A) ← Q1(S, A) + α[R + γ Q2(S', argmax_a Q1(S', a)) − Q1(S, A)]
         else:
           Q2(S, A) ← Q2(S, A) + α[R + γ Q1(S', argmax_a Q2(S', a)) − Q2(S, A)]
      4. Update S ← S'.
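The update in step 3 can be sketched as follows. This is a minimal tabular implementation; the step size, discount, and ε values, and the helper names, are illustrative assumptions rather than prescribed by the source.

```python
import random
from collections import defaultdict

# Assumed hyperparameters for the sketch.
ALPHA, GAMMA, EPSILON = 0.1, 1.0, 0.1

def epsilon_greedy(Q1, Q2, state, actions, rng):
    # behave epsilon-greedily with respect to the sum Q1 + Q2
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])

def double_q_update(Q1, Q2, s, a, r, s_next, next_actions, rng):
    # with probability 0.5 update Q1 (evaluating with Q2), else update Q2
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap roles locally; the swapped-in Q1 is updated
    if next_actions:  # empty when s_next is terminal
        # Q1 selects the action, Q2 evaluates it
        best = max(next_actions, key=lambda a2: Q1[(s_next, a2)])
        target = r + GAMMA * Q2[(s_next, best)]
    else:
        target = r
    Q1[(s, a)] += ALPHA * (target - Q1[(s, a)])

# usage on a single terminal transition
rng = random.Random(0)
Q1, Q2 = defaultdict(float), defaultdict(float)
double_q_update(Q1, Q2, 's', 'a', 1.0, None, [], rng)
print(Q1[('s', 'a')] + Q2[('s', 'a')])  # one table moved 0.1 toward reward 1.0
```

The key design point is that the table used to choose the maximizing action is never the table used to evaluate it, which decouples selection noise from evaluation noise.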

Examples

Q-learning initially learns to take the left action much more often than the right action, despite the left action's lower true value. Even at asymptote, it takes the left action about 5% more often than is optimal. In contrast, double Q-learning is essentially unaffected by maximization bias.
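The environment behind this example is not specified above; the sketch below assumes the two-state MDP from Sutton and Barto's maximization-bias example: from start state A, "right" terminates with reward 0, while "left" moves to state B, from which every action terminates with a reward drawn from N(-0.1, 1). All hyperparameters are illustrative assumptions.

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 1.0, 0.1
B_ACTIONS = 10  # assumed number of actions available in state B

def update(Q1, Q2, s, a, r, s2, acts2, double, rng):
    # Q-learning when double=False; Double Q-learning when double=True
    if double and rng.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap which table is updated
    evaluator = Q2 if double else Q1
    if acts2:  # empty when s2 is terminal
        best = max(acts2, key=lambda x: Q1[(s2, x)])
        target = r + GAMMA * evaluator[(s2, best)]
    else:
        target = r
    Q1[(s, a)] += ALPHA * (target - Q1[(s, a)])

def run(n_episodes, double, rng):
    # returns the fraction of episodes in which "left" was taken in A
    Q1 = {('A', 'left'): 0.0, ('A', 'right'): 0.0}
    Q1.update({('B', a): 0.0 for a in range(B_ACTIONS)})
    Q2 = dict(Q1)
    def q(s, a):
        return Q1[(s, a)] + (Q2[(s, a)] if double else 0.0)
    left_taken = 0
    for _ in range(n_episodes):
        acts_A = ['left', 'right']
        a = (rng.choice(acts_A) if rng.random() < EPSILON
             else max(acts_A, key=lambda x: q('A', x)))
        if a == 'left':
            left_taken += 1
            acts_B = list(range(B_ACTIONS))
            update(Q1, Q2, 'A', 'left', 0.0, 'B', acts_B, double, rng)
            b = (rng.choice(acts_B) if rng.random() < EPSILON
                 else max(acts_B, key=lambda x: q('B', x)))
            update(Q1, Q2, 'B', b, rng.gauss(-0.1, 1.0), None, [], double, rng)
        else:
            update(Q1, Q2, 'A', 'right', 0.0, None, [], double, rng)
    return left_taken / n_episodes

rng = random.Random(0)
q_left = sum(run(100, False, rng) for _ in range(200)) / 200
dq_left = sum(run(100, True, rng) for _ in range(200)) / 200
print(f"left-action rate over first 100 episodes: "
      f"Q-learning {q_left:.2f}, Double Q-learning {dq_left:.2f}")
```

Run over many independent trials, plain Q-learning takes the suboptimal left action noticeably more often in early episodes than the double-estimator variant does.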