Definition
Double DQN is a variant of DQN that applies the idea of Double Q-learning to reduce the overestimation of action values. Here, the target network already present in DQN serves as the second network of Double Q-learning, so no additional network needs to be introduced.
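The difference between the two update targets can be illustrated directly. The sketch below uses made-up value estimates (the arrays and the reward are assumptions, not outputs of a trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, r = 0.99, 1.0

# Hypothetical action-value estimates for a next state s' over 5 actions.
q_online = rng.normal(size=5)   # behavior network  Q(s', ., theta)
q_target = rng.normal(size=5)   # target network    Q(s', ., theta^-)

# Standard DQN target: the target network both selects and evaluates
# the maximizing action, which tends to overestimate.
y_dqn = r + gamma * q_target.max()

# Double DQN target: the behavior network selects the action,
# the target network evaluates it.
a_star = int(np.argmax(q_online))
y_double = r + gamma * q_target[a_star]
```

Because `q_target[a_star]` can never exceed `q_target.max()`, the Double DQN target is never larger than the DQN target for the same estimates.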
Algorithm
- Initialize the behavior network $Q(s, a; \theta)$ and the target network $Q(s, a; \theta^-)$ with random weights $\theta$ and $\theta^- = \theta$, and the replay buffer $D$ to max size $N$.
- Repeat for each episode:
- Initialize sequence $s_1$.
- Repeat for each step of an episode until terminal, $t = 1, \ldots, T$:
- With probability $\epsilon$, select a random action $a_t$, otherwise select $a_t = \arg\max_a Q(s_t, a; \theta)$.
- Take the action $a_t$ and observe a reward $r_t$ and a next state $s_{t+1}$.
- Store transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $D$.
- Sample random minibatch of transitions $(s_j, a_j, r_j, s_{j+1})$ from $D$.
- $y_j = \begin{cases} r_j & \text{if } s_{j+1} \text{ is terminal} \\ r_j + \gamma\, Q\!\left(s_{j+1}, \arg\max_{a'} Q(s_{j+1}, a'; \theta); \theta^-\right) & \text{otherwise} \end{cases}$
- Perform Gradient Descent on loss $\left(y_j - Q(s_j, a_j; \theta)\right)^2$ with respect to $\theta$.
- Update the target network parameters $\theta^- \leftarrow \theta$ every $C$ steps.
- Update $s_t \leftarrow s_{t+1}$.
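The steps above can be sketched end to end. This is a minimal illustration, not a reference implementation: the chain environment, the tabular Q-tables standing in for the two networks, and all hyperparameters are assumptions chosen to keep the example self-contained and fast:

```python
import random
from collections import deque

import numpy as np

random.seed(0)
rng = np.random.default_rng(0)

# Toy deterministic chain MDP (a made-up stand-in for a real environment):
# states 0..4, actions 0 = left / 1 = right; reaching state 4 yields
# reward 1 and terminates the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

gamma, alpha, eps, C, BATCH = 0.9, 0.5, 0.1, 20, 8
q = np.zeros((N_STATES, N_ACTIONS))   # behavior "network" (tabular stand-in)
q_tgt = q.copy()                      # target "network"
buffer = deque(maxlen=1000)           # replay buffer D
total_steps = 0

for episode in range(300):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection with the behavior network
        # (ties broken at random so early exploration is unbiased).
        if rng.random() < eps:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(rng.choice(np.flatnonzero(q[s] == q[s].max())))
        s2, r, done = env_step(s, a)
        buffer.append((s, a, r, s2, done))

        # Sample a minibatch and apply the Double DQN target: the behavior
        # network selects the next action, the target network evaluates it.
        for sj, aj, rj, sj2, dj in random.sample(list(buffer), min(len(buffer), BATCH)):
            if dj:
                y = rj
            else:
                a_star = int(np.argmax(q[sj2]))       # select with behavior net
                y = rj + gamma * q_tgt[sj2, a_star]   # evaluate with target net
            q[sj, aj] += alpha * (y - q[sj, aj])      # gradient step on squared loss

        total_steps += 1
        if total_steps % C == 0:
            q_tgt = q.copy()   # periodic hard update of the target network
        s = s2

# The learned greedy policy should move right toward the goal from every
# non-terminal state.
policy = [int(np.argmax(q[s])) for s in range(N_STATES)]
print(policy)
```

With a neural network, the inner update would be a gradient-descent step on the squared loss; for a tabular Q-function the two coincide up to the learning rate.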