Definition


DQN (Deep Q-Network) approximates the action-value function with a neural network $Q(s, a; \theta)$ and is designed for problems with a large or continuous state space and a discrete action space.
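
A minimal sketch of such a Q-network in PyTorch, assuming a vector-valued state of dimension `state_dim` and `num_actions` discrete actions (both names and the layer sizes are placeholders): the network returns one Q-value per action in a single forward pass.

```python
# Minimal Q-network sketch for a continuous state space and a discrete
# action space. state_dim, num_actions, and hidden are placeholders.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        # One output unit per discrete action: Q(s, a; theta) for all a
        # is computed in a single forward pass.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, num_actions)
```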

Naive DQN

Naive DQN treats $r + \gamma \max_{a'} Q(s', a'; \theta)$ as a target and minimizes the MSE loss $L(\theta) = \big( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \big)^2$ by SGD.
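
As an illustration, here is a sketch of a single naive update step, assuming a PyTorch Q-network `q_net`, an `optimizer`, and batched tensors `s, a, r, s_next, done` (all names are assumptions). Note that the same parameters $\theta$ produce both the prediction and the target.

```python
# Sketch of the naive DQN update: the same network theta produces both the
# prediction Q(s, a; theta) and the bootstrapped target
# r + gamma * max_a' Q(s', a'; theta). All argument names are assumptions.
import torch
import torch.nn.functional as F

def naive_dqn_step(q_net, optimizer, gamma, s, a, r, s_next, done):
    # Prediction: Q(s, a; theta) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Target uses the *same* parameters theta (no target network).
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```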

Due to training instability and strongly correlated samples, naive DQN performs very poorly, even worse than a linear model.

Experience Replay

Online RL agents incrementally update their parameters while observing a stream of experience. This structure causes strongly temporally correlated updates, breaking the i.i.d. assumption, and makes the agent rapidly forget rare experiences that would be useful later on. Experience replay stores experiences in a replay buffer and randomly samples temporally uncorrelated minibatches from the buffer when learning.
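
A minimal replay buffer sketch (the capacity and batch size are illustrative choices, not values from the text): transitions are appended online, and minibatches are drawn uniformly at random, which breaks the temporal correlation between consecutive updates.

```python
# Minimal replay buffer: append transitions online, sample uniformly at
# random when learning. Capacity and batch size are illustrative choices.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 32):
        # Uniform random sampling -> approximately uncorrelated minibatch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```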

Target Network

If the target function changes too frequently, this moving target makes training difficult (the non-stationary target problem). The target network technique updates the parameters $\theta$ of the behavior Q-network at every step, while updating the parameters $\theta^-$ of the target Q-network only sporadically (e.g. every $C$ steps). The loss becomes

$$L(\theta) = \frac{1}{N} \sum_{i \in B} \left( r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-) - Q(s_i, a_i; \theta) \right)^2,$$

where $N$ is the size of a minibatch, and $B$ is an $N$-sized index set drawn from the replay buffer.
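
The sketch below shows this loss and the periodic hard update, assuming a behavior network `q_net` (parameters $\theta$), a frozen copy `target_net` (parameters $\theta^-$), and an update period `C` (all names are assumptions).

```python
# Target-network loss and periodic hard update. q_net holds theta,
# target_net holds theta^-, C is the update period (names are assumptions).
import torch
import torch.nn.functional as F

def td_loss(q_net, target_net, gamma, s, a, r, s_next, done):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Targets come from the slowly-updated parameters theta^-.
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

def maybe_update_target(step: int, C: int, q_net, target_net):
    # Hard update: copy theta into theta^- every C steps.
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```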

Algorithm

  1. Initialize the behavior network $Q(\cdot, \cdot; \theta)$ and the target network $Q(\cdot, \cdot; \theta^-)$ with random weights $\theta^- = \theta$, and the replay buffer $D$ to max size $M$.
  2. Repeat for each episode:
    1. Initialize sequence $s_1$.
    2. Repeat for each step of an episode until terminal, $t = 1, 2, \dots, T$:
      1. With probability $\epsilon$, select a random action $a_t$; otherwise select $a_t = \arg\max_a Q(s_t, a; \theta)$.
      2. Take the action $a_t$ and observe a reward $r_t$ and a next state $s_{t+1}$.
      3. Store transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $D$.
      4. Sample a random minibatch of transitions $\{(s_i, a_i, r_i, s'_i)\}_{i \in B}$ from $D$.
      5. Set $y_i = r_i$ if $s'_i$ is terminal, and $y_i = r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-)$ otherwise.
      6. Perform gradient descent on the loss $\frac{1}{N} \sum_{i \in B} \left( y_i - Q(s_i, a_i; \theta) \right)^2$ with respect to $\theta$.
      7. Update the target network parameters $\theta^- \leftarrow \theta$ every $C$ steps.
      8. Update $t \leftarrow t + 1$.
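
Putting it together, the sketch below follows the algorithm above, reusing `QNetwork`, `ReplayBuffer`, `td_loss`, and `maybe_update_target` from the earlier snippets. A Gymnasium-style environment `env` and the hyperparameter values ($\epsilon$, $\gamma$, $C$, batch size, learning rate) are assumptions, not prescriptions from the text.

```python
# Training-loop sketch following the algorithm above, reusing QNetwork,
# ReplayBuffer, td_loss, and maybe_update_target from earlier snippets.
# The Gymnasium-style env API and all hyperparameter values are assumptions.
import random
import numpy as np
import torch

def train_dqn(env, state_dim, num_actions, episodes=500, gamma=0.99,
              epsilon=0.1, C=1000, batch_size=32, lr=1e-3):
    q_net = QNetwork(state_dim, num_actions)        # behavior network, theta
    target_net = QNetwork(state_dim, num_actions)   # target network, theta^-
    target_net.load_state_dict(q_net.state_dict())  # theta^- <- theta
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer = ReplayBuffer()
    step = 0

    for _ in range(episodes):
        s, _ = env.reset()                          # initialize sequence s_1
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                with torch.no_grad():
                    q = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
                a = int(q.argmax(dim=1).item())

            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            buffer.push(s, a, r, s_next, float(done))  # store transition in D
            s = s_next
            step += 1

            if len(buffer) >= batch_size:
                # Sample a random minibatch and take a gradient step on theta.
                batch = buffer.sample(batch_size)
                states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
                loss = td_loss(
                    q_net, target_net, gamma,
                    torch.as_tensor(states, dtype=torch.float32),
                    torch.as_tensor(actions, dtype=torch.long),
                    torch.as_tensor(rewards, dtype=torch.float32),
                    torch.as_tensor(next_states, dtype=torch.float32),
                    torch.as_tensor(dones, dtype=torch.float32),
                )
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                # theta^- <- theta every C steps.
                maybe_update_target(step, C, q_net, target_net)
    return q_net
```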