Definition

The Quantile-Regression DQN (QR-DQN) algorithm is a Distributional Reinforcement Learning algorithm that approximates the distribution of the random return using quantile regression.

Architecture

QR-DQN estimates a set of $N$ quantiles of the return distribution, where $\hat{\tau}_i = \frac{2i - 1}{2N}$ represents the midpoint of the $i$-th quantile interval $\left[\frac{i-1}{N}, \frac{i}{N}\right]$. This can be seen as adjusting the locations of the supports of a uniform probability mass function to approximate the desired quantile distribution.
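As a concrete illustration, the midpoints follow directly from $N$. The snippet below is a minimal sketch in NumPy; the choice $N = 4$ is arbitrary:

    import numpy as np

    N = 4                                 # number of quantiles (hyperparameter)
    tau = np.arange(N + 1) / N            # cumulative probabilities tau_0, ..., tau_N
    tau_hat = (tau[:-1] + tau[1:]) / 2    # midpoints (2i - 1) / (2N) for i = 1, ..., N
    print(tau_hat)                        # [0.125 0.375 0.625 0.875]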

The quantile distribution with uniform probabilities is constructed as $Z_\theta(s, a) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s, a)}$, where $\delta_z$ is a Dirac delta function at $z$, and $\theta_1(s, a), \dots, \theta_N(s, a)$ are the outputs of the network, representing the estimated quantile values. These values are obtained by applying the inverse CDF $F_Z^{-1}$ of the return distribution to the quantile midpoints $\hat{\tau}_i$, i.e., $\theta_i(s, a) = F_Z^{-1}(\hat{\tau}_i)$.

Using the estimated quantile values $F_Z^{-1}(\hat{\tau}_i)$ as the support minimizes the 1-Wasserstein distance $W_1(Z, Z_\theta)$ between the true return distribution $Z$ and the estimated quantile distribution $Z_\theta$.
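To make this concrete, the sketch below projects a known distribution onto $N$ Dirac atoms by evaluating its inverse CDF at the midpoints. The standard normal and scipy.stats.norm are illustrative assumptions, not part of the algorithm:

    import numpy as np
    from scipy.stats import norm

    N = 8
    tau_hat = (2 * np.arange(1, N + 1) - 1) / (2 * N)
    theta = norm.ppf(tau_hat)   # inverse CDF at the midpoints: Wasserstein-optimal atoms
    # Z_theta = (1/N) * sum_i delta_{theta[i]}; each atom carries probability mass 1/N
    print(np.round(theta, 3))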

Quantile Regression

Given a data set $\{x_1, \dots, x_m\}$, a $\tau$-quantile $q$ minimizes the loss $\mathcal{L}(q) = \mathbb{E}_x\left[\rho_\tau(x - q)\right]$, where $\rho_\tau(u) = u\left(\tau - \mathbb{1}\{u < 0\}\right)$ is the quantile loss function.
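As a quick numerical check of this claim, the sketch below evaluates the quantile loss on synthetic data over a grid of candidate values $q$ and confirms the minimizer sits near the empirical $\tau$-quantile; the normal data and $\tau = 0.25$ are arbitrary illustration choices:

    import numpy as np

    def quantile_loss(q, x, tau):
        # rho_tau(u) = u * (tau - 1{u < 0}), averaged over the data, with u = x - q
        u = x - q
        return np.mean(u * (tau - (u < 0).astype(float)))

    x = np.random.default_rng(0).normal(size=10_000)
    tau = 0.25
    grid = np.linspace(-2.0, 2.0, 401)
    q_star = grid[np.argmin([quantile_loss(q, x, tau) for q in grid])]
    print(q_star, np.quantile(x, tau))    # both close to the true 0.25-quantile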

The quantile values are estimated by minimizing the quantile Huber loss function. Given a transition $(s, a, r, s')$, the loss is defined as $\mathcal{L}(\theta) = \sum_{i=1}^{N} \mathbb{E}_j\left[\rho^\kappa_{\hat{\tau}_i}\left(\mathcal{T}\theta_j - \theta_i(s, a)\right)\right]$, where (see the code sketch after this list):

  • $\mathcal{T}\theta_j = r + \gamma\,\theta^-_j(s', a^*)$ and $a^* = \arg\max_{a'} \frac{1}{N} \sum_{j=1}^{N} \theta^-_j(s', a')$.
  • $\rho^\kappa_{\hat{\tau}}(u) = \left|\hat{\tau} - \mathbb{1}\{u < 0\}\right| \frac{L_\kappa(u)}{\kappa}$, where $L_\kappa$ is the Huber loss: $L_\kappa(u) = \frac{1}{2}u^2$ if $|u| \le \kappa$, and $L_\kappa(u) = \kappa\left(|u| - \frac{1}{2}\kappa\right)$ otherwise.
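A sketch of this loss for a single state-action pair, assuming PyTorch, a vector theta of current quantile estimates $\theta_i(s, a)$, and a vector target of target values $\mathcal{T}\theta_j$ (both of length $N$); kappa = 1.0 is a common default, not prescribed here:

    import torch

    def quantile_huber_loss(theta, target, kappa=1.0):
        N = theta.shape[0]
        tau_hat = (2 * torch.arange(N, dtype=theta.dtype, device=theta.device) + 1) / (2 * N)
        u = target.unsqueeze(0) - theta.unsqueeze(1)        # pairwise TD errors u_ij, [N, N]
        huber = torch.where(u.abs() <= kappa,
                            0.5 * u.pow(2),
                            kappa * (u.abs() - 0.5 * kappa))
        weight = (tau_hat.unsqueeze(1) - (u < 0).float()).abs()   # |tau_hat_i - 1{u < 0}|
        return (weight * huber / kappa).mean(dim=1).sum()   # mean over j, sum over i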

Algorithm

  1. Initialize the behavior network $Z_\theta$ and the target network $Z_{\theta^-}$ with random weights $\theta$, set $\theta^- = \theta$, and choose the sample size $N$ (the number of quantiles).
  2. Repeat for each episode:
    1. Initialize the start state $s_0$.
    2. Repeat for each step of an episode until terminal, $t = 0, 1, 2, \dots$:
      1. With probability $\epsilon$, select a random action $a_t$; otherwise select $a_t = \arg\max_a \frac{1}{N} \sum_{i=1}^{N} \theta_i(s_t, a)$.
      2. Take the action $a_t$ and observe a reward $r_t$ and a next state $s_{t+1}$.
      3. Store the transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $D$.
      4. Sample a random transition $(s, a, r, s')$ from $D$.
      5. Select a greedy action $a^* = \arg\max_{a'} \frac{1}{N} \sum_{j=1}^{N} \theta^-_j(s', a')$.
      6. Compute the target quantile values $\mathcal{T}\theta_j = r + \gamma\,\theta^-_j(s', a^*)$.
      7. Perform gradient descent on the loss $\sum_{i=1}^{N} \mathbb{E}_j\left[\rho^\kappa_{\hat{\tau}_i}\left(\mathcal{T}\theta_j - \theta_i(s, a)\right)\right]$ (a code sketch of steps 4-7 follows the algorithm).
      8. Update the target network parameters $\theta^- \leftarrow \theta$ every $C$ steps.
      9. Update $t \leftarrow t + 1$.
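Below is a condensed sketch of one update (steps 4-7), assuming PyTorch, a network net(s) returning quantiles of shape [batch, actions, N], a target copy target_net, a float done mask, and the quantile_huber_loss sketched earlier; all names and shapes are illustrative, not a definitive implementation:

    import torch

    def qr_dqn_update(net, target_net, optimizer, batch, gamma=0.99, kappa=1.0):
        s, a, r, s_next, done = batch                   # minibatch sampled from D
        theta = net(s)[torch.arange(len(a)), a]         # theta_i(s, a), shape [B, N]
        with torch.no_grad():
            next_theta = target_net(s_next)             # [B, A, N]
            a_star = next_theta.mean(dim=2).argmax(1)   # a* = argmax_a (1/N) sum_j theta_j(s', a)
            target = r.unsqueeze(1) + gamma * (1 - done.unsqueeze(1)) \
                     * next_theta[torch.arange(len(a)), a_star]   # T theta_j, [B, N]
        loss = torch.stack([quantile_huber_loss(theta[b], target[b], kappa)
                            for b in range(len(a))]).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()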