Definition

The Implicit Quantile Network (IQN) algorithm is a Distributional Reinforcement Learning algorithm that implicitly estimates the quantiles of the return distribution by learning a function that maps a quantile fraction $\tau \in [0, 1]$ to the corresponding quantile value $Z_\tau(s, a)$.
Architecture
IQN estimates the quantile value $Z_\tau(s, a)$ for a given state $s$, action $a$, and quantile fraction $\tau \in [0, 1]$. The output is the estimated quantile value corresponding to that fraction $\tau$.
The input state $s$ is processed by the encoder layers of the neural network, $\psi$, to produce a state embedding vector $\psi(s)$.
The quantile fraction $\tau$ is embedded into a higher-dimensional vector $\phi(\tau)$ using a set of cosine basis functions, $\phi_j(\tau) = \mathrm{ReLU}\left(\sum_{i=0}^{n-1} \cos(\pi i \tau)\, w_{ij} + b_j\right)$, whose dimension is the same as that of the state embedding $\psi(s)$. The two embeddings, the state embedding and the quantile embedding, are combined using the Hadamard product $\psi(s) \odot \phi(\tau)$.
The combined embedding is fed into further layers $f$ to predict the quantile value.
In summary, the quantile value for a given state-action pair is estimated by

$$Z_\tau(s, a) \approx f\left(\psi(s) \odot \phi(\tau)\right)_a$$

By sampling different quantile fractions $\tau$, IQN can approximate the entire return distribution. The expectation of the return can be approximated by averaging the estimated quantile values over the sampled quantile fractions, $Q(s, a) \approx \frac{1}{K} \sum_{k=1}^{K} Z_{\tau_k}(s, a)$, where $\tau_1, \dots, \tau_K \sim U([0, 1])$ are sampled quantile fractions.
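The following is a minimal sketch of this architecture in PyTorch, assuming a flat state vector; the layer sizes and the names `state_dim`, `n_actions`, `embed_dim`, and `n_cos` are illustrative choices, not prescribed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IQN(nn.Module):
    """Minimal IQN sketch: psi (state encoder), phi (cosine quantile
    embedding), and f (head), combined via a Hadamard product."""

    def __init__(self, state_dim, n_actions, embed_dim=64, n_cos=64):
        super().__init__()
        self.n_cos = n_cos
        # psi: encoder producing the state embedding psi(s)
        self.psi = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU())
        # phi: linear layer applied to the cosine features of tau
        self.phi = nn.Linear(n_cos, embed_dim)
        # f: further layers mapping the combined embedding to per-action quantile values
        self.f = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                               nn.Linear(64, n_actions))

    def forward(self, state, taus):
        # state: (batch, state_dim); taus: (batch, n_taus)
        s_embed = self.psi(state)                            # (batch, embed_dim)
        # Cosine basis features cos(pi * i * tau), i = 0 .. n_cos - 1
        i = torch.arange(self.n_cos, device=taus.device, dtype=torch.float32)
        cos = torch.cos(torch.pi * i.view(1, 1, -1) * taus.unsqueeze(-1))
        tau_embed = F.relu(self.phi(cos))                    # (batch, n_taus, embed_dim)
        # Hadamard product of the state and quantile embeddings
        combined = s_embed.unsqueeze(1) * tau_embed          # (batch, n_taus, embed_dim)
        return self.f(combined)                              # (batch, n_taus, n_actions)
```

With such a network, the approximate Q-values used for action selection can be obtained as `net(state, torch.rand(batch, K)).mean(dim=1)`.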
Loss Function
IQN uses a quantile Huber loss function, similar to QR-DQN. Given a transition $(s, a, r, s')$, the loss is defined as

$$\mathcal{L}(\theta) = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho^\kappa_{\tau_i}\left(\delta^{\tau_i, \tau'_j}\right)$$

Where:
- $\tau_i \sim U([0, 1])$, $i = 1, \dots, N$, and $\tau'_j \sim U([0, 1])$, $j = 1, \dots, N'$, are quantile fractions for the current state-action pair and the next state, respectively.
- $\delta^{\tau_i, \tau'_j} = r + \gamma Z_{\tau'_j}(s', a^*) - Z_{\tau_i}(s, a)$ is the pairwise TD error, and $a^* = \arg\max_{a'} \frac{1}{N'} \sum_{j=1}^{N'} Z_{\tau'_j}(s', a')$ is the greedy next action.
- $\rho^\kappa_\tau(\delta) = \left|\tau - \mathbb{1}\{\delta < 0\}\right| \frac{\mathcal{L}_\kappa(\delta)}{\kappa}$, where $\mathcal{L}_\kappa$ is the Huber loss with threshold $\kappa$.
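Below is a sketch of this loss in PyTorch, assuming `pred` holds the current quantile values $Z_{\tau_i}(s, a)$ and `target` holds the target values $y_j = r + \gamma Z_{\tau'_j}(s', a^*)$, both computed elsewhere.

```python
import torch

def quantile_huber_loss(pred, target, taus, kappa=1.0):
    # pred: (batch, N), target: (batch, N'), taus: (batch, N)
    # Pairwise TD errors delta_ij = target_j - pred_i: (batch, N, N')
    delta = target.unsqueeze(1) - pred.unsqueeze(2)
    # Huber loss L_kappa(delta)
    huber = torch.where(delta.abs() <= kappa,
                        0.5 * delta.pow(2),
                        kappa * (delta.abs() - 0.5 * kappa))
    # Asymmetric quantile weight |tau_i - 1{delta < 0}|
    weight = (taus.unsqueeze(2) - (delta.detach() < 0).float()).abs()
    # Mean over target quantiles j, sum over current quantiles i, mean over batch
    return (weight * huber / kappa).mean(dim=2).sum(dim=1).mean()
```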
Risk-Sensitive Policy

Because IQN estimates the distribution of returns, it has access to information about the entire distribution rather than only its expectation. This property makes it suitable for implementing risk-sensitive policies, which base decisions not just on the expected return but also on the variability or uncertainty of the returns. IQN makes this feasible by providing estimated quantiles of the return distribution.
A function $\beta: [0, 1] \to [0, 1]$ is used to define a distortion risk measure that focuses on specific parts of the return distribution. By applying $\beta$ to the quantile fraction $\tau$, we can re-weight each outcome or change the sampling distribution.

These are examples of distortion measures (see the code sketch after the list):
- Cumulative Probability Weighting Parametrization (CPW): $\beta(\tau; \eta) = \frac{\tau^\eta}{\left(\tau^\eta + (1 - \tau)^\eta\right)^{1/\eta}}$
- Wang: $\beta(\tau; \eta) = \Phi\left(\Phi^{-1}(\tau) + \eta\right)$, where $\Phi$ is the CDF of the standard normal distribution.
- Power Formula: $\beta(\tau; \eta) = \tau^{\frac{1}{1 + |\eta|}}$ if $\eta \ge 0$, and $\beta(\tau; \eta) = 1 - (1 - \tau)^{\frac{1}{1 + |\eta|}}$ otherwise.
- Conditional Value at Risk (CVaR): $\beta(\tau; \eta) = \eta \tau$
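These distortion functions are straightforward to implement. The sketch below shows them in PyTorch together with risk-sensitive fraction sampling; the default $\eta$ values are illustrative, and `net` refers to the hypothetical `IQN` module sketched earlier.

```python
import torch

def cpw(tau, eta=0.71):
    return tau.pow(eta) / (tau.pow(eta) + (1 - tau).pow(eta)).pow(1 / eta)

def wang(tau, eta=-0.75):
    normal = torch.distributions.Normal(0.0, 1.0)
    return normal.cdf(normal.icdf(tau) + eta)

def power(tau, eta=-2.0):
    exponent = 1 / (1 + abs(eta))
    return tau.pow(exponent) if eta >= 0 else 1 - (1 - tau).pow(exponent)

def cvar(tau, eta=0.25):
    # Restricts attention to the lower eta-fraction of the return distribution
    return eta * tau

# Risk-sensitive action selection: distort uniformly sampled fractions
# before querying the network.
taus = cvar(torch.rand(1, 32))           # beta(tau) with eta = 0.25
# q = net(state, taus).mean(dim=1)       # risk-sensitive Q-values
# action = q.argmax(dim=1)
```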
Algorithm
- Initialize the behavior quantile network $Z_\theta$ and the target quantile network $Z_{\theta^-}$ with random weights $\theta$ and $\theta^- = \theta$, the sample sizes $N$, $N'$, and $K$, and a distortion measure $\beta$.
- Repeat for each episode:
  - Initialize sequence $s_0$.
  - Repeat for each step of the episode until terminal, $t = 0, 1, 2, \dots$:
    - Sample $K$ quantile fractions under the distortion measure $\beta$, $\tilde{\tau}_k = \beta(\tau_k)$ with $\tau_k \sim U([0, 1])$, $k = 1, \dots, K$, and select a greedy action $a_t = \arg\max_a \frac{1}{K} \sum_{k=1}^{K} Z_{\tilde{\tau}_k}(s_t, a; \theta)$.
    - Take the action $a_t$ and observe a reward $r_t$ and a next state $s_{t+1}$.
    - Store the transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $D$.
    - Sample a random transition $(s, a, r, s')$ from $D$.
    - Sample $N'$ quantile fractions $\tau'_j \sim U([0, 1])$ for the next state and select a greedy action $a^* = \arg\max_{a'} \frac{1}{N'} \sum_{j=1}^{N'} Z_{\tau'_j}(s', a'; \theta^-)$.
    - Compute the target quantile values $y_j = r + \gamma Z_{\tau'_j}(s', a^*; \theta^-)$, $j = 1, \dots, N'$.
    - Sample $N$ quantile fractions $\tau_i \sim U([0, 1])$ and perform gradient descent on the loss $\frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho^\kappa_{\tau_i}\left(y_j - Z_{\tau_i}(s, a; \theta)\right)$.
    - Update the target network parameters $\theta^- \leftarrow \theta$ every $C$ steps.
    - Update $t \leftarrow t + 1$.
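Putting the steps together, one update of this loop might look like the following sketch, which reuses the `IQN` and `quantile_huber_loss` sketches above; replay-buffer handling, exploration, and the periodic target-network sync are simplified or omitted, and the names follow the notation in the pseudocode.

```python
import torch

def update(net, target_net, optimizer, batch, N=8, N_prime=8, gamma=0.99):
    # s: (batch, state_dim) float, a: (batch,) long, r/done: (batch,) float
    s, a, r, s_next, done = batch

    with torch.no_grad():
        # Greedy next action a* from the mean over N' target quantile values
        taus_prime = torch.rand(s.shape[0], N_prime)
        z_next = target_net(s_next, taus_prime)         # (batch, N', n_actions)
        a_star = z_next.mean(dim=1).argmax(dim=1)       # (batch,)
        z_next_a = z_next.gather(
            2, a_star.view(-1, 1, 1).expand(-1, N_prime, 1)).squeeze(2)
        # Target quantile values y_j = r + gamma * Z_{tau'_j}(s', a*)
        target = r.unsqueeze(1) + gamma * (1 - done.unsqueeze(1)) * z_next_a

    # Current quantile values Z_{tau_i}(s, a) for N sampled fractions
    taus = torch.rand(s.shape[0], N)
    pred = net(s, taus).gather(
        2, a.view(-1, 1, 1).expand(-1, N, 1)).squeeze(2)

    loss = quantile_huber_loss(pred, target, taus)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```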