Definition

The C51 algorithm is a Distributional Reinforcement Learning algorithm that approximates the distribution of the random return $Z(s, a)$ using a discretized (categorical) distribution over a fixed support.
Architecture
The range of possible returns is divided into a fixed support of $N$ equally spaced bins (atoms) $z_i = V_{\min} + i \, \Delta z$ for $i \in \{0, \dots, N-1\}$, where $\Delta z = \frac{V_{\max} - V_{\min}}{N - 1}$, and $V_{\min}$ and $V_{\max}$ are the minimum and maximum possible return values, respectively. In the C51 algorithm, the support is determined once at the beginning and remains fixed throughout the entire learning process.
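As a small illustration, the support can be built in a few lines. The hyperparameters below ($N = 51$, $V_{\min} = -10$, $V_{\max} = 10$) are the values commonly used for Atari in the original paper and are only meant as an example.

```python
import torch

# Example hyperparameters; N = 51 atoms is what gives C51 its name.
N, V_MIN, V_MAX = 51, -10.0, 10.0

delta_z = (V_MAX - V_MIN) / (N - 1)        # spacing between neighbouring atoms
support = torch.linspace(V_MIN, V_MAX, N)  # z_i = V_min + i * delta_z, i = 0, ..., N-1
```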
For each state-action pair $(s, a)$, the algorithm estimates the probabilities $p_i(s, a)$ that the return will fall into the $i$-th bin. The probabilities are estimated by a neural network, which takes the state $s$ as input and outputs a vector of probabilities for every possible action $a$. The Softmax Function ensures that the output for each state-action pair is a valid probability distribution:
$$p_i(s, a; \theta) = \frac{e^{\theta_i(s, a)}}{\sum_{j} e^{\theta_j(s, a)}},$$
where $\theta_i(s, a)$ are the raw outputs (logits) of the network for the $i$-th atom and action $a$, and $\theta$ represents the network's parameters.
The discrete distribution over the fixed support is constructed as
$$Z_\theta(s, a) = \sum_{i=0}^{N-1} p_i(s, a; \theta) \, \delta_{z_i},$$
where $\delta_{z_i}$ is a Dirac delta function at $z = z_i$.
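A minimal PyTorch sketch of such a network is given below; the class name, hidden layer size, and the `state_dim`/`num_actions` arguments are illustrative assumptions, not part of the original architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalQNetwork(nn.Module):
    """Maps a state to a categorical distribution over N atoms for every action."""

    def __init__(self, state_dim, num_actions, num_atoms=51, hidden=128):
        super().__init__()
        self.num_actions = num_actions
        self.num_atoms = num_atoms
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions * num_atoms),  # logits theta_i(s, a)
        )

    def forward(self, state):
        logits = self.body(state).view(-1, self.num_actions, self.num_atoms)
        # Softmax over the atom dimension yields p_i(s, a) >= 0 with sum_i p_i = 1.
        return F.softmax(logits, dim=-1)
```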
Projection

In the policy evaluation process, discrete distributions pose a problem: the Bellman-updated distribution $\hat{\mathcal{T}} Z_{\theta}(s, a) = r + \gamma Z_{\theta}(s', a')$ and the parameterized distribution $Z_\theta(s, a)$ generally have disjoint supports. The Wasserstein Metric used in the original theoretical analysis is robust to this issue, but in practice the KL-Divergence is used instead, due to the instability of minimizing the Wasserstein loss from samples and the easy differentiability of the KL-Divergence. Since the KL-Divergence is not well defined for distributions with disjoint supports, the problem still matters.
The disjoint-support problem is solved by projecting the Bellman-updated distribution back onto the fixed support:
- Compute the distributional Bellman update for each atom: $\hat{\mathcal{T}} z_j = \left[ r + \gamma z_j \right]_{V_{\min}}^{V_{\max}}$, where $[\cdot]_{V_{\min}}^{V_{\max}}$ denotes clipping to the interval $[V_{\min}, V_{\max}]$.
- Distribute the computed probability mass to the neighboring bins in the support proportionally to the distance from each original point: $\left( \Phi \hat{\mathcal{T}} Z_{\theta}(s, a) \right)_i = \sum_{j=0}^{N-1} \left[ 1 - \frac{\left| \hat{\mathcal{T}} z_j - z_i \right|}{\Delta z} \right]_0^1 p_j(s', a^*)$, where $\Phi$ is the projection operator (see the code sketch after this list).
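A sketch of the projection under the definitions above, assuming PyTorch tensors, the `support` from the Architecture section, and a `dones` flag that zeroes out the bootstrap term on terminal transitions (an implementation detail not spelled out in the text):

```python
import torch

def project_distribution(next_probs, rewards, dones, support, gamma=0.99):
    """Projects the Bellman-updated distribution back onto the fixed support (operator Phi)."""
    batch_size, num_atoms = next_probs.shape
    v_min, v_max = support[0].item(), support[-1].item()
    delta_z = (v_max - v_min) / (num_atoms - 1)

    # Bellman update of every atom, clipped to [V_min, V_max]; terminal states keep only r.
    tz = rewards.unsqueeze(1) + gamma * (1.0 - dones.unsqueeze(1)) * support.unsqueeze(0)
    tz = tz.clamp(v_min, v_max)

    # Fractional position of each updated atom on the support grid.
    b = (tz - v_min) / delta_z
    lower = b.floor().long()
    upper = b.ceil().long()

    # Split each atom's mass between its two neighbouring bins, proportionally to distance.
    projected = torch.zeros_like(next_probs)
    lower_mass = next_probs * (upper.float() - b)
    upper_mass = next_probs * (b - lower.float())
    # When b lands exactly on a bin (lower == upper), give that bin the full mass.
    lower_mass = lower_mass + next_probs * (lower == upper).float()
    projected.scatter_add_(1, lower, lower_mass)
    projected.scatter_add_(1, upper, upper_mass)
    return projected
```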
Loss Function
C51 uses the KL-Divergence as a loss function:
$$\mathcal{L}(\theta) = D_{\mathrm{KL}}\!\left( \Phi \hat{\mathcal{T}} Z_{\theta^-}(s, a) \,\middle\|\, Z_\theta(s, a) \right),$$
where $\theta$ and $\theta^-$ are the parameters of the behavior and target networks, respectively.
Since the projected target distribution does not depend on $\theta$, the loss function can be simplified to the Cross-Entropy Loss $\mathcal{L}(\theta) = -\sum_{i} m_i \log p_i(s, a; \theta)$, where $m_i = \left( \Phi \hat{\mathcal{T}} Z_{\theta^-}(s, a) \right)_i$.
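A minimal sketch of that simplification, assuming `pred_probs` holds $p_i(s, a; \theta)$ for the taken actions and `target_probs` holds the projected target $m$ (a constant with respect to $\theta$):

```python
import torch

def c51_loss(pred_probs, target_probs, eps=1e-8):
    """Cross-entropy between the projected target m and the predicted distribution p_theta."""
    # target_probs carries no gradient, so minimizing the KL reduces to this cross-entropy.
    return -(target_probs * torch.log(pred_probs + eps)).sum(dim=1).mean()
```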
Algorithm
- Initialize the behavior network $Z_\theta$ and the target network $Z_{\theta^-}$ with random weights $\theta = \theta^-$, and the replay buffer $D$ with maximum size $M$.
- Repeat for each episode:
- Initialize the first state $s_1$.
- Repeat for each step of an episode until terminal, $t = 1, \dots, T$:
- With probability $\epsilon$, select a random action $a_t$; otherwise select $a_t = \arg\max_a \sum_i z_i \, p_i(s_t, a; \theta)$.
- Take the action $a_t$ and observe a reward $r_t$ and a next state $s_{t+1}$.
- Store the transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $D$.
- Sample a random transition $(s, a, r, s')$ from $D$.
- Compute $Q(s', a) = \sum_i z_i \, p_i(s', a; \theta^-)$ and select a greedy action $a^* = \arg\max_a Q(s', a)$.
- Perform Gradient Descent on the loss $\mathcal{L}(\theta) = -\sum_i m_i \log p_i(s, a; \theta)$, where $m = \Phi \hat{\mathcal{T}} Z_{\theta^-}(s, a)$.
- Update the target network parameters $\theta^- \leftarrow \theta$ every $C$ steps.
- Update the state: $s_t \leftarrow s_{t+1}$.
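Putting the pieces together, the sketch below shows one possible update step that reuses the hypothetical helpers above (`CategoricalQNetwork`, `project_distribution`, `c51_loss`, `support`). It samples a minibatch rather than a single transition, and the batch size and discount factor are illustrative assumptions.

```python
import random

import torch

def train_step(behavior_net, target_net, optimizer, replay_buffer, support,
               batch_size=32, gamma=0.99):
    """One gradient step of the sketched C51 update."""
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
    next_states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in next_states])
    actions = torch.as_tensor(actions, dtype=torch.long)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    with torch.no_grad():
        # Greedy next action from expected returns: Q(s', a) = sum_i z_i * p_i(s', a; theta^-).
        next_dist = target_net(next_states)                        # (B, A, N)
        next_q = (next_dist * support).sum(dim=-1)                 # (B, A)
        a_star = next_q.argmax(dim=1)
        next_probs = next_dist[torch.arange(batch_size), a_star]   # (B, N)
        target = project_distribution(next_probs, rewards, dones, support, gamma)

    # Predicted distribution of the taken actions under the behavior network.
    pred = behavior_net(states)[torch.arange(batch_size), actions]
    loss = c51_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full agent this step would be called inside the episode loop above, alongside $\epsilon$-greedy action selection and the periodic copy $\theta^- \leftarrow \theta$.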