Definition

Policy Evaluation

Compute the return distribution $Z^{\pi}$ of the current policy $\pi$.
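
Concretely, the distributional Bellman operator can be written as follows (standard notation; the reward $R(s,a)$, next state $S'$, and next action $A'$ are random, and the equality holds in distribution):

$$\mathcal{T}^{\pi} Z(s,a) \overset{D}{=} R(s,a) + \gamma\, Z(S', A'), \qquad S' \sim P(\cdot \mid s,a),\ \ A' \sim \pi(\cdot \mid S').$$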

The distributional Bellman operator $\mathcal{T}^{\pi}$ for a fixed policy $\pi$ is a $\gamma$-contraction map in the maximal form of the Wasserstein metric.
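
For reference, the maximal $p$-Wasserstein metric takes the supremum of the per-state-action Wasserstein distances:

$$\bar{d}_p(Z_1, Z_2) := \sup_{s,a} W_p\big(Z_1(s,a),\, Z_2(s,a)\big),$$

and the contraction property reads $\bar{d}_p(\mathcal{T}^{\pi} Z_1,\, \mathcal{T}^{\pi} Z_2) \le \gamma\, \bar{d}_p(Z_1, Z_2)$.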

Thus, repeated application of $\mathcal{T}^{\pi}$ converges to a unique fixed point $Z^{\pi}$ (by the contraction mapping lemma).
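
A minimal sketch of this iteration, assuming a small tabular MDP with a deterministic reward table and categorical (fixed-support) return distributions; all names here (`P`, `R`, `pi`, `project`) are hypothetical, and the projection step follows the usual C51 scheme:

```python
import numpy as np

# Hypothetical tabular MDP: S states, A actions, fixed policy pi.
S, A, gamma = 4, 2, 0.9
N = 51                                   # number of atoms (as in C51)
z = np.linspace(-10.0, 10.0, N)          # common support of all distributions
dz = z[1] - z[0]

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition probabilities P[s, a, s']
R = rng.normal(size=(S, A))                  # deterministic reward table R[s, a]
pi = np.full((S, A), 1.0 / A)                # fixed (uniform) policy pi[s, a]

def project(shifted_support, probs):
    """Project a shifted categorical distribution back onto the fixed support z."""
    out = np.zeros(N)
    b = np.clip((shifted_support - z[0]) / dz, 0, N - 1)  # fractional atom index
    l, u = np.floor(b).astype(int), np.ceil(b).astype(int)
    np.add.at(out, l, probs * (u - b + (l == u)))  # split mass between neighbors;
    np.add.at(out, u, probs * (b - l))             # (l == u) keeps mass on exact hits
    return out

# Z[s, a] is a categorical distribution over the support z.
Z = np.full((S, A, N), 1.0 / N)

for _ in range(200):                     # repeated application of T^pi
    Z_new = np.zeros_like(Z)
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                for a2 in range(A):
                    w = P[s, a, s2] * pi[s2, a2]   # prob. of next pair (s', a')
                    if w > 0:
                        Z_new[s, a] += w * project(R[s, a] + gamma * z, Z[s2, a2])
    Z = Z_new

Q = (Z * z).sum(axis=-1)                 # expected returns recovered from Z
print(Q)
```

Because the operator contracts, successive `Z` iterates stop changing after enough sweeps, and the recovered `Q` matches ordinary tabular policy evaluation.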

Control (Policy Improvement)

Seek a greedy policy based on the return distribution $Z$, i.e. one that maximizes the expectation of $Z(s,a)$ (the ordinary action value $Q(s,a) = \mathbb{E}\big[Z(s,a)\big]$).
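
In symbols:

$$\pi'(s) \in \arg\max_{a} \mathbb{E}\big[Z(s,a)\big] = \arg\max_{a} Q(s,a).$$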

The distributional Bellman optimality operator is not a contraction map; it is not even continuous in any metric, which causes distributional instability. To overcome this problem, discrete action-value distributions are used to approximate the true distribution $Z(s,a)$.
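
A common discrete parametrization (the one used by C51; the bounds $V_{\min}$, $V_{\max}$ and atom count $N$ are hyperparameters) places probability mass $p_i(s,a)$ on a fixed support of atoms:

$$Z_{\theta}(s,a) = \sum_{i=1}^{N} p_i(s,a)\, \delta_{z_i}, \qquad z_i = V_{\min} + (i-1)\,\Delta z, \quad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}.$$

Since the Bellman target $r + \gamma z_i$ generally falls off this support, it is projected back onto the atoms (as in the `project` helper of the sketch above) before the update.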