Definition
Policy Evaluation
Compute the return distribution $Z^{\pi}$ from the current policy $\pi$.
The distributional Bellman operator $\mathcal{T}^{\pi}$ for a fixed policy $\pi$ is a contraction map in the maximal form of the Wasserstein metric (written out below).
Thus, the iteration $Z_{k+1} = \mathcal{T}^{\pi} Z_k$ converges to a unique fixed point $Z^{\pi}$ (by the contraction mapping theorem).
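Written out, with $R$ the reward, $\gamma$ the discount factor, and $P$ the transition kernel (standard notation, assumed here), the operator and the maximal Wasserstein metric are:

$$
\mathcal{T}^{\pi} Z(x, a) \stackrel{D}{=} R(x, a) + \gamma Z(X', A'), \qquad X' \sim P(\cdot \mid x, a), \; A' \sim \pi(\cdot \mid X'),
$$

$$
\bar{d}_p(Z_1, Z_2) = \sup_{x, a} d_p\bigl(Z_1(x, a), Z_2(x, a)\bigr),
$$

and the contraction property reads $\bar{d}_p(\mathcal{T}^{\pi} Z_1, \mathcal{T}^{\pi} Z_2) \le \gamma \, \bar{d}_p(Z_1, Z_2)$.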
Control (Policy Improvement)
Seek a greedy policy $\pi'$ based on the current return distribution $Z$, choosing in each state the action that maximizes the expectation of $Z(x, a)$.
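In symbols, writing $Q(x, a) = \mathbb{E}[Z(x, a)]$ for the induced action-value function, the greedy step is:

$$
\pi'(x) \in \arg\max_{a} \mathbb{E}\bigl[Z(x, a)\bigr] = \arg\max_{a} Q(x, a).
$$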
The distributional Bellman optimality operator $\mathcal{T}$ is not a contraction map, and is not even continuous in any metric, which causes distributional instability. To work around this, discrete (categorical) action-value distributions are used to approximate the true distribution $Z(x, a)$, as sketched below.
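A minimal sketch of this discrete approximation, assuming a fixed categorical support as in C51-style algorithms; the support bounds `V_MIN`/`V_MAX`, the atom count, and the helper name `categorical_projection` are illustrative choices, not taken from the text:

```python
import numpy as np

# Illustrative fixed support: 51 atoms spread evenly over [V_MIN, V_MAX].
V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51
z = np.linspace(V_MIN, V_MAX, N_ATOMS)   # atom locations z_1, ..., z_N
dz = z[1] - z[0]                         # spacing between atoms

def categorical_projection(probs, reward, gamma):
    """Apply one distributional Bellman backup to a categorical distribution
    `probs` over the atoms `z`, then project the shifted atoms r + gamma*z
    back onto the fixed support by splitting mass between neighbors."""
    tz = np.clip(reward + gamma * z, V_MIN, V_MAX)  # backed-up atom positions
    b = (tz - V_MIN) / dz                           # fractional indices into z
    l = np.floor(b).astype(int)
    u = np.ceil(b).astype(int)
    # When b lands exactly on an atom (l == u), shift one bound so the
    # atom's full probability mass is still assigned.
    l[(l == u) & (u > 0)] -= 1
    u[(l == u) & (l < N_ATOMS - 1)] += 1
    out = np.zeros(N_ATOMS)
    np.add.at(out, l, probs * (u - b))  # mass to the lower neighbor
    np.add.at(out, u, probs * (b - l))  # mass to the upper neighbor
    return out

# Example: back up a uniform distribution through reward 1.0, discount 0.99.
p = np.full(N_ATOMS, 1.0 / N_ATOMS)
p_next = categorical_projection(p, reward=1.0, gamma=0.99)
assert abs(p_next.sum() - 1.0) < 1e-9   # projection preserves total mass
```

The two `np.add.at` calls distribute each shifted atom's probability mass to its two neighboring support points, which is what keeps the backed-up distribution inside the fixed discrete family.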