Definition

Policy Evaluation

Compute the return distribution $Z^{\pi}$ of the current policy $\pi$.
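
Concretely, the distributional Bellman operator can be written as follows (standard notation; the reward $R(s,a)$, next state $S'$, and next action $A'$ are random, and the equality holds in distribution):

$$\mathcal{T}^{\pi} Z(s,a) \overset{D}{=} R(s,a) + \gamma\, Z(S', A'), \qquad S' \sim P(\cdot \mid s,a),\ \ A' \sim \pi(\cdot \mid S').$$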

The distributional Bellman operator $\mathcal{T}^{\pi}$ for a fixed policy $\pi$ is a $\gamma$-contraction map in the maximal form of the Wasserstein metric.
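
For reference, the maximal $p$-Wasserstein metric takes the supremum of the per-state-action Wasserstein distances:

$$\bar{d}_p(Z_1, Z_2) := \sup_{s,a} W_p\big(Z_1(s,a),\, Z_2(s,a)\big),$$

and the contraction property reads $\bar{d}_p(\mathcal{T}^{\pi} Z_1,\, \mathcal{T}^{\pi} Z_2) \le \gamma\, \bar{d}_p(Z_1, Z_2)$.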

Thus, repeated application of $\mathcal{T}^{\pi}$ converges to a unique fixed point $Z^{\pi}$ (by the contraction mapping lemma).
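
A minimal sketch of this iteration, assuming a small tabular MDP with a deterministic reward table and categorical (fixed-support) return distributions; all names here (`P`, `R`, `pi`, `project`) are hypothetical, and the projection step follows the usual C51 scheme:

```python
import numpy as np

# Hypothetical tabular MDP: S states, A actions, fixed policy pi.
S, A, gamma = 4, 2, 0.9
N = 51                                   # number of atoms (as in C51)
z = np.linspace(-10.0, 10.0, N)          # common support of all distributions
dz = z[1] - z[0]

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition probabilities P[s, a, s']
R = rng.normal(size=(S, A))                  # deterministic reward table R[s, a]
pi = np.full((S, A), 1.0 / A)                # fixed (uniform) policy pi[s, a]

def project(shifted_support, probs):
    """Project a shifted categorical distribution back onto the fixed support z."""
    out = np.zeros(N)
    b = np.clip((shifted_support - z[0]) / dz, 0, N - 1)  # fractional atom index
    l, u = np.floor(b).astype(int), np.ceil(b).astype(int)
    np.add.at(out, l, probs * (u - b + (l == u)))  # split mass between neighbors;
    np.add.at(out, u, probs * (b - l))             # (l == u) keeps mass on exact hits
    return out

# Z[s, a] is a categorical distribution over the support z.
Z = np.full((S, A, N), 1.0 / N)

for _ in range(200):                     # repeated application of T^pi
    Z_new = np.zeros_like(Z)
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                for a2 in range(A):
                    w = P[s, a, s2] * pi[s2, a2]   # prob. of next pair (s', a')
                    if w > 0:
                        Z_new[s, a] += w * project(R[s, a] + gamma * z, Z[s2, a2])
    Z = Z_new

Q = (Z * z).sum(axis=-1)                 # expected returns recovered from Z
print(Q)
```

Because the operator contracts, successive `Z` iterates stop changing after enough sweeps, and the recovered `Q` matches ordinary tabular policy evaluation.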

Control (Policy Improvement)

Seek a greedy policy based on the return distribution $Z$, i.e. one that maximizes the expectation of $Z(s,a)$ (the ordinary action value $Q(s,a) = \mathbb{E}\big[Z(s,a)\big]$).
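
In symbols:

$$\pi'(s) \in \arg\max_{a} \mathbb{E}\big[Z(s,a)\big] = \arg\max_{a} Q(s,a).$$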

The distributional Bellman optimality operator is not a contraction map; it is not even continuous in any metric, which causes distributional instability. To overcome this problem, discrete action-value distributions are used to approximate the true distribution $Z(s,a)$.
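
A common discrete parametrization (the one used by C51; the bounds $V_{\min}$, $V_{\max}$ and atom count $N$ are hyperparameters) places probability mass $p_i(s,a)$ on a fixed support of atoms:

$$Z_{\theta}(s,a) = \sum_{i=1}^{N} p_i(s,a)\, \delta_{z_i}, \qquad z_i = V_{\min} + (i-1)\,\Delta z, \quad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}.$$

Since the Bellman target $r + \gamma z_i$ generally falls off this support, it is projected back onto the atoms (as in the `project` helper of the sketch above) before the update.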