Definition

Policy iteration alternates policy evaluation and policy improvement until the policy stops changing. With $S$ states and $A$ actions, one sweep of iterative policy evaluation for a deterministic policy costs $O (S^{2})$ , and one policy improvement step costs $O (S^{2} A)$ .

Compared with value iteration, policy iteration typically requires far fewer iterations to reach an optimal policy.

Policy Evaluation

Compute $V^{π}$ from the deterministic policy $π$ . $V_{k + 1} (s) \leftarrow \sum_{s^{'}, r} p (s^{'}, r ∣ s, π (s)) [r + γ V_{k} (s^{'})]$
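The backup above can be sketched in a few lines of Python. This is only an illustrative sketch: the two-state MDP, the `P[s][a]` encoding as `(probability, next_state, reward)` triples, the function name `evaluate_policy`, and the stopping threshold `theta` are all assumptions made for the example, not part of the definition.

```python
# Hypothetical MDP encoding (an assumption for this sketch):
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},  # state 0: stay (r=0) or move (r=1)
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},  # state 1: stay (r=2) or move (r=0)
}

def evaluate_policy(P, policy, gamma=0.9, theta=1e-10):
    """Iterate V_{k+1}(s) = sum_{s',r} p(s',r|s,pi(s)) [r + gamma * V_k(s')]."""
    V = {s: 0.0 for s in P}
    while True:
        # Synchronous sweep: every V_{k+1}(s) is computed from the old V_k.
        V_next = {
            s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            for s in P
        }
        delta = max(abs(V_next[s] - V[s]) for s in P)
        V = V_next
        if delta < theta:
            return V

# Evaluate the deterministic policy "always stay" (action 0 in both states).
V = evaluate_policy(P, policy={0: 0, 1: 0})
```

For this policy the values can be checked by hand: staying in state 0 earns nothing, so $V^{π} (0) = 0$ , while staying in state 1 earns 2 per step, so $V^{π} (1) = 2 / (1 - 0.9) = 20$ .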

Policy Improvement

Improve $π$ to $π^{'}$ by greedy policy based on $V^{π}$ . $π^{'} (s) = \arg\max_{a} \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ V^{π} (s^{'})] = \arg\max_{a} Q^{π} (s, a)$

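Since $Q^{π} (s, π^{'} (s)) \geq V^{π} (s) = \sum_{a} π (a ∣ s) Q^{π} (s, a)$ for every state $s$ , the Policy Improvement Theorem guarantees that $π^{'}$ is at least as good as $π$ . Either $π^{'}$ is strictly better than $π$ , or it is no better, in which case $V^{π}$ satisfies the Bellman optimality equation and the policy is already optimal (in particular when $π^{'} = π$ ).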
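A minimal sketch of the greedy improvement step, under the same kind of assumptions: the two-state MDP, the `P[s][a]` encoding, and the names `q_value` and `improve_policy` are hypothetical. The values `V_pi` are those of the "always stay" policy on this toy MDP, worked out by hand.

```python
# Hypothetical MDP encoding (an assumption for this sketch):
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9

def q_value(P, V, s, a, gamma=GAMMA):
    """Q(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def improve_policy(P, V, gamma=GAMMA):
    """Return the greedy policy pi'(s) = argmax_a Q(s, a)."""
    return {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}

# Values of the "always stay" policy on this toy MDP (0 and 2 / (1 - 0.9)).
V_pi = {0: 0.0, 1: 20.0}
new_policy = improve_policy(P, V_pi)

# The greedy action is never worse than the old value in any state.
gains = {s: q_value(P, V_pi, s, new_policy[s]) - V_pi[s] for s in P}
```

Here the greedy policy switches state 0 from "stay" to "move" (since $1 + 0.9 \cdot 20 = 19 > 0$ ) and keeps "stay" in state 1, so every entry of `gains` is non-negative, as the inequality requires.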

Algorithm

Initialize $V (s) = 0$ and $π (s) \in A (s)$ arbitrarily for all $s$ .
Update every $V (s)$ from all $V (s^{'})$ (full backup) until convergence to $V^{π} (s)$ . $V (s) \leftarrow \sum_{s^{'}, r} p (s^{'}, r ∣ s, π (s)) [r + γ V (s^{'})]$
Improve $π$ by greedy policy based on $V^{π}$ . $π (s) \leftarrow \arg\max_{a} \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ V (s^{'})]$
If the policy is stable (no action changed in the improvement step), stop and return $V \approx v_{*}$ and $π \approx π_{*}$ ; otherwise repeat from the evaluation step.
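The steps above can be put together as follows. This is a sketch under stated assumptions rather than a reference implementation: the toy two-state MDP, the `P[s][a]` encoding, the helper `backup`, and the threshold `theta` are all hypothetical choices made for illustration.

```python
# Hypothetical MDP encoding (an assumption for this sketch):
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

def backup(P, V, s, a, gamma):
    """sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_iteration(P, gamma=0.9, theta=1e-10):
    # Initialize V(s) = 0 and an arbitrary policy (lowest-numbered action).
    V = {s: 0.0 for s in P}
    policy = {s: min(P[s]) for s in P}
    while True:
        # Policy evaluation: in-place full backups until V stops changing.
        while True:
            delta = 0.0
            for s in P:
                v_new = backup(P, V, s, policy[s], gamma)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current V.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: backup(P, V, s, a, gamma))
            if best != policy[s]:
                policy[s] = best
                stable = False
        # Stop once no action changed.
        if stable:
            return V, policy

V, policy = policy_iteration(P)
```

On this toy MDP the loop stops after two rounds with the policy "move in state 0, stay in state 1" and values close to 19 and 20. One caveat of this simple stability check: when two actions tie exactly, it could in principle switch between them; a more careful version keeps the old action unless another one is strictly better.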
