Definition

Policy iteration repeats policy evaluation and policy improvement until convergence. Its time complexity is where and are the numbers of states and actions respectively.
Compare to Value Iteration, policy iteration requires much fewer iterations to reach optimal policy.
Policy Evaluation
Compute from the deterministic policy .
Policy Improvement
Improve to by greedy policy based on .
Since , always either is strictly better than or is optimal when by the Policy Improvement Theorem.
Algorithm
- Initialize , and arbitrarily.
- Update every from all (full backup) until convergence to .
- Improve by greedy policy based on .
- If policy is stable, then stop and return and .
Examples
