Definition

The Monte Carlo (MC) method is tabular and model-free. MC policy iteration adapts GPI on an episode-by-episode basis (sample multi-step backups): policy evaluation (PE) estimates $Q \approx q_\pi$, and policy improvement (PI) is $\epsilon$-greedy.

MC Prediction (Policy Evaluation)

Learn $q_\pi$ from entire episodes of real experience generated under policy $\pi$.

Empirical Mean

MC policy evaluation uses the empirical mean return instead of the expected return:

$$Q(s, a) = \frac{S(s, a)}{N(s, a)}$$

where $S(s, a)$ is the incremental total return and $N(s, a)$ is the incremental count.

The mean return $Q(s, a)$ converges to the true action-value function $q_\pi(s, a)$ as the incremental count $N(s, a) \to \infty$.
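
As a minimal sketch of this bookkeeping (the table names `S_total` and `N_count` and the helper functions are illustrative, not from any particular library), the empirical mean can be maintained with two dictionaries:

```python
from collections import defaultdict

S_total = defaultdict(float)  # S(s, a): sum of sampled returns per pair
N_count = defaultdict(int)    # N(s, a): number of visits per pair

def record_return(state, action, G):
    """Accumulate one sampled return G for the pair (state, action)."""
    S_total[(state, action)] += G
    N_count[(state, action)] += 1

def Q(state, action):
    """Empirical mean return: Q(s, a) = S(s, a) / N(s, a)."""
    return S_total[(state, action)] / N_count[(state, action)]
```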

Incremental MC Update

The incremental mean $\mu_1, \mu_2, \ldots$ of a sequence $x_1, x_2, \ldots$ can be expressed in a recursive manner:

$$\mu_k = \frac{1}{k} \sum_{j=1}^{k} x_j = \frac{1}{k}\left(x_k + (k-1)\,\mu_{k-1}\right) = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$$

Thus, for each state-action pair $(S_t, A_t)$ with return $G_t$, after one episode the mean return can be calculated as:

$$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}\left(G_t - Q(S_t, A_t)\right)$$
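
In code, the recursive form avoids storing running totals of returns; a sketch assuming dict-like tables `Q` and `N` (illustrative names, not a specific library's API):

```python
def incremental_mc_update(Q, N, state, action, G):
    """Move the running mean Q(s, a) toward the newly sampled return G."""
    N[(state, action)] += 1
    Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
```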

Constant-α MC Policy Evaluation

It substitutes the weight $\frac{1}{N(S_t, A_t)}$, used for incremental MC, with a constant step size $\alpha$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left(G_t - Q(S_t, A_t)\right)$$
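
The corresponding one-line change to the sketch above (again with illustrative names):

```python
def constant_alpha_mc_update(Q, state, action, G, alpha=0.1):
    """Constant step size: old returns are forgotten geometrically."""
    Q[(state, action)] += alpha * (G - Q[(state, action)])
```

Because past returns are down-weighted geometrically by powers of $(1 - \alpha)$, this variant tracks non-stationary problems better than the $\frac{1}{N}$ mean, at the cost of never converging exactly.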

MC Control (ε-Greedy Policy Improvement)

Choose the greedy action with probability $1 - \epsilon$ and a random action with probability $\epsilon$:

$$\pi(a \mid s) = \begin{cases} 1 - \epsilon + \epsilon / m & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a') \\ \epsilon / m & \text{otherwise} \end{cases}$$

where $m = |\mathcal{A}|$ is the number of actions.

For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is always an improvement, i.e. $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$.

An $\epsilon$-greedy policy is GLIE (Greedy in the Limit with Infinite Exploration) if $\epsilon$ decays to zero over episodes, e.g. $\epsilon_k = \frac{1}{k}$ on the $k$-th episode.
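
A sketch of ε-greedy action selection with a GLIE schedule; the function names and the `actions` list are assumptions for illustration:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """With probability epsilon pick uniformly at random, else pick greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def glie_epsilon(k):
    """GLIE schedule: epsilon_k = 1/k decays to zero as episode index k grows."""
    return 1.0 / k
```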

Algorithm

  1. Initialize $Q(s, a)$ arbitrarily for all $s, a$, set $Q(\text{terminal}, \cdot) = 0$, and choose an arbitrary $\epsilon$-soft policy $\pi$ (every action has non-zero probability).
  2. Repeat for each episode:
    1. Generate an episode following the policy $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T$.
    2. Repeat for each step of the episode, $t = T-1, T-2, \ldots, 0$, accumulating the return $G \leftarrow \gamma G + R_{t+1}$ (MC Prediction):
      1. If $(S_t, A_t)$ is a first visit (i.e. the pair does not appear in $S_0, A_0, \ldots, S_{t-1}, A_{t-1}$), then $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + w\left(G_t - Q(S_t, A_t)\right)$, where the weight $w$ can be either the incremental MC weight $\frac{1}{N(S_t, A_t)}$ or the constant-$\alpha$ MC step size.
    3. For each visited state $S_t$ in the episode, improve the policy $\epsilon$-greedily with respect to $Q$ (MC Control): $\pi \leftarrow \epsilon\text{-greedy}(Q)$.
  3. If the policy $\pi$ is stable, then stop and return $\pi \approx \pi_*$.
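
Putting the steps together, a minimal sketch of first-visit MC control with ε-greedy improvement; the `env` interface (`reset()`, `step()`, `actions`) is a hypothetical stand-in, not a specific library's API:

```python
import random
from collections import defaultdict

def mc_control(env, num_episodes, gamma=1.0):
    """First-visit MC control with a GLIE epsilon-greedy policy (eps_k = 1/k)."""
    Q = defaultdict(float)  # action-value table Q(s, a)
    N = defaultdict(int)    # visit counter N(s, a)

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k  # GLIE schedule: exploration decays over episodes

        # 1. Generate an episode following the current epsilon-greedy policy
        #    (acting greedily w.r.t. Q *is* the MC-control improvement step).
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < eps:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2. MC prediction: first-visit incremental updates, iterating
        #    backward so the return accumulates as G <- gamma * G + R_{t+1}.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)  # earliest index of each pair
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:  # update only on the first visit
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]

    return Q
```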

Facts

MC is less harmed by violations of the Markov property because it updates value estimates based on the actual sampled return, not on value estimates of successor states (i.e. it does not bootstrap).