Definition

Off-policy control uses two different policies: a target policy $\pi$ that is being learned and a behavior policy $b$ that generates the data. The target policy may be deterministic (e.g., greedy) while the behavior policy remains stochastic. This separation enables the agent to learn the optimal policy while maintaining exploration.

Almost all off-policy methods utilize importance sampling. The return obtained under the behavior policy is weighted by the relative probability of the trajectory under the target and behavior policies, called the importance-sampling ratio (importance weight).

Given a starting state $S_t$, the probability of the subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring under any policy $\pi$ is

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k),$$

where $p$ is the state-transition probability.

Thus, the relative probability of the trajectory under the target and behavior policies is

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
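Because the transition probabilities $p(S_{k+1} \mid S_k, A_k)$ appear in both numerator and denominator, they cancel, so the ratio depends only on the two policies and not on the (possibly unknown) environment dynamics. A minimal Python sketch; the `target_policy(s, a)` / `behavior_policy(s, a)` callable interface is an assumption for illustration:

```python
def importance_ratio(trajectory, target_policy, behavior_policy):
    """Compute rho_{t:T-1} for a trajectory of (state, action) pairs.

    target_policy(s, a) and behavior_policy(s, a) return pi(a|s) and b(a|s);
    this interface is an assumption of the sketch. The transition
    probabilities p(s'|s, a) cancel in the ratio, so they never appear.
    """
    rho = 1.0
    for state, action in trajectory:
        rho *= target_policy(state, action) / behavior_policy(state, action)
    return rho


# Example: a greedy target policy evaluated under a uniform behavior policy.
greedy = lambda s, a: 1.0 if a == 0 else 0.0   # deterministic target (hypothetical)
uniform = lambda s, a: 0.5                     # stochastic behavior over 2 actions
print(importance_ratio([("s0", 0), ("s1", 0)], greedy, uniform))  # (1/0.5)^2 = 4.0
```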

Algorithms

Off-policy MC

Off-policy MC weights the entire episode return by the full trajectory ratio:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ \rho_{t:T-1} G_t - V(S_t) \right],$$

where $\rho_{t:T-1}$ is the importance weight.

The variance of the weighted return in the Monte Carlo method can be dramatically higher, since the importance weight is a product of many per-step ratios over the whole episode.
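A minimal sketch of off-policy MC control with weighted importance sampling, in the spirit of Sutton & Barto's incremental algorithm; the `generate_episode` interface and the uniformly random behavior policy are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(generate_episode, n_actions, gamma=1.0, n_episodes=10_000):
    """Off-policy MC control with weighted importance sampling.

    generate_episode(policy) must return a list of (state, action, reward)
    triples produced by following `policy`; this interface is an assumption.
    The target policy is greedy w.r.t. Q (deterministic), the behavior
    policy is uniformly random (stochastic), matching the definition above.
    """
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))      # cumulative importance weights

    def behavior(state):
        return np.random.randint(n_actions)           # b(a|s) = 1/n_actions

    for _ in range(n_episodes):
        G, W = 0.0, 1.0                               # return and importance weight
        for state, action, reward in reversed(generate_episode(behavior)):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += W / C[state][action] * (G - Q[state][action])
            if action != int(np.argmax(Q[state])):    # pi(a|s) = 0: weight collapses
                break
            W *= n_actions                            # pi/b = 1 / (1/n_actions)
    return Q
```

Updating backwards from the end of the episode lets the importance weight `W` be built up one factor at a time instead of recomputed per state.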

Off-policy Sarsa

The variance of the off-policy TD target is much lower than MC's, because each update needs only a single one-step importance ratio rather than a product over the whole episode.
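One common per-decision form corrects only the bootstrapped next action, since $Q(S_t, A_t)$ already conditions on the state-action pair actually taken:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \frac{\pi(A_{t+1} \mid S_{t+1})}{b(A_{t+1} \mid S_{t+1})} Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right].$$

A minimal sketch of this update, assuming a tabular `numpy` array `Q[state, action]` and explicitly passed policy probabilities (both assumptions for illustration):

```python
import numpy as np

def off_policy_sarsa_update(Q, s, a, r, s_next, a_next,
                            pi_prob, b_prob, alpha=0.1, gamma=0.99):
    """One off-policy Sarsa step with a per-decision importance ratio.

    pi_prob = pi(a_next | s_next), b_prob = b(a_next | s_next); passing
    these explicitly is an assumption of this sketch. Only this single
    one-step ratio appears, which is why variance stays far below MC's.
    """
    rho = pi_prob / b_prob
    td_target = r + gamma * rho * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])


# Example: correct a uniform-random behavior step toward a greedy target.
Q = np.zeros((3, 2))                       # 3 states, 2 actions (hypothetical)
off_policy_sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0,
                        pi_prob=1.0, b_prob=0.5)
print(Q[0, 1])                             # alpha * td_target = 0.1
```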