Definition

Off-policy control uses two different policies: a target policy $\pi$ that is being learned and a behavior policy $b$ that generates the data. The target policy may be deterministic (e.g., greedy) while the behavior policy remains stochastic. This separation enables the agent to learn the optimal policy while maintaining exploration.

Almost all off-policy methods utilize importance sampling. The return obtained under the behavior policy is weighted by the relative probability of the trajectory under the target and behavior policies, called the importance-sampling ratio (importance weight).

Given a starting state $S_t$, the probability of the subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring under any policy $\pi$ is

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k),$$

where $p$ is the state-transition probability.

Thus, the relative probability of the trajectory under the target and behavior policies is

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
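Because the transition probabilities $p(S_{k+1} \mid S_k, A_k)$ appear in both numerator and denominator, they cancel, so the ratio depends only on the two policies and not on the (possibly unknown) environment dynamics. A minimal Python sketch; the `target_policy(s, a)` / `behavior_policy(s, a)` callable interface is an assumption for illustration:

```python
def importance_ratio(trajectory, target_policy, behavior_policy):
    """Compute rho_{t:T-1} for a trajectory of (state, action) pairs.

    target_policy(s, a) and behavior_policy(s, a) return pi(a|s) and b(a|s);
    this interface is an assumption of the sketch. The transition
    probabilities p(s'|s, a) cancel in the ratio, so they never appear.
    """
    rho = 1.0
    for state, action in trajectory:
        rho *= target_policy(state, action) / behavior_policy(state, action)
    return rho


# Example: a greedy target policy evaluated under a uniform behavior policy.
greedy = lambda s, a: 1.0 if a == 0 else 0.0   # deterministic target (hypothetical)
uniform = lambda s, a: 0.5                     # stochastic behavior over 2 actions
print(importance_ratio([("s0", 0), ("s1", 0)], greedy, uniform))  # (1/0.5)^2 = 4.0
```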

Algorithms

Off-policy MC

Off-policy MC weights the entire episode return by the full trajectory ratio:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ \rho_{t:T-1} G_t - V(S_t) \right],$$

where $\rho_{t:T-1}$ is the importance weight.

The variance of the weighted return in the Monte Carlo method can be dramatically higher, since the importance weight is a product of many per-step ratios over the whole episode.
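A minimal sketch of off-policy MC control with weighted importance sampling, in the spirit of Sutton & Barto's incremental algorithm; the `generate_episode` interface and the uniformly random behavior policy are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(generate_episode, n_actions, gamma=1.0, n_episodes=10_000):
    """Off-policy MC control with weighted importance sampling.

    generate_episode(policy) must return a list of (state, action, reward)
    triples produced by following `policy`; this interface is an assumption.
    The target policy is greedy w.r.t. Q (deterministic), the behavior
    policy is uniformly random (stochastic), matching the definition above.
    """
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))      # cumulative importance weights

    def behavior(state):
        return np.random.randint(n_actions)           # b(a|s) = 1/n_actions

    for _ in range(n_episodes):
        G, W = 0.0, 1.0                               # return and importance weight
        for state, action, reward in reversed(generate_episode(behavior)):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += W / C[state][action] * (G - Q[state][action])
            if action != int(np.argmax(Q[state])):    # pi(a|s) = 0: weight collapses
                break
            W *= n_actions                            # pi/b = 1 / (1/n_actions)
    return Q
```

Updating backwards from the end of the episode lets the importance weight `W` be built up one factor at a time instead of recomputed per state.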

Off-policy Sarsa

The variance of the off-policy TD target is much lower than MC's, because each update needs only a single one-step importance ratio rather than a product over the whole episode.
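One common per-decision form corrects only the bootstrapped next action, since $Q(S_t, A_t)$ already conditions on the state-action pair actually taken:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \frac{\pi(A_{t+1} \mid S_{t+1})}{b(A_{t+1} \mid S_{t+1})} Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right].$$

A minimal sketch of this update, assuming a tabular `numpy` array `Q[state, action]` and explicitly passed policy probabilities (both assumptions for illustration):

```python
import numpy as np

def off_policy_sarsa_update(Q, s, a, r, s_next, a_next,
                            pi_prob, b_prob, alpha=0.1, gamma=0.99):
    """One off-policy Sarsa step with a per-decision importance ratio.

    pi_prob = pi(a_next | s_next), b_prob = b(a_next | s_next); passing
    these explicitly is an assumption of this sketch. Only this single
    one-step ratio appears, which is why variance stays far below MC's.
    """
    rho = pi_prob / b_prob
    td_target = r + gamma * rho * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])


# Example: correct a uniform-random behavior step toward a greedy target.
Q = np.zeros((3, 2))                       # 3 states, 2 actions (hypothetical)
off_policy_sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0,
                        pi_prob=1.0, b_prob=0.5)
print(Q[0, 1])                             # alpha * td_target = 0.1
```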