Markov Decision Process

Markov Property
Definition

Markov property refers to the memoryless property of a Stochastic Process.

Consider a Probability Space $(Ω, F, P)$ with a Filtration $(F_{s}, s \in I)$ , and a Measurable Space $(S, S)$ . A Stochastic Process $X = {X_{t} ∣Ω \to S}_{t \in I}$ is said to possess the Markov property if $\forall A \in S, \forall s, t \in I s.t. s < t, P (X_{t} \in A ∣ F_{s}) = P (X_{t} \in A ∣ X_{s})$ Here, $F_{s}$ in the LHS represents all information up to time $s$ , whereas $X_{s}$ conditions only on the single value at time $s$ .

In the case where $S$ is a discrete set and $I = N$ , this can be reformulated as follows $\forall n \in N, P (X_{t + 1} = x_{n + 1} ∣ X_{n} = x_{n}, \dots, X_{1} = x_{1}) = P (X_{n + 1} = x_{n + 1} ∣ X_{n} = x_{n})$
Link to original

Markov Chain
Definition

$(S, P)$ A Markov chain is a Stochastic Process ${X_{t}}$ that satisfies Markov Property. It consists of a set of states $S$ and a Transition Probability Matrix $P$

Summary

Conditions and Properties of Markov Chain (Finite States)

Example Finite IRR Aperiodicity $\exists! p^{⋆}$ ^[a unique Limiting Distribution] $\exists! π$ ^[a unique Stationary Distribution] $\overset{ˉ}{π} \to π$ ^[Ergodic Theorem] $\overset{ˉ}{π} \to π = p^{⋆}$ ^[Ergodicity]
O X X X X X X
Identity/Block matrix O X O X X X X
Exchange matrix O O X X O O X
Ergodic Markov Chain O O O O O O O
flowchart LR
  mc[HMC] --> finite
  
  subgraph finiteness
    finite([finite])
  end
  
  finite --> finite_mc
  finite_mc --> irreducible
  finite_mc --> reducible
  
  subgraph Irreducibility
    irreducible([irreducible])
    reducible([reducible])
  end
  irreducible --> irr_mc[Irreducible MC]
  reducible -->|divide|irr_mc

  irr_mc ---|holds|egt([Ergodic Theorem])
  irr_mc --> aperiodic
  
  subgraph Aperiodicity
    aperiodic([aperiodic])
  end
  
  aperiodic --> ergmc[Ergodic MC]
Conditions and Properties of Markov Chain (Infinite States)

IRR Nature Aperiodicity $\exists! p^{⋆}$ ^[a unique Limiting Distribution] $\exists! π$ ^[a unique Stationary Distribution] $\overset{ˉ}{π} \to π$ ^[Ergodic Theorem] $\overset{ˉ}{π} \to π = p^{⋆}$ ^[Ergodicity]
O Transient can’t define X X X X
O Null recurrent no matter X X X X
O Positive recurrent X X O O X
O Positive recurrent O O O O O
flowchart LR
  mc[HMC]
  mc --> irreducible
  mc --> reducible
  
  subgraph Irreducibility
    irreducible([irreducible])
    reducible([reducible])
  end
  
  irr_mc[Irreducible MC]
  irreducible --> irr_mc
  reducible -->|divide|irr_mc
  
  irr_mc --> pr
  irr_mc --> nr
  irr_mc --> tr
  
  subgraph Recurrence
    pr([positive recurrent])
    nr([null recurrent])
    tr([transient])
  end
    
  pr_mc[Recurrent MC]
  pr --> pr_mc
  pr_mc --> aperiodic

  egt([Ergodic Theorem])
  pr_mc ---|holds|egt

  subgraph Aperiodicity
    aperiodic([aperiodic])
  end
  
  aperiodic --> ergmc[Ergodic MC]
Link to original

Example	Finite	IRR	Aperiodicity	$\exists! p^{⋆}$ ^[a unique Limiting Distribution]	$\exists! π$ ^[a unique Stationary Distribution]	$\overset{ˉ}{π} \to π$ ^[Ergodic Theorem]	$\overset{ˉ}{π} \to π = p^{⋆}$ ^[Ergodicity]
	O	X	X	X	X	X	X
Identity/Block matrix	O	X	O	X	X	X	X
Exchange matrix	O	O	X	X	O	O	X
Ergodic Markov Chain	O	O	O	O	O	O	O

IRR	Nature	Aperiodicity	$\exists! p^{⋆}$ ^[a unique Limiting Distribution]	$\exists! π$ ^[a unique Stationary Distribution]	$\overset{ˉ}{π} \to π$ ^[Ergodic Theorem]	$\overset{ˉ}{π} \to π = p^{⋆}$ ^[Ergodicity]
O	Transient	can’t define	X	X	X	X
O	Null recurrent	no matter	X	X	X	X
O	Positive recurrent	X	X	O	O	X
O	Positive recurrent	O	O	O	O	O

Transition
Definition

$μ_{t + 1} = P^{⊺ μ_{t}} \Leftrightarrow μ_{t + 1}^{⊺} = μ_{t}^{⊺} P$

where $μ$ is the distribution of states and $P$ is Transition Matrix
Link to original

Transition Probability
Definition

$p_{ij} = P (j ∣ i) = P (X_{t + 1} = j ∣ X_{t} = i)$ where ${X_{t}}_{t \geq 0}$ is HMC that has countable state space $E$

The conditional probability of the Transition from state $i$ to $j$

Facts

$p_{ij} \geq 0$

Link to original

Transition Probability Matrix
Definition

$P = [P_{ij}]$ A transition probability matrix $P$ is a matrix whose $(i, j)$ -th element is $p_{ij}$ . where each $p_{ij}$ is a Transition Probability of HMC ${X_{t}}_{t \geq 0}$ having countable state space

Facts

Since the number of elements of state space can be infinity, strictly speaking $P$ is not a matrix. However, several operations of matrix are satisfied for the $P$

$\sum_{j \in E} p_{ij} = 1$ The sum of each row of a transition matrix is one

Link to original

Markov Decision Processes
Definition

Example of a simple MDP with three states (green circles) and two actions (orange circles), with two rewards (orange arrows)

Markov decision process (MDP) is a tuple $(S, A, P, R)$ . Where:

$S$ : A set of states (State space)

$A$ : A set of actions (Action space)

$P_{s s^{'}}^{a} = p (s^{'} ∣ s, a) = P (S_{t + 1} = s^{'} ∣ S_{t} = s, A_{t} = a)$ : A Transition Probability from $s$ to $s^{'}$ given $a$ .

$R_{s s^{'}}^{a} = r (s, a, s^{'})$ : The immediate Reward received after transitioning from state $s$ to $s^{'}$ due to action $a$ .

Facts

Environmental model in MDP is the Transition Probability. If the Transition Probability is known, it is called a model-based MDP, otherwise called a model-free MDP.

Any Markov decision process satisfies the followings:

There exists an optimal policy $\exists π_{*} s.t. \forall π, \forall s, v_{π_{*}} (s) \geq v_{π} (s)$

All optimal policies achieve the optimal state-value function $v_{π_{*}} (s) = v_{*} (s)$

All optimal policies achieve the optimal action-value function $q_{π_{*}} (s, a) = q_{*} (s, a)$

Link to original

Reward
Definition

Reward $R_{t}$ is a scalar feedback indicating how well the agent is doing at step $t$ .

There are three reward types:

$R_{s} = r (s)$ : the reward in the state $s$ .

$R_{s}^{a} = r (s, a)$ : the reward by the action $a$ in the state $s$ .

$R_{s s^{'}}^{a} = r (s, a, s^{'})$ : the reward received after transitioning from state $s$ to $s^{'}$ due to action $a$ .

Expected Reward

Under known dynamics $p (s^{'}, r ∣ s, a)$ of all transitions $(s, a, s^{'}, r)$ , one can compute the followings:

State transition probability $P_{s s^{'}}^{a} = p (s^{'} ∣ s, a) = P (S_{t + 1} = s^{'} ∣ S_{t} = s, A_{t} = a) = r \in R \sum p (s^{'}, r ∣ s, a)$

Expected reward for (state, action) pair $R_{s}^{a} = r (s, a) = E [R_{t + 1} ∣ S_{t} = s, A_{t} = a] = r \in R \sum r s^{'} \in S \sum p (s^{'}, r ∣ s, a)$

Expected reward for (state, action, next state) triple $R_{s s^{'}}^{a} = r (s, a, s^{'}) = E [R_{t + 1} ∣ S_{t} = s, A_{t} = a, S_{t + 1} = s^{'}] = \frac{r \in R \sum r p ( s ^{'} , r ∣ s , a )}{p ( s ^{'} ∣ s , a )}$

Facts

Reinforcement learning is based on the Reward Hypothesis.

Link to original

Reward Hypothesis
Definition

All goals can be described by the maximization of the expected value of the cumulative sum of rewards.
Link to original

Return
Definition

$G_{t} := k = 0 \sum \infty γ^{k} R_{t + k + 1} = R_{t + 1} + γ G_{t + 1}$ where $γ \in [0, 1]$ is a discount factor

Return $G_{t}$ is the total discounted Reward from time step $t$ .
Link to original

Policy
Definition

Deterministic Policy

$π : S \to A, π (s)$ A deterministic policy $π$ is a mapping from state space $S$ to action space $A$ .

Stochastic Policy

$π (a ∣ s) = P (A_{t} = a ∣ S_{t} = s)$ A stochastic policy $π$ is a probability distribution over actions $A$ for a given state $S$ .

Optimal Policy

$π_{*} (s) = a argmax q_{*} (s, a)$ A policy maximizing the cumulative sum of rewards is called an optimal policy.

Better Policy

Consider two policies $π$ and $π^{'}$ . $π^{'}$ is better policy than $π$ if $\forall s, v_{π^{'}} (s) \geq v_{π} (s)$

Facts

Under known MDP, there exists deterministic optimal policy $π_{*} (s)$ .

Under unknown MDP, stochastic policy is needed.

Link to original

State-Value Function
Definition

$v_{π} (s) = E_{π} [G_{t} ∣ S_{t} = s] = a \sum π (a ∣ s) q_{π} (s, a)$ where $q_{π} (s, a)$ is a Action-Value Function.

A state-value function $v_{π} (s)$ is the expected discounted Return starting in state $s$ under policy $π$ .

Optimal State-Value Function

$v_{*} (s) = v_{π_{*}} (s) = max_{π} v_{π} (s) = max_{a} q_{*} (s, a)$ An optimal state-value function is the maximum possible state-value function (a state-value function under optimal policy).
Link to original

Action-Value Function
Definition

$q_{π} (s, a) = E_{π} [G_{t} ∣ S_{t} = s, A_{t} = a]$ An action-value function $q_{π} (s, a)$ is the expected discounted Return of taking action $a$ starting in state $s$ under policy $π$ .

Optimal Action-Value Function

$q_{*} (s, a) = q_{π_{*}} (s, a) = max_{π} q_{π} (s, a) = r (s, a) + γ s^{'} \sum p (s^{'} ∣ s, a) v_{*} (s^{'})$ An optimal action-value function is the maximum possible action-value function (an action-value function under optimal policy).
Link to original

Advantage Function
Definition

$A_{π} (s, a) = q_{π} (s, a) - v_{π} (s)$ An advantage function $A_{π} (s, a)$ is the difference between the Action-Value Function $q_{π} (s, a)$ and the State-Value Function $v_{π} (s)$ under policy $π$ . It represents how much better it is to take action $a$ in state $s$ compared to the average action.
Link to original

Bellman Equation

Law of Total Probability
Definition

Suppose ${B_{n} ∣ n \in N}$ is a set of mutually and collectively exclusive events, then for any event $A$ : $P (A) = n \sum P (A ∣ B_{n}) P (B_{n}) = n \sum P (A, B_{n})$
Link to original

Law of Large Numbers
Definition

Weak Law of Large Numbers

Consider a Sequence ${X_{n}}$ of i.i.d. random variables with mean $E [X_{i}] = μ$ , variance $σ^{2} < \infty$ . Let $\overset{ˉ}{X}_{n} = \frac{1}{n} i = 1 \sum n X_{i}$ be the sample average. Then, the sample average converges in probability to the expected value. $lim_{n \to \infty} \overset{ˉ}{X}_{n} \to P μ$

Strong Law of Large Numbers

Consider a Sequence ${X_{n}}$ of i.i.d. random variables with mean $E [X_{i}] = μ$ , variance $σ^{2} < \infty$ . Let $\overset{ˉ}{X}_{n} = \frac{1}{n} i = 1 \sum n X_{i}$ be the sample average. Then, the sample average converges almost surely to the expected value. $lim_{n \to \infty} \overset{ˉ}{X}_{n} \to a . s . μ$
Link to original

Bellman Expectation Equation
Definition

Bellman expectation equation is a recursive equation decomposing State-Value Function $v_{π} (s)$ into immediate Reward $R_{s}^{a}$ and discounted future state-value $γ v_{π} (s^{'})$ .

Bellman Equation for State-Value Function

$v_{\pi}(s) &= E_{\pi}[G_{t} | S_{t}=s]\\ &= P(A_{t}=a|S_{t}=s)\cdot E_{\pi}[G_{t} | S_{t}=s, A_{t}=a]\quad \left( = \sum\limits_{a} \pi(a|s) q_{\pi}(s, a) \right)\\ &= \sum\limits_{a} \pi(a|s) \cdot E_{\pi}[R_{t+1} + \gamma G_{t+1} | S_{t}=s, A_{t}=a]\\ &= \sum\limits_{a} \pi(a|s) \cdot \sum\limits_{s', r} E_{\pi}[R_{t+1} + \gamma G_{t+1} | S_{t}=s, A_{t}=a, S_{t+1}=s', R_{t+1}=r]\cdot P(S_{t+1}=s', R_{t+1}=r | S_{t}=s, A_{t}=a)\\ &= \sum\limits_{a} \pi(a|s) \cdot \sum\limits_{s', r} p(s', r | s, a)[r + \gamma E_{\pi}[G_{t+1} | S_{t+1}=s']]\quad \text{(by Markov property)}\\ &= \sum\limits_{a} \pi(a|s) \cdot \sum\limits_{s', r} p(s', r | s, a)[r + \gamma v_{\pi}(s')]\quad \left( =\sum\limits_{a} \pi(a|s)[R_{s}^{a} + \gamma \sum\limits_{s'}P_{ss'}^{a} v_{\pi}(s')] \right)\\ &= E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_{t}=s] \end{aligned}$$ ## Bellman Equation for [[Action-Value Function]] ![[Pasted image 20241207130544.png|450]] $$\begin{aligned} q_{\pi}(s, a) &= E_\pi[G_{t} | S_{t}=s, A_{t}=a]\\ &= \sum\limits_{s'r}p(s', r | s, a)[r + \gamma \underbrace{\sum\limits_{a'} \pi(a'|s') q_{\pi}(s', a')}_{v_{\pi}(s')}]\quad \left( = R_{s}^{a} + \gamma \sum\limits_{s'} P_{ss'}^{a} \underbrace{\sum\limits_{a'} \pi(a'|s') q_{\pi}(s', a')}_{v_{\pi}(s')} \right)\\ &= E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_{t}=s, A_{t}=a] \end{aligned}$$$ Link to original

Bellman Optimality Equation
Definition

Under model-based (known MDP), the optimal value functions can be iteratively evaluated by dynamic programming.

Bellman Optimality Equation for State-Value Function

$v_{*}(s) &= \max_{a} q_{*}(s, a)\\ &= \max_{a} E_{\pi_{*}}[G_{t} | S_{t}=s, A_{t}=a]\\ &= \max_{a} E_{\pi_{*}}[R_{t+1} + \gamma G_{t+1} | S_{t}=s, A_{t}=a]\\ &= \max_{a} E_{s',r}[R_{t+1} + \gamma v_{*}(S_{t+1}) | S_{t}=s, A_{t}=a]\\ &= \max_{a} \sum\limits_{s',r}p(s', r | s, a) [r + \gamma v_{*}(s')]\quad \left( =\max_{a} [R_{s}^{a} + \gamma\sum\limits_{s'}P_{ss'}^{a} v_{*}(s')] \right) \end{aligned}$$ ## Bellman Optimality Equation for [[Action-Value Function]] ![[Pasted image 20241207152144.png|450]] $$\begin{aligned} q_{*}(s, a) &= E_{s',r}[R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') | S_{t}=s, A_{t}=a]\\ &= \sum\limits_{s',r}p(s',r | s, a)[r + \gamma \max_{a'} q_{*}(s', a')]\quad \left(= R_{s}^{a} + \gamma \sum\limits_{s'}P_{ss'}^{a} \max_{a'} q_{*}(s', a') \right) \end{aligned}$$$ Link to original

Dynamic Programming

Dynamic Programming
Definition

Dynamic programming is a method for solving complex problems by breaking them down into simpler sub-problems in a recursive manner.

Dynamic programming works when a problem has the following properties:

Optimal substructure: An optimal solution can be constructed from optimal solutions of its sub-problems.

Overlapping sub-problems: Solutions of same sub-problems are reused several times.

Algorithms

Value Iteration

Policy Iteration
Link to original

Value Iteration
Definition

Value iteration iteratively updates state-value function until convergence. Its time complexity is $O (S^{2} A)$ where $S$ and $A$ are the numbers of states and actions respectively.

Algorithm

Initialize $\forall s, V (s) = 0$

Update $V (s)$ iteratively from all $V (s^{'})$ (full backup) until convergence to $V^{*} (s)$ . $V (s) \leftarrow max_{a} s^{'}, r \sum p (s^{'}, r ∣ s, a) [r + γV (s^{'})]$

Synchronoius backups: compute $V_{k + 1} (s)$ for all $s$ and update simultaneously.

Asynchronoius backups: compute $V_{k + 1} (s)$ for one $s$ and update it immediately.

Compute the optimal policy $π_{*}$ (one-step lookahead) and return it. $π_{*} (s) = a argmax Q^{*} (s, a) s^{'}, r \sum p (s^{'}, r ∣ s, a) [r + γ V^{*} (s^{'})]$

Link to original

Policy Improvement Theorem
Definition

Consider two policies $π$ and $π^{'}$ . $\forall s \in S, Q^{π} (s, π^{'} (s)) \geq V^{π} (s) (=) ⟹ \forall s \in S, V^{π^{'}} (s) \geq V^{π} (s)$ This implies that $π^{'}$ is a better policy than $π$ .
Link to original

Policy Iteration
Definition

Policy iteration repeats policy evaluation and policy improvement until convergence. Its time complexity is $O (S^{2})$ where $S$ and $A$ are the numbers of states and actions respectively.

Compare to Value Iteration, policy iteration requires much fewer iterations to reach optimal policy.

Policy Evaluation

Compute $V^{π}$ from the deterministic policy $π$ . $V_{k + 1} (s) \leftarrow s^{'}, r \sum p (s^{'}, r ∣ s, π (s)) [r + γ V_{k} (s^{'})]$

Policy Improvement

Improve $π$ to $π^{'}$ by greedy policy based on $V^{π}$ . $π^{'} (s) = a argmax s^{'}, r \sum p (s^{'}, r ∣ s, a) [r + γ V^{π} (s^{'})] = a argmax Q^{π} (s, a)$

Since $Q^{π} (s, π^{'} (s)) \geq V^{π} (s) = a \sum π (a ∣ s) Q^{π} (s, a)$ , always either $π^{'}$ is strictly better than $π$ or $π^{'}$ is optimal when $π^{'} = π$ by the Policy Improvement Theorem.

Algorithm

Initialize $\forall s, V (s) = 0$ , and $\forall s, π (s) \in A (s)$ arbitrarily.

Update every $V (s)$ from all $V (s^{'})$ (full backup) until convergence to $V^{π} (s)$ . $V (s) \leftarrow s^{'}, r \sum p (s^{'}, r ∣ s, π (s)) [r + γV (s^{'})]$

Improve $π$ by greedy policy based on $V^{π}$ . $π (s) \leftarrow a argmax s^{'}, r \sum p (s^{'}, r ∣ s, a) [r + γV (s^{'})]$

If policy is stable, then stop and return $V \approx v_{*}$ and $π \approx π_{*}$ .

Examples

Link to original

Reinforcement Learning

Generalized Policy Iteration
Definition

Generalized policy iteration uses the repeatedly approximated value function to the true value of the current policy (sample backup) and the policy is repeatedly improved to approach the optimality.
Link to original

Monte Carlo Method
Definition

Monte Carlo method (MC) is tabular updating and model-free. MC policy iteration adapts GPI based on episode-by-episode (sample multi-step backup) of PE estimating $Q (s, a) = q_{π} (s, a)$ and $ϵ$ -greedy PI.

MC Prediction (Policy Evaluation)

Learn $q_{π}$ from entire episodes of real experience under policy $π$ .

Empirical Mean

MC policy evaluation uses empirical mean return $Q (s, a)$ instead of expected return. $Q (s, a) = \frac{S ( s , a )}{n ( s , a )}$ where $S (s, a)$ is increment total return and $n (s, a)$ is incremental count.

The mean return converges to true action-value function as incremental count increases. $lim_{n (s, a) \to \infty} Q (s, a) = q_{π} (s, a)$

Incremental MC Update

The incremental mean can be expressed in a recursive manner: $μ_{k} = μ_{k - 1} + \frac{1}{k} (x_{k} - μ_{k - 1})$

Thus, for each state-action pair $(S_{t}, A_{t})$ with return $G_{t}$ , after one episode, the mean return $Q (s, a)$ can be calculated as: $Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + \frac{1}{n ( S _{t} , A _{t} )} [G_{t} - Q (S_{t}, A_{t})]$

Constant- $α$ MC Policy Evaluation

It substitutes the weights, $\frac{1}{n ( S _{t} , A _{t} )}$ used for incremental MC, with constant step-size $α$ . $Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [G_{t} - Q (S_{t}, A_{t})] = (1 - α) Q (S_{t}, A_{t}) + α G_{t}$

MC Control ( $ϵ$ -Greedy Policy Improvement)

Choose the greedy action with probability $1 - ϵ$ and a random action with probability $ϵ$ . $π^{'} (a ∣ s) = ⎩ ⎨ ⎧ 1 - ϵ + \frac{ϵ}{∣ A ( s ) ∣} \frac{ϵ}{∣ A ( s ) ∣} a = a^{'} argmax Q^{π} (s, a^{'}) otherwise$

For any $ϵ$ -greedy policy $π$ , $ϵ$ -greed policy $π^{'}$ with respect to $q_{π}$ is always improved i.e. $v_{π^{'}} (s) \geq v_{π} (s)$ .

$ϵ$ -greedy policy is GLIE if $ϵ_{k} \to 0$ e.g. $ϵ_{k} = \frac{1}{k}$

Algorithm

Initialize $\forall s \in S, a \in A (S), Q (S, A)$ arbitrarily, terminal $Q = 0$ , and arbitrary $ϵ$ -soft policy $π$ (non-empty probability).

Repeat for each episode:

Generate an episode following the policy $π$ : $S_{0}, A_{0}, R_{1}, S_{1}, \dots, S_{T - 1}, A_{T - 1}, R_{T}$ .

$G \leftarrow 0$

Repeat for each step of an episode, $t = T - 1, T - 2, \dots, 0$ (MC Prediction):

$G \leftarrow γ G + R_{t + 1}$

If $(S_{t}, A_{t})$ is a first visit ( $(S_{t}, A_{t}) \in / {(S_{i}, A_{i}) ∣ i = 0, \dots, t - 1}$ ), then $Q (S_{t}, A_{t}) \leftarrow A V G ({G_{t} ∣ t \in {0, 1, \dots, T}, s = S_{t}, a = A_{t}})$ where $A V G$ can be either incremental MC or constant- $α$ MC.

For each visited state $S_{t}$ in the episode (MC Control):

$A^{*} \leftarrow a argmax Q (S_{t}, a)$

$π (a ∣ S_{t}) = ⎩ ⎨ ⎧ 1 - ϵ + \frac{ϵ}{∣ A ( S ) ∣} \frac{ϵ}{∣ A ( S ) ∣} a = A^{*} otherwise$

If policy is stable, then stop and return $π \approx π_{*}$ .

Facts

MC is less harmed by violations of Markov Property because it updates value estimates based on true Return $G_{t}$ (not based on value estimates of successor states).

Link to original

Time Difference Learnings

GLIE policy
Definition

A learning Policy is called GLIE (Greedy in Limit with Infinite Exploration) if it satisfies:

All state-action pairs are explored infinitely many times $\forall s \in S, a \in A (S), lim_{k \to \infty} n (s, a) = \infty$ where $n (s, a)$ is incremental count of a $(s, a)$ .

The learning policy converges to a greedy policy. $lim_{k \to \infty} π_{k} (a ∣ s) = 1$ where $a = a^{'} argmax Q_{k} (s, a^{'})$

Link to original

Temporal Difference Learning

Temporal Difference Learning (TD) is tabular updating and model-free. TD policy iteration adapts GPI based on one-step transitions of sample episodes.

Algorithms

Sarsa

Q-Learning

Examples

Sarsa allows for potential penalties from exploration moves, which tends to make it avoid a dangerous optimal path and prefers a slower but safer path. In contrast, Q-Learning ignores these penalties and takes action with the highest action value, which results in its occasional falling because of $ϵ$ -greedy actions.

Sarsa allows for potential penalties from exploration moves, which tends to make it avoid a dangerous optimal path and prefer a slower but safer path. In contrast, Q-Learning ignores these penalties and takes action based on the highest action value, which can occasionally result in falling off due to $ϵ$ -greedy action.
Link to original

Sarsa
Definition

$Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [R_{t + 1} + γ Q (S_{t + 1}, A_{t + 1}) - Q (S_{t}, A_{t})]$ Sarsa learns $Q^{π}$ by using Bellman Expectation Equation. The both behavior and target policies of Sarsa is derived from $Q$ , so it is an on-policy method.

Algorithm

Initialize $\forall s \in S, a \in A (S), Q (s, a)$ arbitrarily, and $Q (terminal, \cdot) = 0$ .

Repeat for each episode:

Initialize $S$ .

Choose an action $A$ from the initial state $S$ using policy derived from $Q$ (e.g. $ϵ$ -greedy).

Repeat for each step of an episode until $S$ is terminal:

Take the action $A$ and observe a reward $R$ and a next state $S^{'}$ .

Choose an action $A^{'}$ from the next state $S^{'}$ using policy derived from $Q$ (e.g. $ϵ$ -greedy).

$Q (S, A) \leftarrow Q (S, A) + α [R + γ Q (S^{'}, A^{'}) - Q (S, A)]$

Update $S \leftarrow S^{'}$ and $A \leftarrow A^{'}$

Link to original

Q-Learning
Definition

$Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [R_{t + 1} + γ max_{a} Q (S_{t + 1}, a) - Q (S_{t}, A_{t})]$ Q-learning directly learns $Q^{*}$ by using Bellman Optimality Equation. The behavior policy of Q-learning is derived from $Q$ while the target policy is $max_{a} Q$ , so it is an off-policy method.

Algorithm

Initialize $\forall s \in S, a \in A (S)$ , $Q (s, a)$ arbitrarily, and $Q (terminal, \cdot) = 0$ .

Repeat for each episode:

Initialize $S$ .

Repeat for each step of an episode until $S$ is terminal:

Choose an action $A$ from the initial state $S$ using policy derived from $Q$ (e.g. $ϵ$ -greedy).

Take the action $A$ and observe a reward $R$ and a next state $S^{'}$ .

$Q (S, A) \leftarrow Q (S, A) + α [R + γ a max Q (S^{'}, a) - Q (S, A)]$

Update $S \leftarrow S^{'}$

Link to original

Importance Sampling
Definition

Importance sampling is used when we want to estimate the expectation $E_{p (x)} [f (x)]$ of a function $f$ under a probability distributtion $p (x)$ , but sampling directly from the distribution is either inefficient or impossible. $E_{p (x)} [f (x)] = \int f (x) \frac{p ( x )}{q ( x )} q (x) d x = E_{q (x)} [f (x) \frac{p ( x )}{q ( x )}] \approx \frac{1}{N} n = 1 \sum N \frac{p ( x _{n} )}{q ( x _{n} )} f (x_{n})$ Where

$p (x)$ is the target distribution

$q (x)$ is the proposal distribution whose support includes that of the support of the target distribution $p (x)$ ( $\forall x, p (x) > 0 \Rightarrow q (x) > 0$ ). The proposal distribution needs to be easy to evaluate and sample from. Also, to reduce the variance of the expectation, $q (x)$ should be high where $∣ p (x) f (x) ∣$ is high.

$\frac{p ( x )}{q ( x )}$ is the importance weight.

Link to original

Off-Policy Control
Definition

Off-policy control uses two different policies for target policy $π$ and behavior policy $μ$ . The target policy $π$ may be deterministic while the behavior policy $μ$ is stochastic. It enables the model to learn optimal policy while maintaining exploration.

Almost all off-policy method utilize Importance Sampling. The Return obtained under the target policy is weighted according to the relative probability of the trajectories under the target and behavior policy, called the importance-sampling ratio (importance weight).

Given a starting state $S_{t}$ , the probability of the state-action trajectory ${A_{t}, S_{t + 1}, A_{t + 1}, \dots, S_{T}}$ occurring under a policy $π$ is written as $P ({A_{t}, S_{t + 1}, A_{t + 1}, \dots, S_{T}} ∣ S_{t}, A_{t : T - 1} \sim π) = \prod_{k = t}^{T - 1} π (A_{k} ∣ S_{k}) p (S_{k + 1} ∣ S_{k}, A_{k})$ where $p$ is the state-transition probability.

Thus, the relative probability of the trajectory under the target and behavior policies is $ρ_{t : T - 1} = \prod_{k = t}^{T - 1} \frac{π ( A _{k} ∣ S _{k} )}{μ ( A _{k} ∣ S _{k} )}$

Algorithms

Off-policy MC

$Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [ρ_{t : T - 1} G_{t} - Q (S_{t}, A_{t})]$ where $ρ_{t : T - 1}$ is the importance weight.

The variance of return in Monte Carlo method can dramatically higher.

Off-policy Sarsa

$Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [\frac{π ( A _{t + 1} ∣ S _{t + 1} )}{μ ( A _{t + 1} ∣ S _{t + 1} )} (R_{t + 1} + γ Q (S_{t + 1}, A_{t + 1})) - Q (S_{t}, A_{t})]$

The variance of return in off-policy TD is lower than MC’s one.
Link to original

Expected Sarsa
Definition

$Q(S_{t}, A_{t}) &\leftarrow Q(S_{t}, A_{t}) + \alpha [R_{t+1} + \gamma E_{\pi}[Q(S_{t+1}, A_{t+1})|S_{t+1}] - Q(S_{t}, A_{t})]\\ &(= Q(S_{t}, A_{t}) + \alpha [R_{t+1} + \gamma \sum\limits_{a \in \mathcal{A}(S_{t+1})} Q(S_{t+1}, a)\pi(a | S_{t+1}) - Q(S_{t}, A_{t})) \end{aligned}$$ Expected Sarsa uses the expected value over the next state-action pairs, so all actions viable in the state are considered. Expected Sarsa performs better than [[Sarsa]] in terms of the variance of its [[Return]] but costs more.$ Link to original

Double Q-Learning
Definition

Q-Learning takes the action with the highest value, which tends to overestimate the value of actions (maximization bias) in the early stages of learning, slowing the convergence to $q_{*} (s, a)$ .

Double Q-learning uses two Q-value functions to overcome this problem: one for action selection and another for value evaluation.

Algorithm

Initialize $\forall s \in S, a \in A (S)$ , $Q_{1} (s, a)$ , and $Q_{2} (s, a)$ arbitrarily, and $Q_{1} (terminal, \cdot) = 0$ and $Q_{2} (terminal, \cdot) = 0$ .

Repeat for each episode:

Initialize $S$ .

Repeat for each step of an episode until $S$ is terminal:

Choose an action $A$ from the initial state $S$ using policy derived from $Q_{1}$ and $Q_{2}$ (e.g. $ϵ$ -greedy in $Q_{1} + Q_{2}$ ).

Take the action $A$ and observe a reward $R$ and a next state $S^{'}$ .

With probability $0.5$ : $Q_{1} (S, A) \leftarrow Q_{1} (S, A) + α [R + γ Q_{2} (S^{'}, a argmax Q_{1} (S^{'}, a)) - Q_{1} (S, A)]$ else: $Q_{2} (S, A) \leftarrow Q_{2} (S, A) + α [R + γ Q_{1} (S^{'}, a argmax Q_{2} (S^{'}, a)) - Q_{2} (S, A)]$

Update $S \leftarrow S^{'}$

Examples

Q-Learning initially learns to take the left action much more often than the right action despite the lower true state-value of the left. Even at asymptote, it takes the left action more than optimal (5%). In contrast, double Q-learning is unaffected by maximization bias.
Link to original

n-Step Bootstrapping

n-Step Return
Definition
$G_{t:t+n} :&= \sum\limits_{k=0}^{n-1}\gamma^{k} R_{t+k+1} + \gamma^{n}V(S_{t+n}) \\ &= R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n}V(S_{t+n}) \end{aligned}$$ n-step returns can be considered approximations to the full [[Return]], truncated after $n$ steps and then corrected for remaining missing terms by the [[State-Value Function|state-value]] estimator $V(S_{t+n})$. # Examples - $G_{t:t+1}$: (one-step) [[Temporal Difference Learning|TD]] target - $G_{t:t+n}$: [[n-Step TD]] target - $G_{t:\infty}$: [[Monte Carlo Method|MC]] target$ Link to original

n-Step TD
Definition

$V (S_{t}) \leftarrow V (S_{t}) + α [G_{t : t + n} - V (S_{t})]$ where $G_{t : t + n}$ is the n-Step Return.

The n-step TD learning is a method that bridges the gap between TD and MC methods.

Examples

$n = 1$ : Temporal Difference Learning

$n = \infty$ : Monte Carlo Method

Larger $n$ reduces bias but increases variance.
Link to original

n-Step Sarsa
Definition

$Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [G_{t : t + n} - Q (S_{t}, A_{t})]$ where $G_{t : t + n} := k = 0 \sum n - 1 γ^{k} R_{t + k + 1} + γ^{n} Q (S_{t + n}, A_{t + n})$ is the n-Step Return.
Link to original

n-step Off-Policy Learning
Definition

The relative probability of the n-step trajectory under the target and behavior policies is $ρ_{t : t + n} = \prod_{k = t}^{t + n} \frac{π ( A _{k} ∣ S _{k} )}{μ ( A _{k} ∣ S _{k} )}$

Algorithms

Off-policy n-Step TD

$V (S_{t}) \leftarrow V (S_{t}) + α [ρ_{t : t + n} G_{t : t + n} - V (S_{t})]$ where $ρ_{t : t + n}$ is the importance weight.

Off-policy n-Step Sarsa

$Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α [ρ_{t : t + n} G_{t : t + n} - Q (S_{t}, A_{t})]$ where $ρ_{t : t + n}$ is the importance weight, and $G_{t : t + n} := k = 0 \sum n - 1 γ^{k} R_{t + k + 1} + γ^{n} Q (S_{t + n}, A_{t + n})$ .
Link to original

Eligibility Traces

lambda-Return
Definition

The $λ$ -return is a way to combine $n$ -step returns for different $n$ ‘s using a weighted average.
$G_{t}^{\lambda} :&= (1-\lambda) \sum\limits_{n=1}^{\infty}\lambda^{n-1} G_{t:t+n}\\ &= (1-\lambda) \sum\limits_{n=1}^{T-t-1}\lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1}G_{t} \end{aligned}$$ where $G_{t:t+n}$ is the [[n-Step Return]]. # Examples - $\lambda=0$: [[Temporal Difference Learning|TD]] target - $\lambda =1$: [[Monte Carlo Method|MC]] return - $0 < \lambda < 1$: creates a smooth blend of all $n$-step returns.$ Link to original

TD(lambda)
Definition

TD( $λ$ ) is a TD algorithm that uses the lambda-Return for value function updates. $V (S_{t}) \leftarrow V (S_{t}) + α [G_{t}^{λ} - V (S_{t})]$ where $G_{t}^{λ}$ is the lambda-Return

Examples

$T D (0)$ : Temporal Difference Learning

$T D (1)$ : Monte Carlo Method

Link to original

Deep Reinforcement Learning

DQNs

Deep Q-Network
Definition

DQN has large or continuous state space and discrete action space.

Naive DQN

Naive DQN treats $r_{t + 1} + γ max_{a} Q (s_{t + 1}, a; θ)$ as a target and minimizes MSE loss by SGD. $L (θ) = [r_{t + 1} + γ max_{a} Q (s_{t + 1}, a; θ) - Q (s_{t}, a_{t}; θ)]^{2}$

Due to the training instability and correlated samples, the Naive DQN has very poor results, even worse than a linear model.

Experience Replay

Online RL agents incrementally update parameters while observing a stream of experience. This structure cause strongly temporally-correlated updates, breaking i.i.d. assumption, and rapidly forget rare experiences that would be useful later on. Experience replay stores experiences $(s, a, r, s^{'})$ in the replay buffer, and randomly samples temporally uncorrelated minibatches from the replay buffer when learning.

Target Network

If target function is changed too frequently, then this moving target makes training difficult (non-stationary target problem). The target network technique updates the parameters of the behavior Q-network $Q (\cdot; θ)$ at every step, while updating the parameters of the target Q-network $\hat{Q} (\cdot; \hat{θ})$ sporadically (e.g. every $k$ steps). $L (θ) = \frac{1}{B} i \in I \sum [r_{i + 1} + γ max_{a} \hat{Q} (s_{i + 1}, a; \hat{θ}) - Q (s_{i}, a_{i}; θ)]^{2}$ where $B$ is the size of a minibatch, and $I$ is a $B$ -sized index set drew from the replay buffer.

Algorithm

Initialize behavior network $Q$ and target network $\hat{Q}$ with random weights $θ$ , and the replay buffer $R$ to max size $N$ .

Repeat for each episode:

Initialize sequence $s_{0}$ .

Repeat for each step of an episode until terminal, $t = 0, 1, \dots, T - 1$ :

With probability $ϵ$ , select a random action $a_{t}$ otherwise select $a_{t} = a argmax Q (s_{t}, a; θ)$ .

Take the action $a_{t}$ and observe a reward $r_{t + 1}$ and a next state $s_{t + 1}$ .

Store transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ in the replay buffer $R$ .

Sample random minibatch of $B$ transitions $(s_{i}, a_{i}, r_{i + 1}, s_{i + 1})$ from $R$ .

$y_{i} \leftarrow r_{i + 1} + γ a max \hat{Q} (s_{i + 1}, a; \hat{θ})$ .

Perform Gradient Descent on loss $L (θ) = (y_{i} - Q (s_{i}, a_{i}; θ))^{2}$ .

Update the target network parameter $\hat{θ} \leftarrow θ$ every $C$ steps.

Update $s_{t} \leftarrow s_{t + 1}$ .

Link to original

Double DQN
Definition

Double DQN is a variation of DQN that uses the idea of Double Q-Learning to reduce overestimation. In this case, the target network in DQN works as the second network for double Q-learning without introducing an additional network.

$L (θ) = [r_{t + 1} + γ \hat{Q} (s_{t + 1}, a argmax Q (s_{t + 1}, a; θ); \hat{θ}) - Q (s_{t}, a_{t}; θ)]^{2}$

Algorithm

Initialize behavior network $Q$ and target network $\hat{Q}$ with random weights $θ$ , and the replay buffer $R$ to max size $N$ .

Repeat for each episode:

Initialize sequence $s_{0}$ .

Repeat for each step of an episode until terminal, $t = 0, 1, \dots, T - 1$ :

With probability $ϵ$ , select a random action $a_{t}$ otherwise select $a_{t} = a argmax Q (s_{t}, a; θ)$ .

Take the action $a_{t}$ and observe a reward $r_{t + 1}$ and a next state $s_{t + 1}$ .

Store transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ in the replay buffer $R$ .

Sample random minibatch of $B$ transitions $(s_{i}, a_{i}, r_{i + 1}, s_{i + 1})$ from $R$ .

$y_{i} \leftarrow r_{i + 1} + γ \hat{Q} (s_{i + 1}, a argmax Q (s_{i + 1}, a; θ); \hat{θ})$ .

Perform Gradient Descent on loss $L (θ) = (y_{i} - Q (s_{i}, a_{i}; θ))^{2}$ .

Update the target network parameter $\hat{θ} \leftarrow θ$ every $C$ steps.

Update $s_{t} \leftarrow s_{t + 1}$ .

Link to original

TD Error
Definition

The TD error $δ$ is the difference between the target and the current prediction.

TD error for State-Value Function

$δ_{t} = r_{t + 1} + γV (s_{t + 1}) - V (s_{t})$

TD error for SARSA

$δ_{t} = r_{t + 1} + γ Q (s_{t + 1}, a_{t + 1}) - Q (s_{t}, a_{t})$

TD error for Q-Learning

$δ_{t} = r_{t + 1} + γ max_{a} Q (s_{t + 1}, a) - Q (s_{t}, a_{t})$
Link to original

Prioritized Replay
Definition

Prioritized replay is an enhancement to the standard experience replay. Instead of uniformly sampling experiences from the replay buffer, prioritized replay samples important transitions more frequently based on their priority values. It makes it possible to learn more efficiently as the agent focuses on important experiences more.

The priority of transition $i$ is calculated by TD Error. $p_{i} = ∣ δ_{i} ∣ + ϵ$ where $δ$ is the TD Error, and $ϵ \in R^{+}$ .

The probability of sampling transition $i$ is $P (i) = \frac{p _{i}^{α}}{\sum _{k} p _{k}^{α}}$ where $α$ controls how much prioritization is used (uniform sampling if $α = 0$ ).

However, it can lead to a loss of diversity and introduce bias, but these issues can be alleviated and corrected with stochastic sampling prioritization (using $α$ ) and Importance Sampling weights.

The Importance Sampling weights are calculated as $w_{i} = (\frac{1}{N} \cdot \frac{1}{P ( i )})^{β}$ where

$N$ is the size of the replay buffer

$β \in [0, 1]$ is annealing parameter that fully compensates the bias when $β = 1$ . It starts from $0$ and ends with $1$ .

The wight $w_{i}$ is normalized by $\frac{1}{k max w _{k}}$ before used for stability.

Algorithm

Double DQN with Prioritized Replay

Initialize behavior network $Q$ and target network $\hat{Q}$ with random weights $θ$ , and the replay buffer $R$ to max size $N$ .

Repeat for each episode:

Initialize sequence $s_{0}$ .

Repeat for each step of an episode until terminal, $t = 0, 1, \dots, T - 1$ :

With probability $ϵ$ , select a random action $a_{t}$ otherwise select $a_{t} = a argmax Q (s_{t}, a; θ)$ .

Take the action $a_{t}$ and observe a reward $r_{t + 1}$ and a next state $s_{t + 1}$ .

Store transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ in the replay buffer $R$ with maximal priority $p_{t} = i < t max p_{i}$ .

For every $K$ (replay period) steps:

Sample random minibatch of $B$ transitions $(s_{i}, a_{i}, r_{i + 1}, s_{i + 1}) \sim P (i) = \frac{p _{i}^{α}}{\sum _{k} p _{k}^{α}}$ from $R$ .

Compute the importance weight $w_{i} = \frac{( NP ( i ) ) ^{- β}}{k max w _{k}}$

Compute TD Error $δ_{t} = r_{i + 1} + γ \hat{Q} (s_{i + 1}, a argmax Q (s_{i + 1}, a; θ); \hat{θ}) - Q (s_{i}, a_{i}; θ)$

Update transition priority $p_{i} \leftarrow ∣ δ_{i} ∣ + ϵ$

Accumulate weight-change $Δ \leftarrow Δ + w_{i} δ_{i} \nabla_{θ} Q (s_{i}, a_{i}; θ)$

Update behavior network weights $θ \leftarrow θ + η Δ$ and reset $Δ = 0$

Update the target network parameter $\hat{θ} \leftarrow θ$ every $C$ steps.

Update $s_{t} \leftarrow s_{t + 1}$ .

Link to original

Dueling DQN
Definition

Dueling DQN has two streams: advantage $A (s, a)$ and action-independent state-value $V (s)$ , sharing a feature encoder (CNN), and combined by an aggregator to produce action-value $Q (s, a) = V (s) + A (s, a)$ .

Aggregating module is unidentifiable in the sense that given a Q-value function, there are multiple possible decompositions into value and advantage functions ( $∵ Q (s, a) = (V (s) + c) + (A (s, a) - c)$ where $c$ is a constant value). This makes learning process unstable and less efficient. To force a unique decomposition, we introduce a constraint that makes the Advantage Function have zero-mean. $Q (s, a) = V (s) + (A (s, a) - \frac{1}{∣ A ∣} a^{'} \sum A (s, a^{'}))$
Link to original

REINFORCE

Policy Gradient
Definition

Policy gradient algorithms directly learn the optimal policy by a parametric probability distribution $π_{θ} (a ∣ s)$ . The policy stochastically selects an action $a$ in a state $s$ according to a parameter $θ$ . It typically proceeds by sampling a stochastic policy and adjusting the parameter $θ$ in the direction of maximizing the total reward.

The objective function of policy gradient algorithm is defined as: $J (θ) = E_{π_{θ}} [r (τ)] = \int p (τ; θ) r (τ) d τ$ where $τ$ is a trajectory, $r (τ)$ is the total reward of the trajectory $τ$ , and $p (τ; θ) = π_{θ} (τ)$ is the PDF of $τ$ .
Link to original

Policy Gradient Theorem
Definition

$\nabla_{θ} J (θ) = \nabla_{θ} E_{π_{θ}} [r (τ)] = E_{π_{θ}} [r (τ) \sum_{t = 0}^{T - 1} \nabla_{θ} ln π_{θ} (a_{t} ∣ s_{t})]$ The derivative of the expected total reward is the expectation of the product of total rewards and summed gradients of log of the policy $π_{θ}$ .

Proof

$\nabla E_{π} [r (τ)] = \nabla \int r (τ) π (τ) d τ = \int \frac{\nabla π ( τ )}{π ( τ )} r (τ) π (τ) d τ = \int \nabla ln (π (τ)) r (τ) π (τ) d τ = E [\nabla ln (π (τ)) r (τ)]$ $\nabla ln π (τ) = \nabla ln (p (s_{0}) \prod_{t = 0}^{T - 1} π (a_{t} ∣ s_{t}) p (s_{t + 1} ∣ s_{t}, a_{t})) = t = 0 \sum T - 1 \nabla ln π (a_{t} ∣ s_{t})$
Link to original

REINFORCE
Definition

$\nabla_{θ} J (θ) \approx \frac{1}{M} i = 1 \sum M [\sum_{t = 0}^{T - 1} G_{t}^{(i)} \nabla_{θ} ln π_{θ} (a_{t}^{(i)} ∣ s_{t}^{(i)})]$ REINFORCE algorithm is a policy gradient algorithm that maximizes the expected return. The objective function of REINFORCE algorithm based on the Policy Gradient Theorem. It substitutes the expectation and the total reward of Policy Gradient with averaging and returns $G_{t}$ .

Algorithm

Execute $M$ trajectories (Each starts from a state $s$ under the policy $π_{θ}$ ).

Approximate the gradient of the objective function $J (θ)$ $g_{θ} \leftarrow \frac{1}{M} i = 1 \sum M [\sum_{t = 0}^{T - 1} G_{t}^{(i)} \nabla_{θ} ln π_{θ} (a_{t}^{(i)} ∣ s_{t}^{(i)})]$

Update policy to maximize $J (θ)$ $θ \leftarrow θ + α g_{θ}$ where $α$ is a learning rate.

Link to original

REINFORCE with Baseline
Definition

REINFORCE with Baseline algorithm is a variant of the REINFORCE algorithm that helps reduce variance in Policy Gradient methods by adopting Actor-Critic Method. It modifies the objective function of the REINFORCE algorithm by subtracting a baseline from the returns, which helps reduce the variance of the gradient without introducing bias. $\nabla_{θ} J (θ) = \nabla_{θ} E_{π_{θ}} [r (τ)] = E_{π_{θ}} [\sum_{t = 0}^{T - 1} (G_{t} - b (s_{t})) \nabla_{θ} ln π_{θ} (a_{t} ∣ s_{t})]$ where $b (s_{t})$ is a baseline function not related to $a_{t}$ (commonly chosen as a State-Value Function $V (s) = E_{π_{θ}} [G_{t} ∣ S_{t} = s]$ ).

Actor (Policy Gradient Update)

$Δ θ = α (G_{t} - V (s_{t}; ϕ)) \nabla_{θ} ln π_{θ} (a_{t} ∣ s_{t})$

Critic (Value Network Update)

$Δ ϕ = β (G_{t} - V (s_{t}; ϕ)) \nabla_{ϕ} V (s_{t}; ϕ)$

Algorithm

Initialize state-value $V (s; ϕ)$ and policy $π (a ∣ s; θ)$ networks randomly.

Set the hyperparameters: step-sizes $α, β > 0$ , and discount factor $0 < γ \leq 1$

Repeat for each episode (Each starts from a state $s$ under the policy $π_{θ}$ ):

Repeat for each step of an episode until terminal, $t = 0, 1, \dots, T - 1$ :

$δ \leftarrow G_{t} - V (s_{t}; ϕ)$

$ϕ \leftarrow ϕ + β δ \nabla_{ϕ} V (s_{t}; ϕ)$ (minimize $L (ϕ) = E_{π} [\sum_{t = 0}^{T - 1} (G_{t} - V (s_{t}; ϕ))^{2}]$ )

$θ \leftarrow θ + α γ^{t} δ \nabla_{θ} ln π (a_{t} ∣ s_{t}; θ)$ (maximize $J (θ)$ with the Policy Gradient $\nabla_{θ} J (θ) = E_{π_{θ}} [\sum_{t = 0}^{T - 1} (G_{t} - V (s_{t}; ϕ)) \nabla_{θ} ln π_{θ} (a_{t} ∣ s_{t})]$ )

Link to original

Actor-Critics

Actor-Critic Method
Definition

Actor-Critic method consists of two networks: actor and critic networks.

Actor network updates parameter $θ$ for policy $π_{θ} (a ∣ s)$ by maximizing $J (θ)$ using Policy Gradient $\nabla_{θ} J (θ) = E_{π_{θ}} [\sum_{t = 0}^{T - 1} target \nabla_{θ} ln π_{θ} (a_{t} ∣ s_{t})]$

Critic network updates parameter $ϕ$ for value function $V (s; ϕ)$ by minimizing $L (ϕ) = E_{π} [\sum_{t = 0}^{T - 1} (target - V (s_{t}; ϕ))^{2}]$

Examples

REINFORCE with Baseline

Actor-Critic Method with TD(0) Return

Asynchronous Advantage Actor-Critic Method

Deterministic Policy Gradient

Deep Deterministic Policy Gradient
Link to original

Actor-Critic Method with TD(0) Return
Definition

Actor (Policy Gradient Update)

$Δ θ = α (r_{t + 1} + γV (s_{t + 1}; ϕ) - V (s_{t}; ϕ)) \nabla_{θ} ln π_{θ} (a_{t} ∣ s_{t})$

Critic (Value Network Update)

$Δ ϕ = β (r + γV (s_{t + 1}; ϕ) - V (s_{t}; ϕ)) \nabla_{ϕ} V (s_{t}; ϕ)$

Algorithm

Initialize critic $V (s; ϕ)$ and actor $π (a ∣ s; θ)$ networks randomly.

Set the hyperparameters: step-sizes $α, β > 0$ , and discount factor $0 < γ \leq 1$

Repeat for each episode (Each starts from a state $s$ under the policy $π_{θ}$ ):

$I = 1$

Repeat for each step of an episode until terminal, $t = 0, 1, \dots, T - 1$ :

Select action $a$ according to policy $π (\cdot ∣ s; θ)$ .

Take the action $a$ and observe a reward $r$ and a next state $s^{'}$ .

$δ \leftarrow r + γV (s^{'}; ϕ) - V (s; ϕ)$

$ϕ \leftarrow ϕ + β δ \nabla_{ϕ} V (s; ϕ)$

$θ \leftarrow θ + α I δ \nabla_{θ} ln π (a ∣ s; θ)$

$I \leftarrow γ I$

$s \leftarrow s^{'}$

Link to original

Asynchronous Advantage Actor-Critic Method
Definition

A3C (Asynchronous Advantage Actor-Critic Method) is an Actor-Critic Method that utilizes multiple networks which are a global network and multiple worker agents working in parallel across multiple instances of the environment, and agents update asynchronously the global network parameter. The parallelism reduce each agent’s temporal correlation. The return $G_{t : t + n} - V (s_{t})$ estimates the Advantage Function $Q (s_{t}, a_{t}) - V (s_{t})$ using the n-Step Return $G_{t : t + n}$ instead of Action-Value Function $Q (s_{t}, a_{t})$ .

Actor (Policy Gradient Update)

$Δ θ = α [(G_{t : t + n} - V (s_{t}; ϕ)) \nabla_{θ} ln π_{θ} (a_{t} ∣ s_{t}) + λ H (π_{θ} (a_{t} ∣ s_{t}))]$ where $H (p (x)) = - \int p (x) ln p (x) d x$ is an Entropy that encourages exploration

Critic (Value Network Update)

$Δ ϕ = β (G_{t : t + n} - V (s_{t}; ϕ)) \nabla_{ϕ} V (s_{t}; ϕ)$

Algorithm

For each worker agent

Let $θ, ϕ$ be global network’s shared parameters, and let $θ^{'}, ϕ^{'}$ be a worker agent’s parameters.

Set the hyperparameters: step-sizes $α, β > 0$ , discount factor $0 < γ \leq 1$ , regularization factor $λ$ , maximum steps per update $t_{max}$ .

Repeat:

Reset gradients $d θ \leftarrow 0$ and $d ϕ \leftarrow 0$ .

Synchronize parameters $θ^{'} \leftarrow θ$ and $ϕ \leftarrow ϕ^{'}$ .

Set $t_{start} \leftarrow t$ and get state $s_{t}$ .

For $t \leq t_{start} + t_{max}$ :

Select action $a_{t}$ according to policy $π (\cdot ∣ s_{t}; θ^{'})$ .

Take the action $a_{t}$ and observe a reward $r_{t + 1}$ and a next state $s_{t + 1}$ .

$t \leftarrow t + 1$

$R = V (s_{t}; ϕ^{'})$ (or $0$ for terminal $s_{t}$ )

For $i = t - 1, t - 2, \dots, t_{start}$ :

$R \leftarrow r_{i + 1} + γ R$ .

Accumulate gradients with respect to $θ^{'}$ $d θ \leftarrow d θ + (R - V (s_{i}; ϕ^{'})) \nabla_{θ^{'}} ln π_{θ^{'}} (a_{i} ∣ s_{i}) + λ H (π_{θ} (a_{i} ∣ s_{i}))]$

Accumulate gradients with respect to $ϕ^{'}$ $d ϕ \leftarrow d ϕ - (R - V (s_{i}; ϕ^{'})) \nabla_{ϕ^{'}} V (s_{t}; ϕ^{'})$

Update asynchronously $θ \leftarrow θ + α d θ$ and $ϕ \leftarrow ϕ - β d ϕ$ .

Link to original

DPGs

Deterministic Policy Gradient
Definition

Deterministic Policy Gradient (DPG) learns a deterministic policy $a = μ (s)$ as an actor on continuous action spaces and an Action-Value Function $Q (s, a)$ as a critic.

DPG requires fewer samples to approximate the gradient than stochastic Policy Gradient because DPG updates the parameter only over the state space, according to the Deterministic Policy Gradient Theorem. $Δ θ = \nabla_{a} Q (s, a) ∣_{a = μ (s; θ)} \nabla_{θ} μ (s; θ)$
Link to original

State Visitation Frequency
Definition

State visitation frequency is the discounted sum of probabilities of visiting a given state $s$ under policy $π$ . $ρ_{π} (s) = t = 0 \sum \infty γ^{t} P (s_{t} = s ∣ π) = \int_{S} \sum_{t = 0}^{\infty} γ^{t} p_{0} (s^{'}) p_{t} (s^{'} \to s ∣ π) d s^{'}$ where $p_{0} (s^{'})$ is the initial and $p_{t} (s^{'} \to s ∣ π)$ is the visitation probability from $s^{'}$ to $s$ in $t$ step following $π$ .

$(1 - γ) ρ_{π}$ can be considered as a Distribution Function over the state space.
Link to original

Deterministic Policy Gradient Theorem
Definition
$\nabla_{\theta}J(\theta) &= \nabla_{\theta} E_{s\sim \rho_{\mu}}[Q(s, a)|_{a=\mu(s;\theta)}]\\ &= \int_{S} \rho_{\mu}(s)\nabla_{a}Q(s, a)|_{a=\mu(s;\theta)}\nabla_{\theta}\mu(s;\theta)ds\quad\text{(by chain rule)}\\ &= E_{s\sim\rho_{\mu}}[\nabla_{a}Q(s, a)|_{a=\mu(s;\theta)}\nabla_{\theta}\mu(s;\theta)] \end{aligned}$$$ Link to original

Deep Deterministic Policy Gradient
Definition

Deep Deterministic Policy Gradient (DDPG) is an Actor-Critic Method based on DPG that learns deterministic policies to handle continuous action space by adapting DQN architecture such as experience replay and target network.

Due to the determinacy of the policy of DDPG, extra noise is necessary for exploration. The exploration policy is constructed by the sum of the determinant action and a noise process. $a_{t} = μ (s_{t}; θ) + N_{t}$ where $N := (N_{t})$ is a noise process.

The target network updating in DQN is substituted by soft (gradual) updating.
$\hat{\phi} &\leftarrow \lambda\phi + (1-\lambda)\hat{\phi}\\ \hat{\theta} &\leftarrow \lambda\theta + (1-\lambda)\hat{\theta} \end{aligned}$$ ## Actor (Policy Gradient Update) $$J(\theta) = E_{s\sim \rho_{\mu}}[Q(s, a; \phi)]\Big|_{a = \mu(s;\theta)}$$ $$\Delta\theta = \nabla_{a}Q(s, a;\phi)|_{a=\mu(s;\theta)}\nabla_{\theta}\mu(s;\theta)$$ Where: - $\phi$: behavior critic network. - $\theta$: behavior actor network. ## Critic (Value Network Update) $$L(\phi) = E_{s\sim \rho_{\mu}}[r + \gamma \hat{Q}(s', \hat{\mu}(s';\hat{\theta}); \hat{\phi}) - Q(s, a; \phi)]^{2}\Big|_{a = \mu(s;\theta)}$$ $$\Delta\phi = (r + \gamma \hat{Q}(s', \hat{\mu}(s';\hat{\theta}); \hat{\phi}) - Q(s, a; \phi))\nabla_{\phi}Q(s, a; \phi)$$ Where: - $\phi$: behavior critic network. - $\theta$: behavior actor network. - $\hat{\phi}$: target critic network. - $\hat{\theta}$: target actor network. # Algorithm 1. Initialize critic network $Q(s, a; \phi)$ and actor network $\mu(s; \phi)$ randomly, target networks $\hat{Q}, \hat{\mu}$ with weights $\hat{\phi} = \phi, \hat{\theta} = \theta$, and the replay buffer $\mathcal{R}$ to max size $N$. 3. Repeat for each episode: 1. Initialize state $s_{0}$ and a random noise process $\mathcal{N}$. 2. Repeat for each step of an episode until terminal, $t=0, 1, \dots, T-1$ : 1. Select an action $a_{t} = \mu(s_{t}; \theta) + \mathcal{N}_{t}$. 2. Take the action $a_{t}$ and observe a reward $r_{t+1}$ and a next state $s_{t+1}$. 3. Store transition $(s_{t}, a_{t}, r_{t+1}, s_{t+1})$ in the replay buffer $\mathcal{R}$. 4. Sample random minibatch of $B$ transitions $(s_{i}, a_{i}, r_{i+1}, s_{i+1})$ from $\mathcal{R}$. 5. $y_{i} \leftarrow r_{i+1} + \gamma \hat{Q}(s_{i+1}, \hat{\mu}(s_{i+1};\hat{\theta}); \hat{\phi})$. 6. Update critic network by minimizing the loss $L = \cfrac{1}{B}\sum\limits_{i}(y_{i} - Q(s_{i}, a_{i}; \phi))^{2}$. 7. Update actor network using the [[Deterministic Policy Gradient]] $\nabla_{\theta}J \approx \cfrac{1}{B}\sum\limits_{i}\nabla_{a}Q(s_{i}, a;\phi)|_{a=\mu(s;\theta)}\nabla_{\theta}\mu(s_{i};\theta)$ 8. Update target networks $\hat{\phi} \leftarrow \lambda\phi + (1-\lambda)\hat{\phi}$ $\hat{\theta} \leftarrow \lambda\theta + (1-\lambda)\hat{\theta}$$ Link to original

TRPO

Trust Region Policy Optimization
Definition

The performance of DDPG does not improve monotonically. TRPO addresses the problem by adopting two concepts: Minorize-Maximization Algorithm and trust region. TRPO updates the policy parameter $θ$ on a trust region in the policy space, guaranteeing monotonic improvement of the objective function (expected return). Although TRPO has achieved high performance, its implementation is very complex so being not practical.

TRPO starts from the maximization problem of expected return $η (π) := E_{τ} [t = 0 \sum \infty γ^{t} (s_{t})]$

The original problem is approximated multiple times to find a surrogate function and be solved by MM Algorithm. The surrogate function satisfying the trust region constraint is found using the conservative policy iteration update. $η (π) \geq L_{π_{old}} (π) - \frac{4 ϵ γ}{( 1 - γ ) ^{2}} max_{s} D_{K L} (π_{θ_{old}} (\cdot ∣ s) ∣∣ π_{θ} (\cdot ∣ s))$ where $L_{π_{old}} (π) = η (π) + s \sum ρ_{π_{old}} (s) a \sum π (a ∣ s) A_{π_{old}} (s, a)$ is a local approximation of $η (π)$ , and $ρ_{π} (s)$ is a State Visitation Frequency.

The KL penalized optimization problem is transformed to a constrained optimization problem by Duality. $max_{θ} L_{π_{θ_{old}}} (π_{θ}) subject to max_{s} D_{K L} (π_{θ_{old}} (\cdot ∣ s) ∣∣ π_{θ} (\cdot ∣ s)) \leq δ$

By approximations, the final optimization problem is $E_{s \sim ρ_{θ_{old}}, a \sim π_{θ_{old}}} [\frac{π _{θ} ( a ∣ s )}{π _{θ_{old}} ( a ∣ s )} A_{θ_{old}} (s, a)] subject to E_{s \sim ρ_{θ_{old}}} [D_{K L} (π_{θ_{old}} (\cdot ∣ s) ∣∣ π_{θ} (\cdot ∣ s))] \leq δ$

The optimization problem is solved by the natural policy gradient and check the monotonic improvement with Backtracking Line Search $θ = θ_{old} + \frac{2 δ}{g ^{⊺} F ^{- 1} g} F^{- 1} g$ where $g = \nabla_{θ} L_{θ_{old}} (θ) ∣_{θ = θ_{old}} = \nabla_{θ} (\frac{π _{θ} ( a ∣ s )}{π _{θ_{old}} ( a ∣ s )} A_{θ_{old}} (s, a))_{θ = θ_{old}}$ is a policy gradient, $F$ is the Fisher Information Metric, and $δ$ is the constraint threshold.

TRPO is performed by repeating 3 steps in each iteration

Collect trajectories on $π_{old}$ , and update $Q$ -values.

By averaging over samples, construct the estimated objective and constraint.

Approximately solve the constrained optimization by natural policy gradient and Backtracking Line Search to update the policy parameter $θ$ .

Algorithm

Initialize policy parameters $θ_{0}$ .

For $k = 0, 1, \dots, K$ :

Collect a set of trajectories under current policy $π_{k} = π (θ_{k})$ .

Compute advantages $A^{π_{k}} (s, a)$ using an advantage estimation algorithm.

Compute policy gradient $g_{k}$ and Fisher Information Metric $F_{k}$ .

Compute natural gradient $x_{k} \approx F_{k}^{- 1} g_{k}$ .

Estimate proposed step $Δ_{k} \leftarrow \frac{2 δ}{x _{k}^{⊺} F _{k} x _{k}} x_{k}$

Perform Backtracking Line Search to obtain final update: 2. Find minimum $j \in Z_{0}^{+}$ such that the KL constraint is satisfied. $j min j \in Z_{0}^{+} such that D_{K L} (π_{θ_{k}} (\cdot ∣ s) ∣∣ π_{θ_{k} + α^{j} Δ_{k}} (\cdot ∣ s)) \leq δ$

Update the policy $θ_{k + 1} \leftarrow θ_{k} + α^{j} Δ_{k}$

Link to original

MM Algorithm
Definition

The MM algorithm is an iterative optimization method which exploits the convexity of a function in order to find its maxima or minima. The MM stands for Majorize-Minimization or Minorize-Maximization, depending on whether the desired optimization is a minimization or a maximization.

The MM algorithm works by finding a surrogate function that minorizes or majorizes the objective function. Optimizing the surrogate function will monotonically improve the value of the objective function.

Algorithm

Consider Minorize-Maximization algorithm for a maximization problem of a function $f (θ)$ .

Find a surrogate function $g (θ ∣ θ_{m})$ (minorized $f (θ)$ ) satisfying:

$\forall θ, g (θ ∣ θ_{m}) \leq f (θ)$

$g (θ ∣ θ_{m}) = f (θ_{m})$

Find a point maximizes the surrogate function $θ_{m + 1} = θ argmax g (θ ∣ θ_{m})$ Then, $f (θ_{m}) = g (θ_{m} ∣ θ_{m}) \leq g (θ_{m + 1} ∣ θ_{m}) \leq f (θ_{m + 1})$

Use the maximum point $θ_{m + 1}$ as the next point.

Re-evaluate the surrogate function $g (θ ∣ θ_{m + 1})$ at the new point $θ_{m + 1}$ and repeat Iteration until convergence to a (local) maximum.

Link to original

Natural Gradient
Definition

The natural gradient is an optimization method that takes into account the geometry of the parameter space when updating parameters. Unlike the standard gradient, which points in the direction of the steepest ascent in the Euclidean Space of parameters, the natural gradient points in the direction of the steepest ascent in the space of probability Distribution induced by the parameters.

$\tilde{\nabla}_{θ} J (θ) = F (θ)^{- 1} \nabla_{θ} J (θ)$ The natural gradient uses the Fisher Information matrix $F (θ)$ to define a Metric on the parameter space (Fisher Information Metric). It can be interpreted as the steepest ascent direction in the space of probability distributions, as measured by the KL-Divergence. Instead of taking a step that maximizes the change in the objective function $J$ in the Euclidean space of parameters, the natural gradient takes a step that maximizes the change in $J$ while keeping the change in the probability distribution, as measured by KL divergence, relatively small.
Link to original

PPO

Proximal Policy Optimization
Definition

Although TRPO has achieved monotonic improvement, its implementation and computation are complicated due to KL constraint. PPO removed the KL constraint term in the optimization problem of TRPO by introducing the clipped surrogate objective, which provides penalty when the policy update is too high.

PPO apply clipping to the policy ratio $\frac{π _{θ} ( a ∣ s )}{π _{θ_{old}} ( a ∣ s )}$ . The cliiped surrogate function is defined as $L^{C L I P} (θ) = E [min (r (θ) A_{θ_{old}} (s, a), clip (r (θ), 1 - ϵ, 1 + ϵ) A_{θ_{old}} (s, a))]$ where $r (θ) = \frac{π _{θ} ( a ∣ s )}{π _{θ_{old}} ( a ∣ s )}$ is the policy ratio, and $ϵ$ is a hyperparameter that determines the clipping range.

$L^{C L I P} (θ)$ satisfies $L^{C L I P} (θ) \leq L^{TRPO} (θ)$ , so we can still use MM Algorithm.

To minimize the value function loss simultaneously and to encourage exploration, PPO appends additional terms to the objective function. $L^{C L I P + V F + S} (θ) = L^{C L I P} - c_{1} L^{V F} (θ) + c_{2} S [π_{θ}] (s)$ Where

$L^{C L I P} (θ)$ is the clipped surrogate objective

$L^{V F} (θ) = (V_{target} - V_{θ} (s))^{2}$ is the value function loss

$S [π_{θ}] (s) = - a \sum π_{θ} (a ∣ s) ln π_{θ} (a ∣ s)$ is the Entropy bonus terms.

$c_{1}, c_{2}$ are hyperparameter that controls the weight of the losses.

Link to original

Distributional RL

Distributional Reinforcement Learning
Definition

Distributional RL treats the Reward $R (x, a)$ as a Random Variable, and uses random return $Z^{π} (x, a) = t \geq 0 \sum γ^{t} R (x_{t}, a_{t})$ , called an action-value distribution, instead of the Action-Value Function.

Algorithms

C51

QR-DQN

IQN
Link to original

Distributional Bellman Equation
Definition

Distributional Bellman Equation

$Z^{π} (x, a) = D R (x, a) + γ Z^{π} (X^{'}, A^{'})$

Distributional Bellman Operator

$(T^{π} Z) (x, a) := D R (x, a) + γ Z (X^{'}, A^{'})$

The distributional Bellman equation can be written with the operator as $Z^{π} = T^{π} Z^{π}$
Link to original

Distributional Bellman Optimality Equation
Definition

Distributional Bellman Optimality Equation

$Z^{*} (x, a) = D R (x, a) + γ Z^{*} (X^{'}, a^{'} argmax E [Z^{*} (X^{'}, a^{'})])$

Distributional Bellman Optimality Operator

$T Z (x, a) := D R (x, a) + γ Z (X^{'}, a^{'} argmax E [Z (X^{'}, a^{'})])$

The distributional Bellman optimality equation can be written with the operator as $Z^{*} = T^{π} Z^{*}$
Link to original

Distributional Policy Iteration
Definition

Policy Evaluation

Compute $Z^{π}$ from the current policy $π$ .

Distributional Bellman operator $T^{π}$ for a fixed policy $π$ is a Contraction Map in a maximal form of Wasserstein Metric. $\overline{W}_{p} (Z_{1}, Z_{2}) = sup_{x, a} W_{p} (Z_{1} (x, a), Z_{2} (x, a))$

Thus, the iteration $Z \leftarrow T^{π} Z$ converges to a unique fixed point $Z^{π}$ (by Contraction Lemma).

Control (Policy Improvement)

Seek a greedy policy $π$ based on the $Z^{*}$ , which maximizes the expectation of $Z^{*}$ .

Distributional Bellman optimality operator $T$ is not a Contraction Map, not even continuous in any Metric, causing the distributional instability. To overcome the problem, discrete action-value distributions are used to approximate the true distribution $Z$ .
Link to original

C51

C51
Definition

The C51 algorithm is a Distributional Reinforcement Learning algorithm that approximates the distribution of the random return a using discretized distribution.

Architecture

The range of possible returns is divided into a fixed support of $N$ -equally spaced bins ${z_{i}}_{i = 0}^{N - 1}$ , where $z_{i} = R_{min} + i Δ z$ and $Δ z = \frac{R _{max} - R _{min}}{N - 1}$ . $R_{min}$ and $R_{max}$ are the minimum and maximum possible return values, respectively. In the C51 algorithm, the support determined once at the beginning and remains fixed throughout the entire learning process.

For each state-action pair $(s, a)$ , the algorithm estimates the probabilities ${p_{i} (s, a)}$ that the return will fall into each $i$ -th bin. The probabilities are estimated by a neural network, which takes the state $s$ as input and outputs a vector of probabilities ${p_{i} (s, a)}$ for each state-possible action pair $(s, a), \forall a \in A (s)$ . The Softmax Function ensures the properties of the probability of the output for each state-action pair. $p_{i} (s, a; θ) = \frac{e ^{θ_{i} (s, a)}}{\sum _{j = 0}^{N - 1} e ^{θ_{j} (s, a)}}$ where $θ_{i} (s, a)$ are the raw outputs (logits) of the network for the $i$ -th atom and action $a$ , and $θ$ represents the network’s parameters.

The discrete distribution $Z_{θ}$ over the fixed support ${z_{i}}_{i = 0}^{N - 1}$ is constructed as $Z_{θ} (s, a) = i = 0 \sum N - 1 p_{i} (s, a) δ_{z_{i}}$ where $δ_{x}$ is a Dirac’s delta function at $x$ .

Projection

In a policy evaluation process, discrete distributions have a problem: $T Z_{θ}$ and $Z_{θ}$ have disjoint supports. Though, Wasserstein Metric in the original setup is robust to this issue, but in practice, KL-Divergence is used instead due to the distributional instability and differentiability. Therefore, the problem still matters.

The disjoint support problem is solved by projection.

Compute the distributional Bellman update: $\hat{Z}_{θ} \leftarrow r + γ Z_{θ} (s^{'}, a^{*})$ where $a^{*} = a argmax Q (s^{'}, a)$

Distribute the computed probability mass $\hat{Z}_{θ}$ to the neighboring bins in the support ${z_{i}}_{i = 0}^{N - 1}$ proportionally to the distance from each original point: $(Φ T Z_{θ}) (s, a)$ , where $Φ$ is the projection operator.

Loss Function

C51 uses the KL-Divergence as a loss function. $L_{s, a} (θ) = D_{K L} (Φ \hat{T} Z_{\hat{θ}} (s, a) ∣∣ Z_{θ} (s, a)))$ where $θ$ and $\hat{θ}$ are the parameters for behavior and target networks, respectively.

The loss function can be simplified to the Cross-Entropy Loss. $L (θ) = - i = 0 \sum N - 1 (Φ \hat{T} Z (s, a; \hat{θ}))_{i} ln p_{i} (s, a; θ)$

Algorithm

Initialize behavior network $Q$ and target network $\hat{Q}$ with random weights $θ$ , and the replay buffer $R$ to max size $N$ .

Repeat for each episode:

Initialize sequence $s_{0}$ .

Repeat for each step of an episode until terminal, $t = 0, 1, \dots, T - 1$ :

With probability $ϵ$ , select a random action $a_{t}$ otherwise select $a_{t} = a argmax Q (s_{t}, a; θ)$ .

Take the action $a_{t}$ and observe a reward $r_{t + 1}$ and a next state $s_{t + 1}$ .

Store transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ in the replay buffer $R$ .

Sample a random transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ from $R$ .

Compute $Q (s_{t + 1}, a) = i = 0 \sum N - 1 z_{i} p_{i} (s_{t + 1}, α; \hat{θ})$ and select a greedy action $a^{*} = a argmax Q (s_{t + 1}, a)$ .

Perform Gradient Descent on the loss $L (θ) = - i = 0 \sum N - 1 (Φ \hat{T} Z (s_{t + 1}, a^{*}; \hat{θ}))_{i} ln p_{i} (s_{t}, a_{t}; θ)$ .

Update the target network parameter $\hat{θ} \leftarrow θ$ every $C$ steps.

Update $s_{t} \leftarrow s_{t + 1}$ .

Link to original

QR-DQN

QR-DQN
Definition

The Quantile-Regression DQN (QR-DQN) algorithm is a Distributional Reinforcement Learning algorithm that approximates the distribution of random return using quantile regression.

Architecture

QR-DQN estimates a set of $N$ quantiles ${\overset{τ}{^}_{i}}_{i = 1}^{N}$ of return distribution, where $\overset{τ}{^}_{i} = \frac{2 i - 1}{2 N}$ represents the midpoint of $i$ -th quantile interval. This can be seen as adjusting the location of the supports of a uniform probability mass to approximate the desired quantile distribution.

The quantile distribution $Z$ with uniform probability $\frac{1}{N}$ is constructed as $Z (s, a) = \frac{1}{N} i = 1 \sum N (s, a) δ_{z_{i} (s, a)}$ where $δ_{x}$ is a Dirac’s delta function at $x$ , and ${z_{i} (s, a)}$ are the outputs of the network, representing the estimated quantile values. These values are obtained by applying the inverse CDF of the return distribution to the quantile midpoints $z_{i} (s, a) = F_{Z}^{- 1} (\overset{τ}{^}_{i})$ .

Using the estimated quantile values as a support minimizes the Wasserstein Distance between the true return distribution and the estimated distribution.

Quantile Regression

Given data set ${x_{j}}$ , a $τ$ -quantile $q_{τ}$ minimizes the loss $j = 1 \sum N ρ_{τ} (x_{j} - q_{τ})$ , where $ρ_{τ} (x) = x (τ - 1 (x < 0))$ is a quantile loss function.

The quantile values are estimated by minimizing the quantile Huber loss function. Given a transition $(s, a, r, s^{'})$ , the loss is defined as $L (θ) = i = 1 \sum N E_{j} [ρ_{\overset{τ}{^}_{i}}^{κ} (T z_{j} - z_{i} (s, a))] = \frac{1}{N} i = 1 \sum N j = 1 \sum N ρ_{\overset{τ}{^}_{i}}^{κ} (T z_{j} - z_{i} (s, a))$ Where:

$T z_{j} = r + γ z_{j} (s^{'}, a^{*})$ and $a^{*} = a argmax Q (s^{'}, a)$ .

$ρ_{τ}^{κ} (x) = L_{κ} (x) (τ - 1 (x < 0))$ where $L_{κ} (x)$ is Huber Loss.

Algorithm

Initialize behavior network $Q$ and target network $\hat{Q}$ with random weights $θ$ , and the sample size $N$ .

Repeat for each episode:

Initialize sequence $s_{0}$ .

Repeat for each step of an episode until terminal, $t = 0, 1, \dots, T - 1$ :

With probability $ϵ$ , select a random action $a_{t}$ otherwise select $a_{t} = a argmax Q (s_{t}, a; θ)$ .

Take the action $a_{t}$ and observe a reward $r_{t + 1}$ and a next state $s_{t + 1}$ .

Store transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ in the replay buffer $R$ .

Sample a random transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ from $R$ .

Select a greedy action $a^{*} = a argmax Q (s_{t + 1}, a) = a argmax \frac{1}{N} i = 0 \sum N - 1 z_{i} (s_{t + 1}, α; \hat{θ})$

Compute the target quantile values $T z_{j} (s_{t + 1}, a^{*}; \hat{θ}) = r_{t + 1} + γ z_{j} (s_{t + 1}, a^{*})$ .

Perform Gradient Descent on the loss $L (θ) = i = 1 \sum N E_{j} [ρ_{\overset{τ}{^}_{i}}^{κ} (T z_{j} (s_{t + 1}, a^{*}; \hat{θ}) - z_{i} (s_{t}, a_{t}; θ))]$ .

Update the target network parameter $\hat{θ} \leftarrow θ$ every $C$ steps.

Update $s_{t} \leftarrow s_{t + 1}$ .

Link to original

IQN

IQN
Definition

The Implicit Quantile Network (IQN) algorithm is a Distributional Reinforcement Learning algorithm that implicitly estimates the quantiles of a return distribution by learning a function that maps a quantile fraction to the corresponding quantile value.

Architecture

IQN estimates the quantile value $Z (s, a, τ) = F_{Z}^{- 1} (τ)$ for a given state $s$ , action $a$ , and a quantile fraction $τ \in [0, 1]$ . The output is the estimated quantile value corresponding to that fraction $τ$ .

The input state $s$ processed by the encoder layers of neural network to produce a state embedding vector $ϕ (s)$ .

The quantile fraction $τ$ is embedded into a higher-dimensional vector $ψ (τ)$ using a set of basis functions, where whose dimension is the same as the one of the state-feature $ϕ (s)$ . $ψ_{i} (τ) = cos (iπ τ)$ The two embeddings: state embedding and quantile-embedding, are combined using Hadamard Product $ϕ (s) ⊙ ψ (τ)$ .

The combined embedding is fed into the further layers $f_{θ}$ to predict the quantile value $Z (s, a, τ)$

In summary, the quantile value for a given state is estimated by $Z (s, a, τ; θ) \approx f_{θ} (ϕ (s), ψ (τ))$

By sampling different quantile fractions $τ$ , IQN can approximate the entire return distribution. The expectation of the return can be approximated by averaging the estimated quantile values over the sampled quantile fractions $Q (s, a; θ) \approx \frac{1}{N} i = 1 \sum N Z (s, a, τ_{i}; θ)$ where ${τ_{i}}_{i = 1}^{N}$ are sampled quantile fractions.

Loss Function

IQN uses a quantile Huber loss function, similar to QR-DQN. Given a transition $(s, a, r, s^{'})$ , the loss is defined as $L (θ) = i = 1 \sum N E_{τ^{'}} [ρ_{\overset{τ}{^}_{i}}^{κ} (T Z_{τ_{j}^{'}} - Z (s, a, τ_{i}; θ))]$ Where:

${\overset{τ}{^}_{i}}_{i = 1}^{N}$ and ${τ_{j}^{'}}_{j = 1}^{N^{'}}$ are quantile fractions for the current state-action pair and next state, respectively.

$T Z_{τ_{j}^{'}} = r + γ Z (s^{'}, a^{*}, τ_{j}^{'}; \hat{θ})$ and $a^{*} = a argmax \frac{1}{N ^{'}} j = 1 \sum N^{'} Z (s^{'}, a, τ_{j}^{'}; \hat{θ}) \approx a argmax Q (s^{'}, a; \hat{θ})$ .

$ρ_{τ}^{κ} (x) = L_{κ} (x) (τ - 1 (x < 0))$ where $L_{κ} (x)$ is Huber Loss.

Risk-Sensitive Policy

$β : [0, 1] \to [0, 1]$ The IQN has information about the entire distribution by estimating the distribution of returns. This property makes it suitable for implementing risk-sensitive policies, allowing for decision-making considering risk. Risk-sensitive policies focus on not just the expected return, but the variability or uncertainty in the returns. IQN makes it feasible by providing estimated quantiles of the return distribution.

The function $β : [0, 1] \to [0, 1]$ is used to define a distortion risk measure that focuses on specific parts of the return distribution. By applying $β$ to the quantile fraction $τ$ , we can re-wright each outcome or change the sampling distribution.

These are examples of distortion measures:

Cumulative Probability Weighting Parametrization (CPW): $CPW (η, τ) = \frac{τ ^{η}}{( τ ^{η} + ( 1 - τ ) ^{η} ) ^{1/ η}}$

Wang: $Wang (η, τ) = Φ (Φ^{- 1} (τ) + η)$ where $Φ$ is the CDF for standard normal distribution.

Power Formula: $Pow (η, τ) = {τ^{1/ (1 + ∣ η ∣)} 1 - (1 - τ)^{1/ (1 + ∣ η ∣)} η \geq 0 η < 0$

Conditional Value at Risk (CVaR): $CVaR (η, τ) = η τ$

Algorithm

Initialize behavior quantile network $Z$ and target quantile network $\hat{Z}$ with random weights $θ$ , the sample sizes $K, N, N^{'}$ , and a distortion measure $β$ .

Repeat for each episode:

Initialize sequence $s_{0}$ .

Repeat for each step of an episode until terminal, $t = 0, 1, \dots, T - 1$ :

Sample $K$ quantile fractions ${\tilde{τ}_{k} \sim β (\cdot)}_{k = 1}^{K}$ under the distortion measure $β$ , | and select a greedy action $a_{t} = a argmax \frac{1}{K} k = 1 \sum K Z (s_{t}, a, \tilde{τ}_{k}; θ) \approx a argmax Q (s_{t}, a; θ)$ .

Take the action $a_{t}$ and observe a reward $r_{t + 1}$ and a next state $s_{t + 1}$ .

Store transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ in the replay buffer $R$ .

Sample a random transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ from $R$ .

Sample $N^{'}$ quantile fractions ${τ_{j}^{'} \sim U (0, 1)}_{j = 1}^{N^{'}}$ for the next state and select a greedy action $a^{*} = a argmax \frac{1}{N ^{'}} j = 1 \sum N^{'} Z (s_{t + 1}, a, τ_{j}^{'}; \hat{θ}) \approx a argmax Q (s^{'}, a; \hat{θ})$ .

Compute the target quantile values $T Z_{τ_{j}^{'}} = r_{t + 1} + γ Z (s_{t + 1}, a^{*}, τ_{j}^{'}; \hat{θ})$ .

Perform Gradient Descent on the loss $L (θ) = i = 1 \sum N j = 1 \sum N^{'} ρ_{\overset{τ}{^}_{i}}^{κ} (T Z_{τ_{j}^{'}} - Z (s, a, \overset{τ}{^}_{i}; θ))$ .

Update the target network $\hat{Z}$ parameter $\hat{θ} \leftarrow θ$ every $C$ steps.

Update $s_{t} \leftarrow s_{t + 1}$ .

Link to original

My Knowledge Base

Explorer

Reinforcement Learning Note

Markov Decision Process

Markov Property

Definition

Markov Chain

Definition

Summary

Conditions and Properties of Markov Chain (Finite States)

Conditions and Properties of Markov Chain (Infinite States)

Transition

Definition

Transition Probability

Definition

Facts

Transition Probability Matrix

Definition

Facts

Markov Decision Processes

Definition

Facts

Reward

Definition

Expected Reward

Facts

Reward Hypothesis

Definition

Return

Definition

Policy

Definition

Deterministic Policy

Stochastic Policy

Optimal Policy

Better Policy

Facts

State-Value Function

Definition

Optimal State-Value Function

Action-Value Function

Definition

Optimal Action-Value Function

Advantage Function

Definition

Bellman Equation

Law of Total Probability

Definition

Law of Large Numbers

Definition

Weak Law of Large Numbers

Strong Law of Large Numbers

Bellman Expectation Equation

Definition

Bellman Equation for ﻿State-Value Function

Bellman Optimality Equation

Definition

Bellman Optimality Equation for ﻿State-Value Function

Dynamic Programming

Dynamic Programming

Definition

Algorithms

Value Iteration

Definition

Algorithm

Policy Improvement Theorem

Definition

Policy Iteration

Definition

Policy Evaluation

Policy Improvement

Algorithm

Examples

Reinforcement Learning

Generalized Policy Iteration

Definition

Monte Carlo Method

Definition

MC Prediction (Policy Evaluation)

Empirical Mean

Bellman Equation for State-Value Function

Bellman Optimality Equation for State-Value Function

Constant- $α$ MC Policy Evaluation

MC Control ( $ϵ$ -Greedy Policy Improvement)