Markov Decision Process

Markov Property

Definition

Markov property refers to the memoryless property of a Stochastic Process.

Consider a Probability Space $(\Omega, \mathcal{F}, P)$ with a Filtration $(\mathcal{F}_{t},\ t \in T)$, and a Measurable Space $(S, \mathcal{S})$. A Stochastic Process $X = (X_{t},\ t \in T)$ adapted to the filtration is said to possess the Markov property if $P(X_{t} \in A \mid \mathcal{F}_{s}) = P(X_{t} \in A \mid X_{s})$ for each $A \in \mathcal{S}$ and each $s, t \in T$ with $s < t$. Here, $\mathcal{F}_{s}$ in the LHS represents all information up to time $s$, whereas the RHS conditions only on the single value $X_{s}$ at time $s$.

In the case where $S$ is a discrete set and $T = \{0, 1, 2, \dots\}$, this can be reformulated as follows: $P(X_{n}=x_{n} \mid X_{n-1}=x_{n-1}, \dots, X_{0}=x_{0}) = P(X_{n}=x_{n} \mid X_{n-1}=x_{n-1})$.

Link to original

Markov Chain

Definition

A Markov chain is a Stochastic Process that satisfies the Markov Property. It consists of a set of states and a Transition Probability Matrix.
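As an illustrative sketch (the three-state chain below is made up, not taken from this note), a finite Markov chain can be simulated directly from a row-stochastic transition probability matrix:

```python
import numpy as np

# Hypothetical 3-state chain; row i holds the distribution of the next state given state i.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.2, 0.8],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row of a transition matrix sums to one

rng = np.random.default_rng(0)

def simulate(P, s0=0, steps=10):
    """Sample a trajectory; by the Markov property only the current state matters."""
    s, path = s0, [s0]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])
        path.append(int(s))
    return path

print(simulate(P))
```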

Summary

Conditions and Properties of Markov Chain (Finite States)

| Example | Finite | IRR | Aperiodicity | A unique Limiting Distribution | A unique Stationary Distribution | Ergodic Theorem | Ergodicity |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | O | X | X | X | X | X | X |
| Identity/Block matrix | O | X | O | X | X | X | X |
| Exchange matrix | O | O | X | X | O | O | X |
| Ergodic Markov Chain | O | O | O | O | O | O | O |
flowchart LR
  mc[HMC] --> finite
  
  subgraph finiteness
    finite([finite])
  end
  
  finite --> finite_mc
  finite_mc --> irreducible
  finite_mc --> reducible
  
  subgraph Irreducibility
    irreducible([irreducible])
    reducible([reducible])
  end
  irreducible --> irr_mc[Irreducible MC]
  reducible -->|divide|irr_mc

  irr_mc ---|holds|egt([Ergodic Theorem])
  irr_mc --> aperiodic
  
  subgraph Aperiodicity
    aperiodic([aperiodic])
  end
  
  aperiodic --> ergmc[Ergodic MC]

Conditions and Properties of Markov Chain (Infinite States)

| IRR | Nature | Aperiodicity | A unique Limiting Distribution | A unique Stationary Distribution | Ergodic Theorem | Ergodicity |
| --- | --- | --- | --- | --- | --- | --- |
| O | Transient | can't define | X | X | X | X |
| O | Null recurrent | no matter | X | X | X | X |
| O | Positive recurrent | X | X | O | O | X |
| O | Positive recurrent | O | O | O | O | O |
flowchart LR
  mc[HMC]
  mc --> irreducible
  mc --> reducible
  
  subgraph Irreducibility
    irreducible([irreducible])
    reducible([reducible])
  end
  
  irr_mc[Irreducible MC]
  irreducible --> irr_mc
  reducible -->|divide|irr_mc
  
  irr_mc --> pr
  irr_mc --> nr
  irr_mc --> tr
  
  subgraph Recurrence
    pr([positive recurrent])
    nr([null recurrent])
    tr([transient])
  end
    
  pr_mc[Recurrent MC]
  pr --> pr_mc
  pr_mc --> aperiodic

  egt([Ergodic Theorem])
  pr_mc ---|holds|egt

  subgraph Aperiodicity
    aperiodic([aperiodic])
  end
  
  aperiodic --> ergmc[Ergodic MC]
Link to original

Transition

Definition

where is the distribution of states and is Transition Matrix

Link to original

Transition Probability

Definition

where is an HMC with a countable state space

The conditional probability of the Transition from state to

Facts

Link to original

Transition Probability Matrix

Definition

A transition probability matrix is a matrix whose -th element is , where each entry is a Transition Probability of an HMC with a countable state space.

Facts

Since the number of elements of the state space can be infinite, strictly speaking it is not a matrix. However, several matrix operations are still satisfied for it.

The sum of each row of a transition matrix is one

Link to original

Markov Decision Processes

Definition

Example of a simple MDP with three states (green circles) and two actions (orange circles), with two rewards (orange arrows)

A Markov decision process (MDP) is a tuple , where:

  • : A set of states (State space)
  • : A set of actions (Action space)
  • : A Transition Probability from to given .
  • : The immediate Reward received after transitioning from state to due to action .

Facts

The environment model in an MDP is the Transition Probability. If the Transition Probability is known, the setting is called model-based; otherwise it is called model-free.

Any Markov decision process satisfies the following:

Link to original

Reward

Definition

Reward is a scalar feedback indicating how well the agent is doing at step .

There are three reward types:

  • : the reward in the state .
  • : the reward by the action in the state .
  • : the reward received after transitioning from state to due to action .

Expected Reward

Under known dynamics of all transitions , one can compute the following:

  • State transition probability

  • Expected reward for (state, action) pair

  • Expected reward for (state, action, next state) triple

Facts

Reinforcement learning is based on the Reward Hypothesis.

Link to original

Reward Hypothesis

Definition

All goals can be described by the maximization of the expected value of the cumulative sum of rewards.

Link to original

Return

Definition

where is a discount factor

Return is the total discounted Reward from time step .
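A tiny sketch (the reward sequence and discount factor are made up) of computing the discounted return from a list of rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... ; accumulate backwards for simplicity."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```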

Link to original

Policy

Definition

Deterministic Policy

A deterministic policy is a mapping from state space to action space .

Stochastic Policy

A stochastic policy is a probability distribution over actions for a given state .

Optimal Policy

A policy maximizing the cumulative sum of rewards is called an optimal policy.

Better Policy

Consider two policies and . is a better policy than if

Facts

  • Under a known MDP, there exists a deterministic optimal policy .
  • Under an unknown MDP, a stochastic policy is needed.
Link to original

State-Value Function

Definition

where is an Action-Value Function.

A state-value function is the expected discounted Return starting in state under policy .

Optimal State-Value Function

An optimal state-value function is the maximum possible state-value function (a state-value function under optimal policy).

Link to original

Action-Value Function

Definition

An action-value function is the expected discounted Return of taking action starting in state under policy .

Optimal Action-Value Function

An optimal action-value function is the maximum possible action-value function (an action-value function under optimal policy).

Link to original

Advantage Function

Definition

An advantage function is the difference between the Action-Value Function and the State-Value Function under policy . It represents how much better it is to take action in state compared to the average action.

Link to original

Bellman Equation

Law of Total Probability

Definition

Suppose is a set of mutually exclusive and collectively exhaustive events; then for any event :

Link to original

Law of Large Numbers

Definition

Weak Law of Large Numbers

Consider a Sequence of i.i.d. random variables with mean , variance . Let be the sample average. Then, the sample average converges in probability to the expected value.

Strong Law of Large Numbers

Consider a Sequence of i.i.d. random variables with mean , variance . Let be the sample average. Then, the sample average converges almost surely to the expected value.

Link to original

Bellman Expectation Equation

Definition

Bellman expectation equation is a recursive equation decomposing State-Value Function into immediate Reward and discounted future state-value .

Bellman Equation for State-Value Function

$$\begin{aligned}
v_{\pi}(s) &= E_{\pi}[G_{t} | S_{t}=s]\\
&= \sum\limits_{a} P(A_{t}=a|S_{t}=s)\cdot E_{\pi}[G_{t} | S_{t}=s, A_{t}=a]\quad \left( = \sum\limits_{a} \pi(a|s) q_{\pi}(s, a) \right)\\
&= \sum\limits_{a} \pi(a|s) \cdot E_{\pi}[R_{t+1} + \gamma G_{t+1} | S_{t}=s, A_{t}=a]\\
&= \sum\limits_{a} \pi(a|s) \cdot \sum\limits_{s', r} E_{\pi}[R_{t+1} + \gamma G_{t+1} | S_{t}=s, A_{t}=a, S_{t+1}=s', R_{t+1}=r]\cdot P(S_{t+1}=s', R_{t+1}=r | S_{t}=s, A_{t}=a)\\
&= \sum\limits_{a} \pi(a|s) \cdot \sum\limits_{s', r} p(s', r | s, a)[r + \gamma E_{\pi}[G_{t+1} | S_{t+1}=s']]\quad \text{(by Markov property)}\\
&= \sum\limits_{a} \pi(a|s) \cdot \sum\limits_{s', r} p(s', r | s, a)[r + \gamma v_{\pi}(s')]\quad \left( =\sum\limits_{a} \pi(a|s)[R_{s}^{a} + \gamma \sum\limits_{s'}P_{ss'}^{a} v_{\pi}(s')] \right)\\
&= E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_{t}=s]
\end{aligned}$$

Bellman Equation for Action-Value Function

$$\begin{aligned}
q_{\pi}(s, a) &= E_\pi[G_{t} | S_{t}=s, A_{t}=a]\\
&= \sum\limits_{s', r}p(s', r | s, a)[r + \gamma \underbrace{\sum\limits_{a'} \pi(a'|s') q_{\pi}(s', a')}_{v_{\pi}(s')}]\quad \left( = R_{s}^{a} + \gamma \sum\limits_{s'} P_{ss'}^{a} \underbrace{\sum\limits_{a'} \pi(a'|s') q_{\pi}(s', a')}_{v_{\pi}(s')} \right)\\
&= E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_{t}=s, A_{t}=a]
\end{aligned}$$

Link to original

Bellman Optimality Equation

Definition

Under model-based (known MDP), the optimal value functions can be iteratively evaluated by dynamic programming.

Bellman Optimality Equation for State-Value Function

$$\begin{aligned}
v_{*}(s) &= \max_{a} q_{*}(s, a)\\
&= \max_{a} E_{\pi_{*}}[G_{t} | S_{t}=s, A_{t}=a]\\
&= \max_{a} E_{\pi_{*}}[R_{t+1} + \gamma G_{t+1} | S_{t}=s, A_{t}=a]\\
&= \max_{a} E_{s',r}[R_{t+1} + \gamma v_{*}(S_{t+1}) | S_{t}=s, A_{t}=a]\\
&= \max_{a} \sum\limits_{s',r}p(s', r | s, a) [r + \gamma v_{*}(s')]\quad \left( =\max_{a} [R_{s}^{a} + \gamma\sum\limits_{s'}P_{ss'}^{a} v_{*}(s')] \right)
\end{aligned}$$

Bellman Optimality Equation for Action-Value Function

$$\begin{aligned}
q_{*}(s, a) &= E_{s',r}[R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') | S_{t}=s, A_{t}=a]\\
&= \sum\limits_{s',r}p(s',r | s, a)[r + \gamma \max_{a'} q_{*}(s', a')]\quad \left(= R_{s}^{a} + \gamma \sum\limits_{s'}P_{ss'}^{a} \max_{a'} q_{*}(s', a') \right)
\end{aligned}$$

Link to original

Dynamic Programming

Dynamic Programming

Definition

Dynamic programming is a method for solving complex problems by breaking them down into simpler sub-problems in a recursive manner.

Dynamic programming works when a problem has the following properties:

  • Optimal substructure: An optimal solution can be constructed from optimal solutions of its sub-problems.
  • Overlapping sub-problems: Solutions of the same sub-problems are reused several times.

Algorithms

Value Iteration

Policy Iteration

Link to original

Value Iteration

Definition

Value iteration iteratively updates the state-value function until convergence. Its time complexity is , where and are the numbers of states and actions, respectively. A minimal code sketch follows the algorithm below.

Algorithm

  1. Initialize
  2. Update iteratively from all (full backup) until convergence to .
    • Synchronous backups: compute for all and update simultaneously.
    • Asynchronous backups: compute for one and update it immediately.
  3. Compute the optimal policy (one-step lookahead) and return it.
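A minimal value-iteration sketch, assuming the MDP is given as arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected rewards); the array layout and the toy MDP are assumptions for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Synchronous full backups: v(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) v(s')]."""
    v = np.zeros(P.shape[0])                # 1. initialize v
    while True:                             # 2. update until convergence
        q = R + gamma * P @ v               # q has shape (n_states, n_actions)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # 3. one-step-lookahead greedy policy
        v = v_new

# Toy 2-state, 2-action MDP (made up): action 1 in state 0 tends to reach the rewarding state 1.
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
print(value_iteration(P, R))
```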
Link to original

Policy Improvement Theorem

Definition

Consider two policies and . This implies that is a better policy than .

Link to original

Policy Iteration

Definition

Policy iteration repeats policy evaluation and policy improvement until convergence. Its time complexity is , where and are the numbers of states and actions, respectively.

Compared to Value Iteration, policy iteration requires far fewer iterations to reach the optimal policy. A minimal code sketch follows the algorithm below.

Policy Evaluation

Compute from the deterministic policy .

Policy Improvement

Improve to by greedy policy based on .

Since , by the Policy Improvement Theorem the improved policy is always either strictly better than , or is already optimal when .

Algorithm

  1. Initialize , and arbitrarily.
  2. Update every from all (full backup) until convergence to .
  3. Improve by greedy policy based on .
  4. If policy is stable, then stop and return and .
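A minimal policy-iteration sketch under the same assumed tabular arrays `P[s, a, s']` and `R[s, a]`; here policy evaluation solves the linear Bellman system exactly, which is one common variant:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9):
    """Solve v = R_pi + gamma * P_pi v exactly for a deterministic policy pi."""
    n = len(pi)
    P_pi = P[np.arange(n), pi]      # (S, S') transition matrix under pi
    R_pi = R[np.arange(n), pi]      # (S,)   expected reward under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def policy_iteration(P, R, gamma=0.9):
    pi = np.zeros(P.shape[0], dtype=int)             # 1. arbitrary initial policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)       # 2. policy evaluation
        pi_new = (R + gamma * P @ v).argmax(axis=1)  # 3. greedy policy improvement
        if np.array_equal(pi_new, pi):               # 4. stable policy -> optimal
            return v, pi
        pi = pi_new
```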

Examples

Link to original

Reinforcement Learning

Generalized Policy Iteration

Definition

Generalized policy iteration repeatedly approximates the value function toward the true value of the current policy (sample backup) and repeatedly improves the policy to approach optimality.

Link to original

Monte Carlo Method

Definition

The Monte Carlo method (MC) uses tabular updates and is model-free. MC policy iteration adapts GPI with episode-by-episode (sample multi-step backup) policy evaluation estimating and -greedy policy improvement.

MC Prediction (Policy Evaluation)

Learn from entire episodes of real experience under policy .

Empirical Mean

MC policy evaluation uses the empirical mean return instead of the expected return, where is the incremental total return and is the incremental count.

The mean return converges to the true action-value function as the incremental count increases.

Incremental MC Update

The incremental mean can be expressed in a recursive manner:

Thus, for each state-action pair with return , after one episode, the mean return can be calculated as:

Constant- MC Policy Evaluation

It replaces the weights used for incremental MC with a constant step-size .
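As a one-line sketch (variable names assumed), the constant-α update nudges the current estimate toward the newly sampled return:

```python
def constant_alpha_update(q, G, alpha=0.1):
    """Q <- Q + alpha * (G - Q): an exponential recency-weighted average of sampled returns."""
    return q + alpha * (G - q)
```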

MC Control (-Greedy Policy Improvement)

Choose the greedy action with probability and a random action with probability .
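A small sketch of ε-greedy action selection over one row of a tabular Q (names are illustrative):

```python
import numpy as np

def epsilon_greedy(q_row, epsilon=0.1, rng=np.random.default_rng()):
    """Explore uniformly with probability epsilon, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # random action
    return int(np.argmax(q_row))              # greedy action
```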

For any -greedy policy , the -greedy policy with respect to is always an improvement, i.e. .

-greedy policy is GLIE if e.g.

Algorithm

  1. Initialize arbitrarily, terminal , and an arbitrary -soft policy (non-zero probability for every action).
  2. Repeat for each episode:
    1. Generate an episode following the policy : .
    2. Repeat for each step of an episode, (MC Prediction):
      1. If is a first visit (), then where can be either incremental MC or constant- MC.
    3. For each visited state in the episode (MC Control):
  3. If policy is stable, then stop and return .

Facts

MC is less harmed by violations of Markov Property because it updates value estimates based on true Return (not based on value estimates of successor states).

Link to original

Temporal Difference Learning

GLIE policy

Definition

A learning Policy is called GLIE (Greedy in the Limit with Infinite Exploration) if it satisfies:

  • All state-action pairs are explored infinitely many times, where is the incremental count of a state-action pair.
  • The learning policy converges to a greedy policy.
Link to original

Temporal Difference Learning

Temporal Difference Learning (TD) uses tabular updates and is model-free. TD policy iteration adapts GPI based on one-step transitions of sample episodes.

Algorithms

Sarsa

Q-Learning

Examples

Sarsa allows for potential penalties from exploration moves, which tends to make it avoid a dangerous optimal path and prefer a slower but safer path. In contrast, Q-Learning ignores these penalties and takes action based on the highest action value, which can occasionally result in falling off due to -greedy action.

Link to original

Sarsa

Definition

Sarsa learns by using the Bellman Expectation Equation. Both the behavior and target policies of Sarsa are derived from , so it is an on-policy method. A minimal sketch follows the algorithm below.

Algorithm

  1. Initialize arbitrarily, and .
  2. Repeat for each episode:
    1. Initialize .
    2. Choose an action from the initial state using policy derived from (e.g. -greedy).
    3. Repeat for each step of an episode until is terminal:
      1. Take the action and observe a reward and a next state .
      2. Choose an action from the next state using policy derived from (e.g. -greedy).
      3. Update and
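A minimal tabular Sarsa sketch. It assumes a Gymnasium-style environment interface (`env.reset()` returning `(state, info)` and `env.step(a)` returning `(next_state, reward, terminated, truncated, info)`), which is an assumption for illustration, not part of the note:

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        return int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())

    for _ in range(episodes):
        s, _ = env.reset()
        a = eps_greedy(s)                   # behavior policy == target policy (on-policy)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = eps_greedy(s2)             # next action drawn from the same epsilon-greedy policy
            target = r + gamma * Q[s2, a2] * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```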
Link to original

Q-Learning

Definition

Q-learning directly learns by using the Bellman Optimality Equation. The behavior policy of Q-learning is derived from while the target policy is , so it is an off-policy method. A minimal sketch follows the algorithm below.

Algorithm

  1. Initialize , arbitrarily, and .
  2. Repeat for each episode:
    1. Initialize .
    2. Repeat for each step of an episode until is terminal:
      1. Choose an action from the current state using a policy derived from (e.g. -greedy).
      2. Take the action and observe a reward and a next state .
      3. Update
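A minimal tabular Q-learning sketch under the same assumed Gymnasium-style interface; the only change from the Sarsa sketch is that the target uses the greedy max over next actions while behavior stays ε-greedy (off-policy):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy in Q
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # target policy: greedy, so the backup uses max_a' Q(s', a')
            target = r + gamma * Q[s2].max() * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```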
Link to original

Importance Sampling

Definition

Importance sampling is used when we want to estimate the expectation of a function under a probability distribution , but sampling directly from that distribution is either inefficient or impossible, where:

  • is the target distribution
  • is the proposal distribution, whose support includes the support of the target distribution (). The proposal distribution needs to be easy to evaluate and sample from. Also, to reduce the variance of the estimate, it should be high where is high.
  • is the importance weight.
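A small numerical sketch (both distributions are made-up Gaussians) estimating an expectation under a target density p using samples from a proposal q:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=100_000)               # sample from the proposal q = N(0, 2)
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)  # importance weights p(x) / q(x)
print(np.mean(w * x**2))                             # estimates E_p[X^2] = 1 for p = N(0, 1)
```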
Link to original

Off-Policy Control

Definition

Off-policy control uses two different policies: a target policy and a behavior policy . The target policy may be deterministic while the behavior policy is stochastic. This enables the agent to learn the optimal policy while maintaining exploration.

Almost all off-policy methods utilize Importance Sampling. The Return obtained under the target policy is weighted according to the relative probability of the trajectories under the target and behavior policies, called the importance-sampling ratio (importance weight).

Given a starting state , the probability of the state-action trajectory occurring under a policy is written as where is the state-transition probability.

Thus, the relative probability of the trajectory under the target and behavior policies is

Algorithms

Off-policy MC

where is the importance weight.

The variance of the return in the Monte Carlo method can be dramatically higher.

Off-policy Sarsa

The variance of the return in off-policy TD is lower than that of MC.

Link to original

Expected Sarsa

Definition

$$\begin{aligned}
Q(S_{t}, A_{t}) &\leftarrow Q(S_{t}, A_{t}) + \alpha [R_{t+1} + \gamma E_{\pi}[Q(S_{t+1}, A_{t+1})|S_{t+1}] - Q(S_{t}, A_{t})]\\
&\left(= Q(S_{t}, A_{t}) + \alpha [R_{t+1} + \gamma \sum\limits_{a \in \mathcal{A}(S_{t+1})} Q(S_{t+1}, a)\pi(a | S_{t+1}) - Q(S_{t}, A_{t})]\right)
\end{aligned}$$

Expected Sarsa uses the expected value over the next state-action pairs, so all actions viable in the state are considered. Expected Sarsa performs better than Sarsa in terms of the variance of its Return but costs more.

Link to original

Double Q-Learning

Definition

Q-Learning takes the action with the highest value, which tends to overestimate the value of actions (maximization bias) in the early stages of learning, slowing the convergence to .

Double Q-learning uses two Q-value functions to overcome this problem: one for action selection and another for value evaluation.

Algorithm

  1. Initialize , , and arbitrarily, and and .
  2. Repeat for each episode:
    1. Initialize .
    2. Repeat for each step of an episode until is terminal:
      1. Choose an action from the initial state using policy derived from and (e.g. -greedy in ).
      2. Take the action and observe a reward and a next state .
      3. With probability : else:
      4. Update

Examples

Q-Learning initially learns to take the left action much more often than the right action despite the lower true state-value of the left. Even at asymptote, it takes the left action about 5% more often than is optimal. In contrast, double Q-learning is unaffected by maximization bias.

Link to original

n-Step Bootstrapping

n-Step Return

Definition

$$\begin{aligned}
G_{t:t+n} :&= \sum\limits_{k=0}^{n-1}\gamma^{k} R_{t+k+1} + \gamma^{n}V(S_{t+n}) \\
&= R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n}V(S_{t+n})
\end{aligned}$$

n-step returns can be considered approximations to the full Return, truncated after $n$ steps and then corrected for the remaining missing terms by the state-value estimator $V(S_{t+n})$.

Examples

  • $G_{t:t+1}$: (one-step) TD target
  • $G_{t:t+n}$: n-Step TD target
  • $G_{t:\infty}$: MC target

Link to original

n-Step TD

Definition

where is the n-Step Return.

The n-step TD learning is a method that bridges the gap between TD and MC methods.

Examples

Larger reduces bias but increases variance.

Link to original

n-Step Sarsa

Definition

where is the n-Step Return.

Link to original

n-step Off-Policy Learning

Definition

The relative probability of the n-step trajectory under the target and behavior policies is

Algorithms

Off-policy n-Step TD

where is the importance weight.

Off-policy n-Step Sarsa

where is the importance weight, and .

Link to original

Eligibility Traces

lambda-Return

Definition

The -return is a way to combine -step returns for different ‘s using a weighted average.

$$\begin{aligned}
G_{t}^{\lambda} :&= (1-\lambda) \sum\limits_{n=1}^{\infty}\lambda^{n-1} G_{t:t+n}\\
&= (1-\lambda) \sum\limits_{n=1}^{T-t-1}\lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1}G_{t}
\end{aligned}$$

where $G_{t:t+n}$ is the n-Step Return.

Examples

  • $\lambda=0$: TD target
  • $\lambda=1$: MC return
  • $0 < \lambda < 1$: creates a smooth blend of all $n$-step returns.

Link to original

TD(lambda)

Definition

TD() is a TD algorithm that uses the lambda-Return for value-function updates, where is the lambda-Return.

Examples

Link to original

Deep Reinforcement Learning

DQNs

Deep Q-Network

Definition


DQN handles a large or continuous state space and a discrete action space.

Naive DQN

Naive DQN treats as a target and minimizes MSE loss by SGD.

Due to the training instability and correlated samples, the Naive DQN has very poor results, even worse than a linear model.

Experience Replay

Online RL agents incrementally update their parameters while observing a stream of experience. This causes strongly temporally-correlated updates, breaking the i.i.d. assumption, and rare experiences that would be useful later are rapidly forgotten. Experience replay stores experiences in a replay buffer and randomly samples temporally uncorrelated minibatches from it when learning.
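A minimal replay-buffer sketch (the class and method names are assumptions for illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and returns temporally uncorrelated random minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # uniform sampling breaks correlation
        return list(zip(*batch))                        # columns: states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```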

Target Network

If the target function changes too frequently, the moving target makes training difficult (the non-stationary target problem). The target network technique updates the parameters of the behavior Q-network at every step, while updating the parameters of the target Q-network only sporadically (e.g. every steps), where is the size of a minibatch and is a -sized index set drawn from the replay buffer.

Algorithm

  1. Initialize behavior network and target network with random weights , and the replay buffer to max size .
  2. Repeat for each episode:
    1. Initialize sequence .
    2. Repeat for each step of an episode until terminal, :
      1. With probability , select a random action otherwise select .
      2. Take the action and observe a reward and a next state .
      3. Store transition in the replay buffer .
      4. Sample random minibatch of transitions from .
      5. .
      6. Perform Gradient Descent on loss .
      7. Update the target network parameter every steps.
      8. Update .
Link to original

Double DQN

Definition

Double DQN is a variation of DQN that uses the idea of Double Q-Learning to reduce overestimation. In this case, the target network in DQN works as the second network for double Q-learning without introducing an additional network.

Algorithm

  1. Initialize behavior network and target network with random weights , and the replay buffer to max size .
  2. Repeat for each episode:
    1. Initialize sequence .
    2. Repeat for each step of an episode until terminal, :
      1. With probability , select a random action otherwise select .
      2. Take the action and observe a reward and a next state .
      3. Store transition in the replay buffer .
      4. Sample random minibatch of transitions from .
      5. .
      6. Perform Gradient Descent on loss .
      7. Update the target network parameter every steps.
      8. Update .
Link to original

TD Error

Definition

The TD error is the difference between the target and the current prediction.

TD error for State-Value Function

TD error for SARSA

TD error for Q-Learning

Link to original

Prioritized Replay

Definition

Prioritized replay is an enhancement to the standard experience replay. Instead of uniformly sampling experiences from the replay buffer, prioritized replay samples important transitions more frequently based on their priority values. It makes it possible to learn more efficiently as the agent focuses on important experiences more.

The priority of transition is calculated by TD Error. where is the TD Error, and .

The probability of sampling transition is where controls how much prioritization is used (uniform sampling if ).

Prioritized sampling can lead to a loss of diversity and introduce bias, but these issues can be alleviated and corrected with stochastic sampling prioritization (using ) and Importance Sampling weights.

The Importance Sampling weights are calculated as where

  • is the size of the replay buffer
  • is an annealing parameter that fully compensates for the bias when . It starts from and ends with .

The weight is normalized by before use, for stability.
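A small sketch of how sampling probabilities and importance-sampling weights could be computed from stored priorities (the priority values, α, and β below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

priorities = np.array([2.0, 0.5, 1.0, 0.1])   # e.g. |TD error| plus a small constant
alpha, beta = 0.6, 0.4                        # prioritization strength, bias-correction anneal

probs = priorities ** alpha
probs /= probs.sum()                          # P(i) proportional to p_i^alpha

idx = rng.choice(len(probs), size=2, p=probs) # prioritized minibatch indices

N = len(probs)
weights = (N * probs[idx]) ** (-beta)         # importance-sampling correction
weights /= weights.max()                      # normalize by the max weight for stability
print(idx, weights)
```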

Algorithm

Double DQN with Prioritized Replay

  1. Initialize behavior network and target network with random weights , and the replay buffer to max size .
  2. Repeat for each episode:
    1. Initialize sequence .
    2. Repeat for each step of an episode until terminal, :
      1. With probability , select a random action otherwise select .
      2. Take the action and observe a reward and a next state .
      3. Store transition in the replay buffer with maximal priority .
      4. For every (replay period) steps:
        1. Sample random minibatch of transitions from .
        2. Compute the importance weight
        3. Compute TD Error
        4. Update transition priority
        5. Accumulate weight-change
      5. Update behavior network weights and reset
      6. Update the target network parameter every steps.
      7. Update .
Link to original

Dueling DQN

Definition

Dueling DQN has two streams, an advantage stream and an action-independent state-value stream , which share a feature encoder (CNN) and are combined by an aggregator to produce the action-value .

The aggregating module is unidentifiable in the sense that, given a Q-value function, there are multiple possible decompositions into value and advantage functions ( where is a constant value). This makes the learning process unstable and less efficient. To force a unique decomposition, a constraint is introduced that makes the Advantage Function zero-mean, as in the sketch below.
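A short sketch of the zero-mean aggregation for a batch of value and advantage estimates (array shapes are assumptions):

```python
import numpy as np

def dueling_aggregate(v, adv):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a); subtracting the mean fixes the decomposition."""
    return v[:, None] + adv - adv.mean(axis=1, keepdims=True)

v = np.array([1.0])                  # V(s) for one state
adv = np.array([[0.5, -0.5, 0.0]])   # A(s, a) for three actions
print(dueling_aggregate(v, adv))     # [[1.5  0.5  1. ]]
```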

Link to original

REINFORCE

Policy Gradient

Definition

Policy gradient algorithms directly learn the optimal policy by a parametric probability distribution . The policy stochastically selects an action in a state according to a parameter . It typically proceeds by sampling a stochastic policy and adjusting the parameter in the direction of maximizing the total reward.

The objective function of policy gradient algorithm is defined as: where is a trajectory, is the total reward of the trajectory , and is the PDF of .

Link to original

Policy Gradient Theorem

Definition

The derivative of the expected total reward is the expectation of the product of total rewards and summed gradients of log of the policy .

Proof

Link to original

REINFORCE

Definition

The REINFORCE algorithm is a policy gradient algorithm that maximizes the expected return. Its objective function is based on the Policy Gradient Theorem: it substitutes the expectation and the total reward of the Policy Gradient with sample averages and returns . A minimal sketch follows the algorithm below.

Algorithm

  1. Execute trajectories (Each starts from a state under the policy ).
  2. Approximate the gradient of the objective function
  3. Update policy to maximize where is a learning rate.
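A minimal sketch of the REINFORCE gradient estimate for a softmax policy on a made-up two-armed bandit (episodes of length one); everything here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                   # logits of a softmax policy over two arms
true_means = np.array([0.0, 1.0])     # hypothetical bandit: arm 1 pays more on average
lr = 0.05

for _ in range(3000):
    probs = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=probs)                # sample a (one-step) trajectory
    R = rng.normal(true_means[a], 0.1)        # total reward of the trajectory
    grad_log_pi = -probs                      # for softmax: d log pi(a) / d theta = e_a - pi
    grad_log_pi[a] += 1.0
    theta += lr * R * grad_log_pi             # stochastic gradient ascent on J(theta)

print(np.exp(theta) / np.exp(theta).sum())    # most probability mass ends up on arm 1
```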
Link to original

REINFORCE with Baseline

Definition

The REINFORCE with Baseline algorithm is a variant of REINFORCE that reduces the variance of Policy Gradient methods by adopting the Actor-Critic Method. It modifies the objective function of the REINFORCE algorithm by subtracting a baseline from the returns, which reduces the variance of the gradient without introducing bias, where is a baseline function that does not depend on (commonly chosen as the State-Value Function ).

Actor (Policy Gradient Update)

Critic (Value Network Update)

Algorithm

  1. Initialize state-value and policy networks randomly.
  2. Set the hyperparameters: step-sizes , and discount factor
  3. Repeat for each episode (Each starts from a state under the policy ):
    1. Repeat for each step of an episode until terminal, :
      1. (minimize )
      2. (maximize with the Policy Gradient )
Link to original

Actor-Critics

Actor-Critic Method

Definition

Actor-Critic method consists of two networks: actor and critic networks.

Actor network updates parameter for policy by maximizing using Policy Gradient

Critic network updates parameter for value function by minimizing

Examples

REINFORCE with Baseline

Actor-Critic Method with TD(0) Return

Asynchronous Advantage Actor-Critic Method

Deterministic Policy Gradient

Deep Deterministic Policy Gradient

Link to original

Actor-Critic Method with TD(0) Return

Definition

Actor (Policy Gradient Update)

Critic (Value Network Update)

Algorithm

  1. Initialize critic and actor networks randomly.
  2. Set the hyperparameters: step-sizes , and discount factor
  3. Repeat for each episode (Each starts from a state under the policy ):
    1. Repeat for each step of an episode until terminal, :
      1. Select action according to policy .
      2. Take the action and observe a reward and a next state .
Link to original

Asynchronous Advantage Actor-Critic Method

Definition

A3C (Asynchronous Advantage Actor-Critic) is an Actor-Critic Method that uses a global network and multiple worker agents operating in parallel across multiple instances of the environment; the agents asynchronously update the global network parameters. The parallelism reduces each agent's temporal correlation. The Advantage Function is estimated using the n-Step Return instead of the Action-Value Function .

Actor (Policy Gradient Update)

where is an Entropy that encourages exploration

Critic (Value Network Update)

Algorithm

For each worker agent

  1. Let be global network’s shared parameters, and let be a worker agent’s parameters.
  2. Set the hyperparameters: step-sizes , discount factor , regularization factor , maximum steps per update .
  3. Repeat:
    1. Reset gradients and .
    2. Synchronize parameters and .
    3. Set and get state .
    4. For :
      1. Select action according to policy .
      2. Take the action and observe a reward and a next state .
    5. (or for terminal )
    6. For :
      1. .
      2. Accumulate gradients with respect to
      3. Accumulate gradients with respect to
    7. Update asynchronously and .
Link to original

DPGs

Deterministic Policy Gradient

Definition

Deterministic Policy Gradient (DPG) learns a deterministic policy as an actor on continuous action spaces and an Action-Value Function as a critic.

DPG requires fewer samples to approximate the gradient than stochastic Policy Gradient because DPG updates the parameter only over the state space, according to the Deterministic Policy Gradient Theorem.

Link to original

State Visitation Frequency

Definition

State visitation frequency is the discounted sum of probabilities of visiting a given state under policy . where is the initial and is the visitation probability from to in step following .

can be considered as a Distribution Function over the state space.

Link to original

Deterministic Policy Gradient Theorem

Definition

$$\begin{aligned}
\nabla_{\theta}J(\theta) &= \nabla_{\theta} E_{s\sim \rho_{\mu}}[Q(s, a)|_{a=\mu(s;\theta)}]\\
&= \int_{S} \rho_{\mu}(s)\nabla_{a}Q(s, a)|_{a=\mu(s;\theta)}\nabla_{\theta}\mu(s;\theta)ds\quad\text{(by chain rule)}\\
&= E_{s\sim\rho_{\mu}}[\nabla_{a}Q(s, a)|_{a=\mu(s;\theta)}\nabla_{\theta}\mu(s;\theta)]
\end{aligned}$$

Link to original

Deep Deterministic Policy Gradient

Definition

Deep Deterministic Policy Gradient (DDPG) is an Actor-Critic Method based on DPG that learns deterministic policies to handle continuous action space by adapting DQN architecture such as experience replay and target network.

Because the policy of DDPG is deterministic, extra noise is necessary for exploration. The exploration policy is constructed as the sum of the deterministic action and a noise process, where is a noise process.

The target network updating in DQN is substituted by soft (gradual) updating.

$$\begin{aligned}
\hat{\phi} &\leftarrow \lambda\phi + (1-\lambda)\hat{\phi}\\
\hat{\theta} &\leftarrow \lambda\theta + (1-\lambda)\hat{\theta}
\end{aligned}$$

Actor (Policy Gradient Update)

$$J(\theta) = E_{s\sim \rho_{\mu}}[Q(s, a; \phi)]\Big|_{a = \mu(s;\theta)}$$

$$\Delta\theta = \nabla_{a}Q(s, a;\phi)|_{a=\mu(s;\theta)}\nabla_{\theta}\mu(s;\theta)$$

Where:

  • $\phi$: behavior critic network.
  • $\theta$: behavior actor network.

Critic (Value Network Update)

$$L(\phi) = E_{s\sim \rho_{\mu}}[r + \gamma \hat{Q}(s', \hat{\mu}(s';\hat{\theta}); \hat{\phi}) - Q(s, a; \phi)]^{2}\Big|_{a = \mu(s;\theta)}$$

$$\Delta\phi = (r + \gamma \hat{Q}(s', \hat{\mu}(s';\hat{\theta}); \hat{\phi}) - Q(s, a; \phi))\nabla_{\phi}Q(s, a; \phi)$$

Where:

  • $\phi$: behavior critic network.
  • $\theta$: behavior actor network.
  • $\hat{\phi}$: target critic network.
  • $\hat{\theta}$: target actor network.

Algorithm

  1. Initialize critic network $Q(s, a; \phi)$ and actor network $\mu(s; \theta)$ randomly, target networks $\hat{Q}, \hat{\mu}$ with weights $\hat{\phi} = \phi, \hat{\theta} = \theta$, and the replay buffer $\mathcal{R}$ to max size $N$.
  2. Repeat for each episode:
    1. Initialize state $s_{0}$ and a random noise process $\mathcal{N}$.
    2. Repeat for each step of an episode until terminal, $t=0, 1, \dots, T-1$:
      1. Select an action $a_{t} = \mu(s_{t}; \theta) + \mathcal{N}_{t}$.
      2. Take the action $a_{t}$ and observe a reward $r_{t+1}$ and a next state $s_{t+1}$.
      3. Store transition $(s_{t}, a_{t}, r_{t+1}, s_{t+1})$ in the replay buffer $\mathcal{R}$.
      4. Sample a random minibatch of $B$ transitions $(s_{i}, a_{i}, r_{i+1}, s_{i+1})$ from $\mathcal{R}$.
      5. $y_{i} \leftarrow r_{i+1} + \gamma \hat{Q}(s_{i+1}, \hat{\mu}(s_{i+1};\hat{\theta}); \hat{\phi})$.
      6. Update the critic network by minimizing the loss $L = \cfrac{1}{B}\sum\limits_{i}(y_{i} - Q(s_{i}, a_{i}; \phi))^{2}$.
      7. Update the actor network using the Deterministic Policy Gradient $\nabla_{\theta}J \approx \cfrac{1}{B}\sum\limits_{i}\nabla_{a}Q(s_{i}, a;\phi)|_{a=\mu(s;\theta)}\nabla_{\theta}\mu(s_{i};\theta)$.
      8. Update the target networks: $\hat{\phi} \leftarrow \lambda\phi + (1-\lambda)\hat{\phi}$, $\hat{\theta} \leftarrow \lambda\theta + (1-\lambda)\hat{\theta}$.

Link to original

TRPO

Trust Region Policy Optimization

Definition

The performance of DDPG does not improve monotonically. TRPO addresses this problem by adopting two concepts: the Minorize-Maximization Algorithm and a trust region. TRPO updates the policy parameter within a trust region in the policy space, guaranteeing monotonic improvement of the objective function (expected return). Although TRPO achieves high performance, its implementation is very complex, which makes it impractical.

TRPO starts from the maximization problem of expected return

The original problem is approximated several times to find a surrogate function that can be solved by the MM Algorithm. The surrogate function satisfying the trust-region constraint is found using the conservative policy iteration update, where is a local approximation of , and is a State Visitation Frequency.

The KL penalized optimization problem is transformed to a constrained optimization problem by Duality.

By approximations, the final optimization problem is

The optimization problem is solved by the natural policy gradient, and the monotonic improvement is checked with Backtracking Line Search, where is a policy gradient, is the Fisher Information Metric, and is the constraint threshold.

TRPO is performed by repeating 3 steps in each iteration

  1. Collect trajectories on , and update -values.
  2. By averaging over samples, construct the estimated objective and constraint.
  3. Approximately solve the constrained optimization by natural policy gradient and Backtracking Line Search to update the policy parameter .

Algorithm

  1. Initialize policy parameters .
  2. For :
    1. Collect a set of trajectories under current policy .
    2. Compute advantages using an advantage estimation algorithm.
    3. Compute policy gradient and Fisher Information Metric .
    4. Compute natural gradient .
    5. Estimate proposed step
    6. Perform Backtracking Line Search to obtain the final update: find the minimum such that the KL constraint is satisfied.
    7. Update the policy
Link to original

MM Algorithm

Definition

The MM algorithm is an iterative optimization method which exploits the convexity of a function in order to find its maxima or minima. The MM stands for Majorize-Minimization or Minorize-Maximization, depending on whether the desired optimization is a minimization or a maximization.

The MM algorithm works by finding a surrogate function that minorizes or majorizes the objective function. Optimizing the surrogate function will monotonically improve the value of the objective function.

Algorithm

Consider Minorize-Maximization algorithm for a maximization problem of a function .

  1. Find a surrogate function (minorized ) satisfying:
  2. Find a point that maximizes the surrogate function. Then,
  3. use the maximum point as the next point.
  4. Re-evaluate the surrogate function at the new point and repeat the iteration until convergence to a (local) maximum.
Link to original

Natural Gradient

Definition

The natural gradient is an optimization method that takes into account the geometry of the parameter space when updating parameters. Unlike the standard gradient, which points in the direction of the steepest ascent in the Euclidean Space of parameters, the natural gradient points in the direction of the steepest ascent in the space of probability Distribution induced by the parameters.

The natural gradient uses the Fisher Information matrix to define a Metric on the parameter space (Fisher Information Metric). It can be interpreted as the steepest ascent direction in the space of probability distributions, as measured by the KL-Divergence. Instead of taking a step that maximizes the change in the objective function in the Euclidean space of parameters, the natural gradient takes a step that maximizes the change in while keeping the change in the probability distribution, as measured by KL divergence, relatively small.

Link to original

PPO

Proximal Policy Optimization

Definition

Although TRPO achieves monotonic improvement, its implementation and computation are complicated due to the KL constraint. PPO removes the KL constraint from the optimization problem of TRPO by introducing the clipped surrogate objective, which penalizes excessively large policy updates.

PPO applies clipping to the policy ratio . The clipped surrogate function is defined as , where is the policy ratio and is a hyperparameter that determines the clipping range.
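A tiny sketch of the clipped surrogate for a batch of probability ratios and advantage estimates (the numbers and ε are made up):

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """L_CLIP = mean( min(r * A, clip(r, 1 - eps, 1 + eps) * A) )."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(ratio * adv, clipped).mean()

ratio = np.array([0.7, 1.0, 1.5])   # pi_new(a|s) / pi_old(a|s) for three sampled actions
adv = np.array([1.0, -2.0, 0.5])    # advantage estimates
print(clipped_surrogate(ratio, adv))
```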

satisfies , so we can still use MM Algorithm.

To simultaneously minimize the value function loss and encourage exploration, PPO appends additional terms to the objective function, where:

  • is the clipped surrogate objective
  • is the value function loss
  • is the Entropy bonus terms.
  • are hyperparameters that control the weights of the losses.
Link to original

Distributional RL

Distributional Reinforcement Learning

Definition

Distributional RL treats the Reward as a Random Variable, and uses random return , called an action-value distribution, instead of the Action-Value Function.

Algorithms

C51

QR-DQN

IQN

Link to original

Distributional Bellman Equation

Definition

Distributional Bellman Equation

Distributional Bellman Operator

The distributional Bellman equation can be written with the operator as

Link to original

Distributional Bellman Optimality Equation

Definition

Distributional Bellman Optimality Equation

Distributional Bellman Optimality Operator

The distributional Bellman optimality equation can be written with the operator as

Link to original

Distributional Policy Iteration

Definition

Policy Evaluation

Compute from the current policy .

Distributional Bellman operator for a fixed policy is a Contraction Map in a maximal form of Wasserstein Metric.

Thus, the iteration converges to a unique fixed point (by Contraction Lemma).

Control (Policy Improvement)

Seek a greedy policy based on the , which maximizes the expectation of .

The distributional Bellman optimality operator is not a Contraction Map, and is not even continuous in any Metric, which causes distributional instability. To overcome this problem, discrete action-value distributions are used to approximate the true distribution .

Link to original

C51

C51

Definition

The C51 algorithm is a Distributional Reinforcement Learning algorithm that approximates the distribution of the random return using a discretized distribution.

Architecture

The range of possible returns is divided into a fixed support of equally spaced bins , where and . and are the minimum and maximum possible return values, respectively. In the C51 algorithm, the support is determined once at the beginning and remains fixed throughout the entire learning process.

For each state-action pair , the algorithm estimates the probabilities that the return will fall into each -th bin. The probabilities are estimated by a neural network, which takes the state as input and outputs a vector of probabilities for each state-possible action pair . The Softmax Function ensures the properties of the probability of the output for each state-action pair. where are the raw outputs (logits) of the network for the -th atom and action , and represents the network’s parameters.

The discrete distribution over the fixed support is constructed as where is a Dirac’s delta function at .

Projection

In the policy evaluation process, discrete distributions raise a problem: and have disjoint supports. The Wasserstein Metric in the original setup is robust to this issue, but in practice KL-Divergence is used instead due to distributional instability and differentiability, so the problem still matters.

The disjoint support problem is solved by projection.

  1. Compute the distributional Bellman update: where
  2. Distribute the computed probability mass to the neighboring bins in the support proportionally to the distance from each original point: , where is the projection operator.

Loss Function

C51 uses the KL-Divergence as a loss function. where and are the parameters for behavior and target networks, respectively.

The loss function can be simplified to the Cross-Entropy Loss.

Algorithm

  1. Initialize behavior network and target network with random weights , and the replay buffer to max size .
  2. Repeat for each episode:
    1. Initialize sequence .
    2. Repeat for each step of an episode until terminal, :
      1. With probability , select a random action otherwise select .
      2. Take the action and observe a reward and a next state .
      3. Store transition in the replay buffer .
      4. Sample a random transition from .
      5. Compute and select a greedy action .
      6. Perform Gradient Descent on the loss .
      7. Update the target network parameter every steps.
      8. Update .
Link to original

QR-DQN

QR-DQN

Definition

The Quantile-Regression DQN (QR-DQN) algorithm is a Distributional Reinforcement Learning algorithm that approximates the distribution of random return using quantile regression.

Architecture

QR-DQN estimates a set of quantiles of return distribution, where represents the midpoint of -th quantile interval. This can be seen as adjusting the location of the supports of a uniform probability mass to approximate the desired quantile distribution.

The quantile distribution with uniform probability is constructed as where is a Dirac’s delta function at , and are the outputs of the network, representing the estimated quantile values. These values are obtained by applying the inverse CDF of the return distribution to the quantile midpoints .

Using the estimated quantile values as a support minimizes the Wasserstein Distance between the true return distribution and the estimated distribution.

Quantile Regression

Given data set , a -quantile minimizes the loss , where is a quantile loss function.

The quantile values are estimated by minimizing the quantile Huber loss function. Given a transition , the loss is defined as Where:

  • and .
  • where is Huber Loss.
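A small sketch of the quantile Huber loss for a set of predicted quantile values and target samples (κ and the numbers are illustrative):

```python
import numpy as np

def huber(u, kappa=1.0):
    return np.where(np.abs(u) <= kappa, 0.5 * u**2, kappa * (np.abs(u) - 0.5 * kappa))

def quantile_huber_loss(theta, targets, kappa=1.0):
    """theta: N predicted quantile values; targets: N' target samples (e.g. r + gamma * theta')."""
    N = len(theta)
    tau = (np.arange(N) + 0.5) / N                 # quantile midpoints tau_hat_i
    u = targets[None, :] - theta[:, None]          # pairwise TD errors, shape (N, N')
    weight = np.abs(tau[:, None] - (u < 0))        # asymmetric quantile weight |tau_i - 1{u<0}|
    return (weight * huber(u, kappa) / kappa).mean(axis=1).sum()

theta = np.array([-1.0, 0.0, 1.0])                 # predicted quantiles (illustrative)
targets = np.array([0.5, 1.5])                     # e.g. Bellman targets from the target network
print(quantile_huber_loss(theta, targets))
```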

Algorithm

  1. Initialize behavior network and target network with random weights , and the sample size .
  2. Repeat for each episode:
    1. Initialize sequence .
    2. Repeat for each step of an episode until terminal, :
      1. With probability , select a random action otherwise select .
      2. Take the action and observe a reward and a next state .
      3. Store transition in the replay buffer .
      4. Sample a random transition from .
      5. Select a greedy action
      6. Compute the target quantile values .
      7. Perform Gradient Descent on the loss .
      8. Update the target network parameter every steps.
      9. Update .
Link to original

IQN

IQN

Definition

The Implicit Quantile Network (IQN) algorithm is a Distributional Reinforcement Learning algorithm that implicitly estimates the quantiles of a return distribution by learning a function that maps a quantile fraction to the corresponding quantile value.

Architecture

IQN estimates the quantile value for a given state , action , and a quantile fraction . The output is the estimated quantile value corresponding to that fraction .

The input state is processed by the encoder layers of the neural network to produce a state embedding vector .

The quantile fraction is embedded into a higher-dimensional vector using a set of basis functions, whose dimension is the same as that of the state embedding . The two embeddings, the state embedding and the quantile embedding, are combined using the Hadamard Product .

The combined embedding is fed into the further layers to predict the quantile value

In summary, the quantile value for a given state is estimated by

By sampling different quantile fractions , IQN can approximate the entire return distribution. The expectation of the return can be approximated by averaging the estimated quantile values over the sampled quantile fractions where are sampled quantile fractions.

Loss Function

IQN uses a quantile Huber loss function, similar to QR-DQN. Given a transition , the loss is defined as Where:

  • and are quantile fractions for the current state-action pair and next state, respectively.
  • and .
  • where is Huber Loss.

Risk-Sensitive Policy

The IQN has information about the entire distribution by estimating the distribution of returns. This property makes it suitable for implementing risk-sensitive policies, allowing for decision-making considering risk. Risk-sensitive policies focus on not just the expected return, but the variability or uncertainty in the returns. IQN makes it feasible by providing estimated quantiles of the return distribution.

The function is used to define a distortion risk measure that focuses on specific parts of the return distribution. By applying to the quantile fraction , we can re-weight each outcome or change the sampling distribution.

These are examples of distortion measures:

  • Cumulative Probability Weighting Parametrization (CPW):
  • Wang: where is the CDF for standard normal distribution.
  • Power Formula:
  • Conditional Value at Risk (CVaR):

Algorithm

  1. Initialize behavior quantile network and target quantile network with random weights , the sample sizes , and a distortion measure .
  2. Repeat for each episode:
    1. Initialize sequence .
    2. Repeat for each step of an episode until terminal, :
      1. Sample quantile fractions under the distortion measure , and select a greedy action .
      2. Take the action and observe a reward and a next state .
      3. Store transition in the replay buffer .
      4. Sample a random transition from .
      5. Sample quantile fractions for the next state and select a greedy action .
      6. Compute the target quantile values .
      7. Perform Gradient Descent on the loss .
      8. Update the target network parameter every steps.
      9. Update .
Link to original