Definition

Prioritized replay is an enhancement to standard experience replay. Instead of sampling experiences uniformly from the replay buffer, prioritized replay samples important transitions more frequently based on their priority values. This allows the agent to learn more efficiently by focusing on the most informative experiences.

The priority of transition $i$ is derived from its TD error: $p_i = |\delta_i| + \epsilon$, where $\delta_i$ is the TD error and $\epsilon$ is a small positive constant that guarantees every transition has a nonzero priority.

The probability of sampling transition $i$ is $P(i) = \dfrac{p_i^\alpha}{\sum_k p_k^\alpha}$, where $\alpha$ controls how much prioritization is used (uniform sampling when $\alpha = 0$).
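As a minimal sketch of these two formulas (the TD-error values, $\epsilon$, and $\alpha$ below are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Hypothetical absolute TD errors for three stored transitions.
td_errors = np.array([2.0, 0.5, 0.1])

eps = 0.01    # small constant so every transition keeps a nonzero priority
alpha = 0.6   # degree of prioritization (alpha = 0 recovers uniform sampling)

# Proportional priorities: p_i = |delta_i| + eps
priorities = np.abs(td_errors) + eps

# Sampling probabilities: P(i) = p_i^alpha / sum_k p_k^alpha
probs = priorities ** alpha / np.sum(priorities ** alpha)
print(probs)  # roughly [0.62, 0.27, 0.11]

# Draw a minibatch of two indices according to P(i)
batch_idx = np.random.choice(len(priorities), size=2, p=probs)
```

Setting `alpha = 0` makes `probs` uniform, which recovers standard experience replay.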

However, prioritized replay can lead to a loss of diversity and introduces bias into the distribution of updates. These issues are alleviated with stochastic prioritization (controlled by $\alpha$) and corrected with importance-sampling weights.

The importance-sampling weights are calculated as $w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}$, where

  • $N$ is the size of the replay buffer
  • $\beta$ is an annealing parameter that fully compensates for the bias when $\beta = 1$. It starts from an initial value $\beta_0 < 1$ and is annealed to $1$ by the end of training.

For stability, each weight is normalized by $\frac{1}{\max_i w_i}$ before it is used.
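Continuing the sketch above, the importance-sampling correction can be computed as follows (the buffer size and $\beta$ are again illustrative assumptions):

```python
import numpy as np

N = 3                                   # replay buffer size in this toy example
beta = 0.4                              # annealing parameter, pushed toward 1 over training
probs = np.array([0.62, 0.27, 0.11])    # P(i) from the previous sketch (rounded)

# w_i = (1/N * 1/P(i))^beta
weights = (1.0 / (N * probs)) ** beta

# Normalize by the maximum weight so updates are only ever scaled down
weights /= weights.max()
print(weights)  # the rarest transition gets the largest (unit) weight
```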

Algorithm

Double DQN with Prioritized Replay

  1. Initialize the behavior network $Q$ with random weights $\theta$, the target network $\hat{Q}$ with weights $\theta^- = \theta$, and the replay buffer $\mathcal{D}$ with maximum size $N$.
  2. Repeat for each episode:
    1. Initialize the starting state $s_1$.
    2. Repeat for each step $t = 1, 2, \dots$ of the episode until a terminal state is reached:
      1. With probability $\epsilon$, select a random action $a_t$; otherwise select $a_t = \arg\max_a Q(s_t, a; \theta)$.
      2. Take the action $a_t$ and observe the reward $r_t$ and the next state $s_{t+1}$.
      3. Store the transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer with maximal priority $p_t = \max_{i < t} p_i$.
      4. Every $K$ (replay period) steps:
        1. Sample a random minibatch of $k$ transitions $j \sim P(j) = p_j^\alpha / \sum_i p_i^\alpha$ from $\mathcal{D}$.
        2. Compute the importance-sampling weight $w_j = \left(N \cdot P(j)\right)^{-\beta} / \max_i w_i$.
        3. Compute the TD error $\delta_j = r_j + \gamma\, \hat{Q}\big(s_{j+1}, \arg\max_a Q(s_{j+1}, a; \theta); \theta^-\big) - Q(s_j, a_j; \theta)$.
        4. Update the transition priority $p_j \leftarrow |\delta_j|$.
        5. Accumulate the weight change $\Delta \leftarrow \Delta + w_j \, \delta_j \, \nabla_\theta Q(s_j, a_j; \theta)$.
      5. Update the behavior network weights $\theta \leftarrow \theta + \eta \, \Delta$ and reset $\Delta = 0$.
      6. Update the target network parameters $\theta^- \leftarrow \theta$ every $C$ steps.
      7. Anneal $\beta$ toward $1$ (and decay the exploration rate $\epsilon$ if an $\epsilon$-greedy schedule is used).
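
A short sketch of one learning step (the sampling, weighting, TD-error, priority-update, and gradient steps above) is given below, assuming PyTorch; the network shapes, hyperparameters, and names such as `q_net`, `target_net`, and `replay_update` are illustrative assumptions, the paper's sum-tree storage is replaced by flat NumPy arrays for brevity, and the accumulated weight change $\Delta$ is realized as a weighted loss.

```python
import numpy as np
import torch
import torch.nn as nn

# --- hypothetical setup: state dimension, action count, and networks are assumptions ---
state_dim, n_actions, buffer_size = 4, 2, 10_000
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Flat-array replay storage (the paper uses a sum-tree for efficient sampling).
states = np.zeros((buffer_size, state_dim), dtype=np.float32)
actions = np.zeros(buffer_size, dtype=np.int64)
rewards = np.zeros(buffer_size, dtype=np.float32)
next_states = np.zeros((buffer_size, state_dim), dtype=np.float32)
dones = np.zeros(buffer_size, dtype=np.float32)
priorities = np.zeros(buffer_size, dtype=np.float32)
size = 0  # number of transitions currently stored (assumed > 0 when learning starts)

def replay_update(batch_size=32, gamma=0.99, alpha=0.6, beta=0.4, eps=1e-2):
    """One prioritized-replay learning step with a Double DQN target."""
    # Sample indices j ~ P(j) = p_j^alpha / sum_i p_i^alpha
    p = priorities[:size] ** alpha
    probs = p / p.sum()
    idx = np.random.choice(size, batch_size, p=probs)

    # Importance-sampling weights w_j = (N * P(j))^(-beta), normalized by the maximum
    weights = (size * probs[idx]) ** (-beta)
    weights = torch.as_tensor(weights / weights.max(), dtype=torch.float32)

    s = torch.as_tensor(states[idx])
    a = torch.as_tensor(actions[idx])
    r = torch.as_tensor(rewards[idx])
    s2 = torch.as_tensor(next_states[idx])
    d = torch.as_tensor(dones[idx])

    # Double DQN target: behavior net selects the action, target net evaluates it
    with torch.no_grad():
        best_a = q_net(s2).argmax(dim=1)
        target = r + gamma * (1 - d) * target_net(s2).gather(1, best_a.unsqueeze(1)).squeeze(1)

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_error = target - q

    # Update stored priorities: p_j <- |delta_j| (+ eps for the proportional variant)
    priorities[idx] = np.abs(td_error.detach().numpy()) + eps

    # The weighted loss plays the role of the accumulated w_j * delta_j * grad Q update
    loss = (weights * td_error.pow(2)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice the priorities are kept in a sum-tree so that sampling and priority updates cost $O(\log N)$, rather than the $O(N)$ of the flat arrays used in this sketch.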