# Definition
Deep Deterministic Policy Gradient (DDPG) is an actor-critic method based on the Deterministic Policy Gradient (DPG) that learns a deterministic policy for continuous action spaces, adopting techniques from the DQN architecture such as experience replay and target networks.
Because the policy of DDPG is deterministic, extra noise is necessary for exploration. The exploration policy is constructed as the sum of the deterministic action and a noise process,

$$a_{t} = \mu(s_{t}; \theta) + \mathcal{N}_{t}$$

where $\mathcal{N}$ is a noise process (e.g., an Ornstein–Uhlenbeck process, as in the original DDPG paper, or simple Gaussian noise).
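For concreteness, a minimal sketch of an Ornstein–Uhlenbeck noise process in NumPy is shown below; the class name and the parameter values (`theta`, `sigma`, `dt`) are illustrative assumptions rather than values fixed by this note.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise N_t added to the deterministic action mu(s_t; theta)."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(action_dim)  # long-run mean of the process
        self.theta = theta                  # speed of mean reversion
        self.sigma = sigma                  # noise scale
        self.dt = dt
        self.reset()

    def reset(self):
        """Re-initialize the process at the start of each episode."""
        self.state = np.copy(self.mu)

    def sample(self):
        """One Euler-Maruyama step of dx = theta*(mu - x)*dt + sigma*sqrt(dt)*dW."""
        dx = self.theta * (self.mu - self.state) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state
```

During acting, the exploratory action is then `a_t = actor(s_t) + noise.sample()`, usually clipped to the environment's action bounds.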
The hard target-network update of DQN is replaced by a soft (gradual) update:
$$\begin{aligned}
\hat{\phi} &\leftarrow \lambda\phi + (1-\lambda)\hat{\phi}\\
\hat{\theta} &\leftarrow \lambda\theta + (1-\lambda)\hat{\theta}
\end{aligned}$$

## Actor (Policy Gradient Update)

$$J(\theta) = E_{s\sim \rho_{\mu}}[Q(s, a; \phi)]\Big|_{a = \mu(s;\theta)}$$

$$\Delta\theta = \nabla_{a}Q(s, a;\phi)|_{a=\mu(s;\theta)}\nabla_{\theta}\mu(s;\theta)$$

Where:

- $\phi$: behavior critic network.
- $\theta$: behavior actor network.

## Critic (Value Network Update)

$$L(\phi) = E\Big[\big(r + \gamma \hat{Q}(s', \hat{\mu}(s';\hat{\theta}); \hat{\phi}) - Q(s, a; \phi)\big)^{2}\Big]$$

$$\Delta\phi = \big(r + \gamma \hat{Q}(s', \hat{\mu}(s';\hat{\theta}); \hat{\phi}) - Q(s, a; \phi)\big)\nabla_{\phi}Q(s, a; \phi)$$

where the expectation is taken over transitions $(s, a, r, s')$ sampled from the replay buffer, and:

- $\phi$: behavior critic network.
- $\theta$: behavior actor network.
- $\hat{\phi}$: target critic network.
- $\hat{\theta}$: target actor network.

# Algorithm

1. Initialize the critic network $Q(s, a; \phi)$ and the actor network $\mu(s; \theta)$ randomly, the target networks $\hat{Q}, \hat{\mu}$ with weights $\hat{\phi} = \phi, \hat{\theta} = \theta$, and the replay buffer $\mathcal{R}$ with capacity $N$.
2. Repeat for each episode:
    1. Initialize the state $s_{0}$ and a random noise process $\mathcal{N}$.
    2. Repeat for each step of the episode until terminal, $t=0, 1, \dots, T-1$:
        1. Select an action $a_{t} = \mu(s_{t}; \theta) + \mathcal{N}_{t}$.
        2. Take the action $a_{t}$ and observe a reward $r_{t+1}$ and a next state $s_{t+1}$.
        3. Store the transition $(s_{t}, a_{t}, r_{t+1}, s_{t+1})$ in the replay buffer $\mathcal{R}$.
        4. Sample a random minibatch of $B$ transitions $(s_{i}, a_{i}, r_{i+1}, s_{i+1})$ from $\mathcal{R}$.
        5. Compute targets $y_{i} \leftarrow r_{i+1} + \gamma \hat{Q}(s_{i+1}, \hat{\mu}(s_{i+1};\hat{\theta}); \hat{\phi})$.
        6. Update the critic network by minimizing the loss $L = \cfrac{1}{B}\sum\limits_{i}(y_{i} - Q(s_{i}, a_{i}; \phi))^{2}$.
        7. Update the actor network using the [[Deterministic Policy Gradient]] $\nabla_{\theta}J \approx \cfrac{1}{B}\sum\limits_{i}\nabla_{a}Q(s_{i}, a;\phi)|_{a=\mu(s_{i};\theta)}\nabla_{\theta}\mu(s_{i};\theta)$.
        8. Update the target networks: $\hat{\phi} \leftarrow \lambda\phi + (1-\lambda)\hat{\phi}$, $\hat{\theta} \leftarrow \lambda\theta + (1-\lambda)\hat{\theta}$.
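The sketch below, using PyTorch, shows how steps 4–8 of the algorithm could map onto code. The function signature, the assumption that the critic takes `(s, a)` as two arguments, and the terminal mask `done` (which zeroes the bootstrap term at terminal states, a standard detail not spelled out in the algorithm above) are illustrative assumptions, not a definitive implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, lam=0.005):
    """One DDPG update from a sampled minibatch (steps 5-8 of the algorithm).

    `batch` is assumed to hold tensors s, a, r, s_next, done with shapes
    (B, state_dim), (B, action_dim), (B, 1), (B, state_dim), (B, 1).
    """
    s, a, r, s_next, done = batch

    # Step 5: targets y_i = r + gamma * Q_hat(s', mu_hat(s'; theta_hat); phi_hat).
    # No gradients flow through the target networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next))

    # Step 6: critic update by minimizing the mean squared TD error.
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 7: actor update via the deterministic policy gradient,
    # i.e. maximize Q(s, mu(s; theta); phi) by minimizing its negation.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 8: soft (Polyak) update of the target networks with rate lambda.
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), target_critic.parameters()):
            p_t.mul_(1.0 - lam).add_(lam * p)
        for p, p_t in zip(actor.parameters(), target_actor.parameters()):
            p_t.mul_(1.0 - lam).add_(lam * p)
```

Here `actor`/`critic` are the behavior networks ($\theta$, $\phi$), `target_actor`/`target_critic` are the target networks ($\hat{\theta}$, $\hat{\phi}$), and `lam` is the soft-update rate $\lambda$.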