Definition

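Q-learning learns the optimal action-value function $Q^*$ directly by using the Bellman optimality equation. Its behavior policy is derived from the current estimate $Q$ (e.g. $\epsilon$-greedy), while its target policy is the greedy policy with respect to $Q$, so it is an off-policy method.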
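
For reference, the Bellman optimality equation for the action-value function can be written as below. This is a sketch in the usual MDP notation, which this section does not spell out: $p(s', r \mid s, a)$ is assumed to be the transition dynamics and $\gamma$ the discount factor.

$$
Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} Q^*(s', a') \right]
$$

Q-learning never evaluates this sum explicitly. It replaces the expectation with a single sampled transition and moves its estimate a small step toward the sampled right-hand side.
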
Algorithm
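- Initialize $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, and set $Q(\text{terminal}, \cdot) = 0$.
- Repeat for each episode:
  - Initialize $S$.
  - Repeat for each step of the episode until $S$ is terminal:
    - Choose an action $A$ from the current state $S$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy).
    - Take the action $A$ and observe a reward $R$ and a next state $S'$.
    - Update $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]$, then set $S \leftarrow S'$. Here $\alpha$ is the step size and $\gamma$ is the discount factor.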
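
The steps above can be sketched as a small tabular implementation. This is a minimal illustration rather than a reference implementation: the corridor environment `chain_step`, the helper `epsilon_greedy`, and all default hyperparameters (`alpha`, `gamma`, `epsilon`, `episodes`, `seed`) are assumptions chosen only to make the example self-contained.

```python
import random


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    best = max(Q[state])
    # Break ties at random so an all-zero table does not always pick action 0.
    return rng.choice([a for a, v in enumerate(Q[state]) if v == best])


def chain_step(state, action, n_states):
    """Hypothetical toy environment: a corridor of n_states cells.

    Action 0 moves left (bounded at cell 0), action 1 moves right.
    Reaching the last cell ends the episode with reward 1; all other
    transitions give reward 0.
    """
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(n_states=5, n_actions=2, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize Q to zero everywhere; the terminal row is never updated,
    # so Q(terminal, .) stays 0.
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0          # Initialize S
        done = False
        while not done:    # Repeat until S is terminal
            action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
            next_state, reward, done = chain_step(state, action, n_states)
            # Target policy is greedy: bootstrap from max_a Q(S', a).
            target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


if __name__ == "__main__":
    for s, row in enumerate(q_learning()):
        print(s, [round(v, 3) for v in row])
```

Note that the action is selected by the $\epsilon$-greedy behavior policy, while the update bootstraps from the maximum over next-state actions. That mismatch is what makes the method off-policy. In this toy corridor, the learned table should prefer moving right in every non-terminal cell.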