Definition

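Q-learning directly learns the optimal action-value function $q_*$ by using the Bellman optimality equation. The behavior policy of Q-learning is derived from $Q$ (for example, $\epsilon$-greedy), while the target policy is the greedy policy with respect to $Q$, so it is an off-policy method.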
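
For reference, the Bellman optimality equation for the action-value function can be written as follows. The notation ($q_*$ for the optimal action-value function, $\gamma$ for the discount factor, $S_t$, $A_t$, $R_{t+1}$ for the state, action and reward at time $t$) is an assumption that follows common textbook convention.

```latex
q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a \right]
```

Q-learning replaces the expectation with a single sampled transition and moves its estimate $Q(s, a)$ a step toward the sampled right-hand side. The maximum over next actions is taken regardless of which action the behavior policy actually chooses next, which is what makes the method off-policy.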

Algorithm

  1. Initialize $Q(s, a)$, for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal}, \cdot) = 0$.
  2. Repeat for each episode:
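    1. Initialize $S$.
    2. Repeat for each step of the episode until $S$ is terminal:
      1. Choose an action $A$ from the current state $S$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy).
      2. Take the action $A$ and observe a reward $R$ and a next state $S'$.
      3. Update $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$ and $S \leftarrow S'$, where $\alpha$ is the step size and $\gamma$ is the discount factor.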
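
The steps above can be sketched in Python as a minimal tabular implementation. Everything about the environment is an assumption made for illustration: a hypothetical five-state chain in which the agent moves left or right and receives a reward of 1 on reaching the rightmost (terminal) state. The names `step`, `epsilon_greedy`, `q_learning` and the hyperparameter values are likewise illustrative, not prescribed by the algorithm.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical toy chain)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical deterministic environment: reward 1 on reaching the terminal state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return reward, next_state


def epsilon_greedy(q, state, epsilon, rng):
    """Behavior policy derived from Q; ties between equal values are broken at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize Q to zero everywhere, so Q(terminal, .) = 0 holds as required.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0                                   # initialize S
        while state != N_STATES - 1:                # until S is terminal
            action = epsilon_greedy(q, state, epsilon, rng)
            reward, next_state = step(state, action)
            # Target uses the greedy (max) value of the next state: off-policy.
            td_target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (td_target - q[state][action])
            state = next_state                      # S <- S'
    return q


q = q_learning()
# Greedy policy read off the learned table for the non-terminal states.
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

On this toy chain the learned greedy policy should select "right" in every non-terminal state, and the learned values for moving right should approach $\gamma^{3-s}$ for state $s$. Random tie-breaking in `epsilon_greedy` is a design choice for this sketch: with an all-zero initial table, always picking the first action on ties would make early episodes very long.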