Definition
Figure: Example of a simple MDP with three states (green circles) and two actions (orange circles), with two rewards (orange arrows)
A Markov decision process (MDP) is a tuple $(S, A, P, R)$ (see the code sketch after this list), where:
- $S$: a set of states (the state space)
- $A$: a set of actions (the action space)
- $P(s' \mid s, a)$: the transition probability of moving from state $s$ to state $s'$ given action $a$
- $R(s, a, s')$: the immediate reward received after transitioning from state $s$ to state $s'$ due to action $a$
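As a concrete illustration, here is a minimal sketch of how such a tuple can be represented as plain Python data structures, loosely modeled on the three-state, two-action example in the figure. The state and action names, the probabilities, and the reward values are illustrative assumptions, not values taken from the figure.

```python
# A minimal MDP as plain data structures (illustrative values).

states = ["s0", "s1", "s2"]   # state space S
actions = ["a0", "a1"]        # action space A

# P[(s, a)] maps each next state s' to Pr(s' | s, a);
# the probabilities for each (s, a) pair must sum to 1.
P = {
    ("s0", "a0"): {"s0": 0.5, "s2": 0.5},
    ("s0", "a1"): {"s2": 1.0},
    ("s1", "a0"): {"s0": 0.7, "s1": 0.1, "s2": 0.2},
    ("s1", "a1"): {"s1": 0.95, "s2": 0.05},
    ("s2", "a0"): {"s0": 0.4, "s2": 0.6},
    ("s2", "a1"): {"s0": 0.3, "s1": 0.3, "s2": 0.4},
}

# R[(s, a, s')] is the immediate reward for transitioning from s to s'
# under action a; transitions not listed here yield a reward of 0.
R = {
    ("s1", "a0", "s0"): +5.0,
    ("s2", "a1", "s0"): -1.0,
}
```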
Facts
The environment model of an MDP is its transition probability $P$. If the transition probability is known, the setting is called model-based; otherwise it is called model-free.
Any (finite, discounted) Markov decision process satisfies the following (see the value-iteration sketch after this list):
- There exists an optimal policy $\pi^*$
- All optimal policies achieve the optimal state-value function $v^*(s)$
- All optimal policies achieve the optimal action-value function $q^*(s, a)$
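These facts can be illustrated with value iteration, one standard way to compute $v^*$ and extract a greedy optimal policy when the model is known. The sketch below reuses the `states`, `actions`, `P`, and `R` defined in the earlier example; the discount factor `gamma = 0.9` and the convergence threshold are assumed choices, not values from the text.

```python
# Value iteration (illustrative): computes the optimal state-value
# function v* and a policy that is greedy with respect to it.

gamma = 0.9    # discount factor (assumed choice)
theta = 1e-8   # convergence threshold (assumed choice)

def q_value(v, s, a):
    """Expected one-step return of taking action a in state s,
    i.e. sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * v(s'))."""
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * v[s2])
               for s2, p in P[(s, a)].items())

# Repeatedly apply the Bellman optimality backup until values converge.
v = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        best = max(q_value(v, s, a) for a in actions)
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    if delta < theta:
        break

# Any policy that acts greedily with respect to v* is optimal.
policy = {s: max(actions, key=lambda a: q_value(v, s, a)) for s in states}
print(v, policy)
```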