Bellman equations and Markov Decision Process
Bellman equations and Markov decision process
Recall: Markov Reward Process
Markov Reward Process is Markov Chain with rewards. it is a tuple of $(S, T, R, \gamma)$
- $S$ is a set of Markov states $(s \in S)$
- $T$ is transition model (or dynamics) that specifies $T(t, s_1, s_2) = T_{s_1s_2} \doteq P(s_{t+1} = s_2 \vert s_t = s_1)$
- $R$ is a reward function $R(s_t = s) = E[r_t \vert s_t = s]$
- $\gamma$ is Discount factor $\gamma \in [0, 1]$
Note: - There is no actions - If $S$ is finite, then $R$ can be represented by a vector - By Markov property, the transition model is independent of $t$ so that it can be expresed by $T(t, s_1, s_2) = T(s_1, s_2) = P(s_{t+1} = s_2 \vert s_t = s_1)$ - Also same as $R(s_t = s) = R(s)$
Markov Reward Process for Finite State
If $S$ is finite, $S$ can be represented by a vector: \(S=\{s_1, s_2, \dots, s_N \}\) And the transition model $T$ can be represented by a matrix:
\[T = \begin{bmatrix} P(s_1 \vert s_1) & P(s_2 \vert s_1) & \cdots & P(s_N \vert s_1) \\ P(s_1 \vert s_2) & P(s_2 \vert s_2) & \cdots & P(s_N \vert s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1 \vert s_N) & P(s_2 \vert s_N) & \cdots & P(s_N \vert s_N)\end{bmatrix}\]Also, the reward function $R$ can be represented by a vector: \(R = \big( R(s_1), R(s_2, \cdots, R(s_N) \big)^T\)
Computing return from rewards
| $R$ is a reward function $$ R(s) = E[r_t | s_t = s]$$ We can calculate total discounted rewards from $t$ be the return $G_t$: | 
- Time of horizon: number of time steps in each episode- Finite ($t \leq T$) or infinite ($t \rightarrow \infty$)
- If finite, MRP is called finite MRP
 
State-value function for MRP
- State-value function $V$: expected discounted sum of future rewards from initial state $s$ \(\begin{aligned} V(s_t = s) &= E[G_t \vert s_t = s] \\ &= E[r_t + \gamma r_{t+1} + \gamma^2r_{t+2} + \gamma^3 r_{t+3} + \cdots \vert s_t = s] \end{aligned}\)
- By Markov property, $V(s_t = s)$ is independent of the initial step $t$ so that we can use simpler notation $V(s)$ instead of $V(s_t = s)$.
Bellman Equation for MRP
The value function can be decomposed by: - Immediate reward ($r_t$) - Discounterd value of subsequent states ($\gamma G_{t+1}$)
The Bellman equation for MRP is: \(\begin{aligned} V(s) &= E[G_t \vert s_t = s] \\ &= E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \vert s_t = s] \\ &= E[r_t + \gamma G_{t+1} \vert s_t = s] \\ &= E[r_t | s_t = s] + \gamma E[ E[G_{t+1} \vert s_{t+1}] \vert s_t = s] \\ &= R(s) + \gamma E[V(s_{t+1}) \vert s_t = s] \end{aligned}\)
If we have finite state, the Bellman equation of state-value function for MRP can be represented by vectors and the transition matrix:
\[\begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix} = \begin{bmatrix} R(s_1) \\ \vdots \\ R(s_N) \end{bmatrix} + \gamma \begin{bmatrix} T_{11} & \cdots & T_{1N} \\ T_{21} & \cdots & T_{2N} \\ \vdots & \ddots & \vdots \\ T_{N1} & \cdots & T_{NN} \end{bmatrix} \begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix}\]Note : $T_{ij} = P(s_{t+1} = s_j \vert s_t = s_i)$
It can simplified by: \(V = R + \gamma T V\)
How to solve Bellman Equation
Solving the linear system by using linear solvers: \(V = R + \gamma T V \\ (\mathbb{I}- \gamma T) V = R \\ V = (\mathbb{I} - \gamma T)^{-1} R\)
Note: If $\gamma < 1$, then $\mathbb{I}- \gamma T$ is invertible
Computation Complexity is $O(n^3)$ where $n$ is the number of states. If $n$ is huge, then it is not proper to solve the linear system directly. Instead, we can use iterative methods like Dynamic Programming.