# Reinforcement Learning Notes: State and Action Values

# 1. State Value / State Value Function

$$v_{\pi}(s) = \mathbb{E}[G_t | S_t = s]$$

  • The state value function $v_{\pi}(s)$ represents the expected discounted return when starting in state $s$ and following policy $\pi$.
  • Core concept: the average of the discounted return obtained under a given policy (see the sketch below).
  • Deterministic scenario: if the policy $\pi(a|s)$, the transition probability $p(s'|s, a)$, and the reward distribution $p(r|s, a)$ are all deterministic, the state value is identical to the return.
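
For intuition, here is a minimal sketch of the discounted return that $v_{\pi}(s)$ averages over trajectories; the reward sequence and $\gamma$ are made up for illustration:

```python
# Discounted return of one hypothetical trajectory: G = sum_t gamma^t * r_t.
# The state value v_pi(s) is the expectation of this quantity over all
# trajectories that start in state s and follow policy pi.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical rewards along one trajectory
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1 + 0.9*0 + 0.81*2 + 0.729*1 = 3.349
```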

# 2. Bellman Equation

The Bellman equation characterizes the relationship among the state-value functions of different states.

# 2.1 Scalar Form

$$v_{\pi}(s) = \sum_{a} \pi(a|s) \left[ \sum_{r} p(r|s, a)\,r + \gamma \sum_{s'} p(s'|s, a)\,v_{\pi}(s') \right], \quad \forall s \in S$$

  • $\pi(a|s)$ is a given policy, which assigns a probability to action $a$ in state $s$.
  • Solving the Bellman equation is called Policy Evaluation.

# 2.2 Matrix-Vector Form

$$v_{\pi} = r_{\pi} + \gamma P_{\pi} v_{\pi}$$

  • $v_{\pi} = [v_{\pi}(s_1), \dots, v_{\pi}(s_n)]^T$ is the vector of values at all states.
  • $r_{\pi} = [r_{\pi}(s_1), \dots, r_{\pi}(s_n)]^T$ is the immediate reward under policy $\pi$.
    • Calculation: $r_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{r} p(r|s, a)\,r$
  • $P_{\pi}$: the state transition matrix under policy $\pi$, with entries $[P_{\pi}]_{ss'} = \sum_{a} \pi(a|s)\, p(s'|s, a)$.

# 3. Solving State Values (for a certain policy)

State values can be obtained through:

  • Direct solution: $v_{\pi} = (I - \gamma P_{\pi})^{-1} r_{\pi}$
  • Iterative solution: $v_{k+1} = r_{\pi} + \gamma P_{\pi} v_k$, where the subscript $k$ denotes the iteration step; $v_k \to v_{\pi}$ as $k \to \infty$. A sketch of both follows.
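
Both solution methods in a minimal NumPy sketch; the transition matrix `P_pi`, the reward vector `r_pi`, and $\gamma$ below are hypothetical examples, not values from the notes:

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state example: P_pi[s, s'] is the transition probability
# under pi (rows sum to 1), r_pi[s] the expected immediate reward.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
r_pi = np.array([1.0, 0.0, -1.0])

# Direct solution: v = (I - gamma * P_pi)^{-1} r_pi
v_direct = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi v_k
v = np.zeros(3)
for _ in range(10_000):
    v_next = r_pi + gamma * P_pi @ v
    if np.max(np.abs(v_next - v)) < 1e-10:
        break
    v = v_next

print(v_direct)  # the two solutions agree up to the tolerance
print(v)
```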

# 4. Action Value

The action value function $q_{\pi}(s, a)$ is the expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$.

$$q_{\pi}(s, a) = \mathbb{E}[G_t | S_t = s, A_t = a]$$

$$v_{\pi}(s) = \sum_{a} \pi(a|s)\, q_{\pi}(s, a)$$

$$q_{\pi}(s, a) = \sum_{r} p(r|s, a)\,r + \gamma \sum_{s'} p(s'|s, a)\, v_{\pi}(s')$$

  • Greedy selection: compare the values of the different actions and select the one with the maximum $q_{\pi}(s, a)$, as in the sketch below.
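
A minimal sketch of computing $q_{\pi}(s, a)$ from a state-value vector and selecting greedily; the model arrays `P`, `R`, and the vector `v` are hypothetical:

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action model: P[a, s, s'] = p(s'|s, a),
# R[s, a] = expected immediate reward sum_r p(r|s, a) * r.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
v = np.array([5.0, 7.0])  # some state-value estimate

# q(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v(s')
q = R + gamma * np.einsum('ast,t->sa', P, v)
greedy = q.argmax(axis=1)  # greedy action in each state
print(q)
print(greedy)
```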

# Optimal Policy and Iterative Algorithms

# 1. Bellman Optimal Equation

A policy $\pi^*$ is optimal if $v_{\pi^*}(s) \geq v_{\pi}(s)$ for all $s$ and for any other policy $\pi$.

# 1.1 Element-wise Form

For all $s \in S$:

$$v(s) = \max_{\pi} \sum_{a} \pi(a|s) \left[ \sum_{r} p(r|s, a)\,r + \gamma \sum_{s'} p(s'|s, a)\,v(s') \right] = \max_{\pi} \sum_{a} \pi(a|s)\, q(s, a)$$

  • Known quantities: the system model $p(r|s, a)$ and $p(s'|s, a)$, the reward $r$, and the discount factor $\gamma$.
  • Unknowns: the value $v(s)$ and the policy $\pi$.

# 1.2 Matrix-Vector Form

$$v = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v)$$

The optimal solution is a deterministic greedy policy:

  • $\pi^*(a|s) = 1$ if $a = a^*$, else $0$.
  • $a^* = \text{argmax}_a\, q(s, a)$.

Define

$$f(v) := \max_{\pi}(r_{\pi} + \gamma P_{\pi} v)$$

so that the Bellman optimality equation becomes the fixed-point equation

$$v = f(v)$$

where

$$[f(v)]_s = \max_{\pi} \sum_{a} \pi(a|s)\, q(s, a), \quad s \in S$$

# 2. Contraction Mapping Theorem

A fixed-point problem: $v = f(v)$.

  • Theorem: If $f$ is a contraction mapping, i.e.,

$$\|f(x_1) - f(x_2)\| \leq \gamma \|x_1 - x_2\|, \text{ where } \gamma \in (0, 1),$$

then a unique fixed point $x^*$ exists.

  • Algorithm: the iteration $x_{k+1} = f(x_k)$ converges to $x^*$ as $k \to \infty$.
  • Application: the Bellman optimality equation $v = f(v)$ is exactly such a contraction-mapping fixed-point problem, as illustrated below.
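
A minimal one-dimensional illustration of the theorem (a made-up $f$, not the Bellman operator): $f(x) = 0.5x + 1$ is a contraction with coefficient $0.5$, so the iteration converges to the unique fixed point $x^* = 2$ from any start:

```python
# Fixed-point iteration for the contraction f(x) = 0.5*x + 1.
# |f(x1) - f(x2)| = 0.5 * |x1 - x2|, so the error halves each step.
def f(x):
    return 0.5 * x + 1.0

x = 100.0  # arbitrary starting point
for _ in range(60):
    x = f(x)
print(x)  # ~2.0, the unique fixed point x* = f(x*)
```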

# 3. Value Iteration (VI) Algorithm

$$v_{k+1} = f(v_k) = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v_k)$$

  • Step 1, policy update: $\pi_{k+1} = \text{argmax}_{\pi} (r_{\pi} + \gamma P_{\pi} v_k)$. Use the current $v_k$ to select, in each state, the action that maximizes $q_k(s, a)$.
  • Step 2, value update: $v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$.
  • Note: $v_k$ is not the true state value of the intermediate policy, since $v_k$ does not satisfy the Bellman equation.
  • Step 3: stop iterating when the change in $v_k$ falls below a threshold.

A q-table is an explicit tabular representation of $q(s, a)$: a matrix listing all states against all actions. A minimal value iteration sketch follows.
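
This sketch reuses the hypothetical 2-state, 2-action model from earlier (all numbers made up); each sweep builds the q-table, updates the policy greedily (Step 1), and sets $v_{k+1}(s) = \max_a q_k(s, a)$ (Step 2):

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[a, s, s'] = p(s'|s, a)
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a] = expected reward

v = np.zeros(2)
while True:
    # q-table: q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * v[s']
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    pi = q.argmax(axis=1)                  # Step 1: greedy policy update
    v_next = q.max(axis=1)                 # Step 2: value update
    if np.max(np.abs(v_next - v)) < 1e-8:  # Step 3: stopping criterion
        break
    v = v_next

print(pi)  # greedy policy read off the final q-table
print(v)   # converged optimal state values
```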

# 4. Policy Iteration (PI) Algorithm

  • Step 0: initialize with a random policy $\pi_0$.
  • Step 1: Policy Evaluation (PE): Solve

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}.$$

Closed form: $v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k}$
Iterative: $v^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v^{(j)}$

  • Step 2: Policy Improvement: $\pi_{k+1} = \text{argmax}_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_k})$, i.e., act greedily with respect to $v_{\pi_k}$. A sketch follows.
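
A minimal policy iteration sketch on the same hypothetical model, using the closed-form policy evaluation and stopping when the policy is stable:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[a, s, s'] = p(s'|s, a)
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
n = 2                                      # number of states

pi = np.zeros(n, dtype=int)                # Step 0: initial policy
while True:
    # Step 1: policy evaluation, closed form v = (I - gamma P_pi)^{-1} r_pi
    P_pi = P[pi, np.arange(n)]             # P_pi[s, :] = P[pi[s], s, :]
    r_pi = R[np.arange(n), pi]
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    # Step 2: policy improvement, greedy with respect to q
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    pi_next = q.argmax(axis=1)
    if np.array_equal(pi_next, pi):        # policy stable: optimal
        break
    pi = pi_next

print(pi, v)
```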

# 5. Comparison: VI and PI

When solving for the value function:

  • Value Iteration: only one evaluation sweep ($j = 1$) before updating $\pi$.
  • Truncated PI: a finite number $j$ of evaluation sweeps before updating $\pi$ (sketched below).
  • Policy Iteration: full evaluation ($j \to \infty$) before updating $\pi$.
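
A truncated policy iteration sketch interpolating between the two (same hypothetical model; the truncation level `j` is a free choice): with $j = 1$ it reduces to value iteration, and as $j \to \infty$ it approaches full policy iteration:

```python
import numpy as np

gamma, j = 0.9, 5                          # j = number of evaluation sweeps
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])

v = np.zeros(2)
for _ in range(200):
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    pi = q.argmax(axis=1)                  # policy update
    P_pi = P[pi, np.arange(2)]
    r_pi = R[np.arange(2), pi]
    for _ in range(j):                     # j truncated evaluation sweeps
        v = r_pi + gamma * P_pi @ v

print(pi, v)
```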