
The Bellman Equation

The Bellman equation is one of the most fundamental mathematical tools in reinforcement learning: it describes the recursive relationship that value functions satisfy. Understanding it is essential for mastering virtually all reinforcement learning algorithms. The equation is named after Richard Bellman, who developed the theory of dynamic programming in the 1950s.

Return and Value

Before introducing the Bellman equation, let us briefly review the concepts of return and value.

Return

The return G_t is the cumulative discounted reward starting from time step t:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

The return can also be written recursively:

G_t = R_{t+1} + \gamma G_{t+1}

This recursive form is the foundation of the Bellman equation. It says that the return at the current time step equals the immediate reward plus the discounted return from the next time step.

An Intuitive View of the Recursion

Suppose you are playing a game and earn some points at every step. The total score you can still collect from the current moment onward equals:

  1. the points earned on this step, R_{t+1},
  2. plus the total score you can collect from the next step onward, G_{t+1}, discounted by a factor \gamma.

Why discount? Because future points are less certain than points earned now, and we usually care more about near-term gains.
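
As a quick sanity check, here is a minimal sketch (plain Python, with a made-up reward sequence) showing that summing \gamma^k R_{t+k+1} directly and unrolling the recursion G_t = R_{t+1} + \gamma G_{t+1} give the same number:

def discounted_return(rewards, gamma=0.9):
    # Direct sum: G_0 = sum_k gamma^k * R_{k+1}
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma=0.9):
    # Recursion: G_t = R_{t+1} + gamma * G_{t+1}, with G = 0 after the last reward
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0, 3.0]               # hypothetical reward sequence
print(discounted_return(rewards))             # 4.807
print(discounted_return_recursive(rewards))   # 4.807, same value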

Value Functions Revisited

The state-value function V^\pi(s) is the expected return:

V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]

The action-value function Q^\pi(s,a) additionally conditions on the first action:

Q^\pi(s,a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]
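
Because both definitions are expectations over returns, one way to make them concrete is to average sampled returns. A minimal every-visit Monte Carlo sketch (the episode list and its (state, reward) format are made up for illustration):

from collections import defaultdict

def mc_state_values(episodes, gamma=0.99):
    # episodes: list of trajectories, each a list of (state, reward) pairs
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so G_t = R_{t+1} + gamma * G_{t+1} accumulates naturally
        for state, reward in reversed(episode):
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two toy episodes: V(s) is estimated as the average return observed from s
episodes = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("B", 0.0), ("B", 1.0)]]
print(mc_state_values(episodes, gamma=0.9))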

The Bellman Expectation Equation

The Bellman expectation equation describes the recursive relationship satisfied by the value functions of a given policy. It is the theoretical basis of policy evaluation.

Deriving the Bellman Equation for the State Value

We start from the definition of the state value:

V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]

Substitute the recursive form of the return:

V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s]

By linearity of expectation:

V^\pi(s) = \mathbb{E}_\pi[R_{t+1} | S_t = s] + \gamma \mathbb{E}_\pi[G_{t+1} | S_t = s]

Expanding the expectations with the law of total expectation over all possible actions and next states, and using \mathbb{E}_\pi[G_{t+1} | S_{t+1} = s'] = V^\pi(s'), gives:

V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

This is the Bellman expectation equation for the state value.

Reading the Equation Term by Term

Let us walk through each factor:

  1. \pi(a|s): the probability of choosing action a in state s
  2. P(s'|s,a): the probability of transitioning to state s' after taking action a
  3. R(s,a,s'): the immediate reward received for this transition
  4. V^\pi(s'): the long-term value of the next state
  5. \gamma: the discount factor, which down-weights future value

The equation as a whole says that the value of a state is the expected value over all possible one-step transitions, where each path contributes its immediate reward plus the discounted value of the next state.

The Bellman Equation for the Action Value

The Bellman equation for the action value can be derived in the same way:

Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right]

or, expressed in terms of the state value:

Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

The two functions below compute a single Bellman expectation backup for V and Q, assuming the environment exposes enumerable states and actions plus transition and reward lookups:

import numpy as np

def bellman_expectation_v(env, V, policy, state, gamma=0.99):
    # V^pi(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
    value = 0
    for action in env.actions:
        action_prob = policy.get_prob(state, action)
        for next_state in env.states:
            transition_prob = env.get_transition_prob(state, action, next_state)
            reward = env.get_reward(state, action, next_state)
            value += action_prob * transition_prob * (reward + gamma * V[next_state])
    return value

def bellman_expectation_q(env, V, policy, state, action, gamma=0.99):
    # Q^pi(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
    value = 0
    for next_state in env.states:
        transition_prob = env.get_transition_prob(state, action, next_state)
        reward = env.get_reward(state, action, next_state)
        value += transition_prob * (reward + gamma * V[next_state])
    return value

The Relationship Between V and Q

The Bellman equations make the relationship between V and Q explicit:

V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)

Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

Together, these two equations fully characterize the value functions and underlie many reinforcement learning algorithms.
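
As a small illustration (tabular NumPy arrays with hypothetical, randomly generated sizes and model), the two identities translate directly into array operations: a one-step lookahead through the model turns a state-value estimate into action values, and averaging over the policy turns action values back into state values; applying both in sequence is exactly one Bellman expectation backup.

import numpy as np

num_states, num_actions, gamma = 4, 2, 0.99
rng = np.random.default_rng(0)

# Hypothetical tabular quantities: policy pi(a|s), model P(s'|s,a), rewards R(s,a,s')
policy = np.full((num_states, num_actions), 1.0 / num_actions)
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
R = rng.normal(size=(num_states, num_actions, num_states))
V = rng.normal(size=num_states)

# Q(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]   (one-step lookahead)
Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])

# V(s) = sum_a pi(a|s) Q(s,a)                              (average over the policy)
V_backup = np.einsum("sa,sa->s", policy, Q)
print(V_backup)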

The Bellman Optimality Equation

The Bellman optimality equation describes the recursion satisfied by the optimal value functions, and it is the theoretical basis for computing optimal policies.

Optimal State Value

The optimal state value V^*(s) satisfies:

V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]

Unlike the Bellman expectation equation, the expectation over actions is replaced by a \max: an optimal policy always picks the action with the highest expected value.

Optimal Action Value

The optimal action value Q^*(s,a) satisfies:

Q^*(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]

Optimal Policy

An optimal policy can be read off directly from the optimal action-value function:

\pi^*(s) = \arg\max_a Q^*(s,a)

Equivalently, as a deterministic policy expressed through V^*:

\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]
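
In tabular form this is a single argmax per state. A minimal sketch, assuming Q^* is stored as a |S| x |A| NumPy array (the numbers below are made up):

import numpy as np

# Hypothetical optimal action values for 3 states and 2 actions
Q_star = np.array([[0.2, 0.8],
                   [0.5, 0.1],
                   [0.0, 0.0]])

# pi*(s) = argmax_a Q*(s, a); ties are broken toward the lower action index
optimal_policy = Q_star.argmax(axis=1)
print(optimal_policy)  # [1 0 0]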

The Bellman Optimality Operator and Its Contraction Property

The Bellman optimality equation defines a nonlinear operator T^*:

(T^* V)(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V(s') \right]

This operator is a contraction mapping in the sup norm:

||T^* V_1 - T^* V_2||_\infty \leq \gamma ||V_1 - V_2||_\infty

Since \gamma < 1, repeatedly applying the operator converges to its unique fixed point V^*.
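
The contraction property is easy to check numerically. A small sketch on a randomly generated MDP (all sizes and the random model are made up for illustration, and rewards are simplified to depend only on (s,a)):

import numpy as np

rng = np.random.default_rng(42)
num_states, num_actions, gamma = 5, 3, 0.9

# Hypothetical random model: P(s'|s,a) with rows summing to 1, rewards R(s,a)
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
R = rng.normal(size=(num_states, num_actions))

def bellman_optimality_operator(V):
    # (T* V)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    return np.max(R + gamma * P @ V, axis=1)

V1, V2 = rng.normal(size=num_states), rng.normal(size=num_states)
lhs = np.max(np.abs(bellman_optimality_operator(V1) - bellman_optimality_operator(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(lhs <= rhs + 1e-12)  # True: ||T*V1 - T*V2||_inf <= gamma * ||V1 - V2||_inf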

The Matrix Form of the Bellman Equation

For a finite state space, the Bellman equation can be written in matrix form, which is convenient for analysis and computation.

Matrix Notation

Let the state-value vector be \mathbf{V} \in \mathbb{R}^{|S|}. The Bellman expectation equation can then be written as:

\mathbf{V}^\pi = \mathbf{R}^\pi + \gamma \mathbf{P}^\pi \mathbf{V}^\pi

where:

  • \mathbf{R}^\pi \in \mathbb{R}^{|S|} is the vector of expected immediate rewards, R^\pi_s = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) R(s,a,s')
  • \mathbf{P}^\pi \in \mathbb{R}^{|S| \times |S|} is the state-transition matrix, P^\pi_{ss'} = \sum_a \pi(a|s) P(s'|s,a)

Closed-Form Solution

The state values can be obtained directly by matrix algebra:

\mathbf{V}^\pi = (\mathbf{I} - \gamma \mathbf{P}^\pi)^{-1} \mathbf{R}^\pi

The derivation (the inverse exists because \gamma < 1, so the spectral radius of \gamma \mathbf{P}^\pi is strictly below 1):

\mathbf{V}^\pi = \mathbf{R}^\pi + \gamma \mathbf{P}^\pi \mathbf{V}^\pi

\mathbf{V}^\pi - \gamma \mathbf{P}^\pi \mathbf{V}^\pi = \mathbf{R}^\pi

(\mathbf{I} - \gamma \mathbf{P}^\pi) \mathbf{V}^\pi = \mathbf{R}^\pi

\mathbf{V}^\pi = (\mathbf{I} - \gamma \mathbf{P}^\pi)^{-1} \mathbf{R}^\pi

The following code builds \mathbf{P}^\pi and \mathbf{R}^\pi for a small grid world under a uniform random policy and solves the linear system directly:

import numpy as np

def solve_value_function_matrix(transition_matrix, reward_vector, gamma=0.99):
    # V^pi = (I - gamma * P^pi)^(-1) R^pi
    num_states = transition_matrix.shape[0]
    identity = np.eye(num_states)
    V = np.linalg.inv(identity - gamma * transition_matrix) @ reward_vector
    return V

class GridWorldMatrix:
    def __init__(self, size=4):
        self.size = size
        self.num_states = size * size
        self.num_actions = 4  # 0: up, 1: down, 2: left, 3: right
        self.goal = self.num_states - 1

    def build_matrices(self, policy):
        # Build P^pi and R^pi from a |S| x |A| array of action probabilities
        P = np.zeros((self.num_states, self.num_states))
        R = np.zeros(self.num_states)

        for s in range(self.num_states):
            if s == self.goal:
                # The goal is absorbing and yields no further reward
                P[s, s] = 1.0
                R[s] = 0.0
                continue

            for a in range(self.num_actions):
                prob = policy[s, a]
                next_s = self.get_next_state(s, a)
                P[s, next_s] += prob
                R[s] += prob * self.get_reward(s, a, next_s)

        return P, R

    def get_next_state(self, state, action):
        # Deterministic moves; bumping into a wall leaves the position unchanged
        row, col = state // self.size, state % self.size

        if action == 0:
            row = max(0, row - 1)
        elif action == 1:
            row = min(self.size - 1, row + 1)
        elif action == 2:
            col = max(0, col - 1)
        elif action == 3:
            col = min(self.size - 1, col + 1)

        return row * self.size + col

    def get_reward(self, state, action, next_state):
        # +1 for reaching the goal, a small step penalty otherwise
        if next_state == self.goal:
            return 1.0
        return -0.01

grid = GridWorldMatrix()
uniform_policy = np.ones((grid.num_states, grid.num_actions)) / grid.num_actions
P, R = grid.build_matrices(uniform_policy)
V = solve_value_function_matrix(P, R)
print("State values (matrix solution):")
print(V.reshape(4, 4).round(3))

The closed-form solution costs O(|S|^3) time, so it is only practical when the state space is small. Large-scale problems call for iterative methods.

Value Iteration

The Bellman optimality equation suggests a method for computing an optimal policy: value iteration.

The Idea

Value iteration approximates the optimal value function by repeatedly applying the Bellman optimality backup:

V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V_k(s') \right]

Each iteration applies the Bellman optimality operator T^* once. Because T^* is a contraction mapping, the value function converges to V^*.

Convergence Analysis

Let V^* be the optimal value function and V_k the value function after the k-th iteration. Then:

||V_{k+1} - V^*||_\infty \leq \gamma ||V_k - V^*||_\infty

This means:

  • each iteration shrinks the error by at least a factor of \gamma
  • after k iterations the error is at most \gamma^k ||V_0 - V^*||_\infty
  • the error decays geometrically, i.e. exponentially fast in the number of iterations

Implementation

def value_iteration(env, gamma=0.99, theta=1e-6, max_iterations=1000):
    # Repeatedly apply the Bellman optimality backup until values change by less than theta
    V = np.zeros(len(env.states))
    iteration = 0

    while iteration < max_iterations:
        delta = 0
        for state in env.states:
            if env.is_terminal(state):
                continue

            v = V[state]
            q_values = []

            for action in env.actions:
                q = 0
                for next_state in env.states:
                    prob = env.get_transition_prob(state, action, next_state)
                    reward = env.get_reward(state, action, next_state)
                    q += prob * (reward + gamma * V[next_state])
                q_values.append(q)

            V[state] = max(q_values)
            delta = max(delta, abs(v - V[state]))

        iteration += 1

        if delta < theta:
            print(f"Value iteration converged after {iteration} iterations")
            break

    # Derive the greedy policy from the converged values
    policy = extract_policy(env, V, gamma)

    return V, policy, iteration

def extract_policy(env, V, gamma=0.99):
    # One-step lookahead: pick the action with the highest Q value in every state
    policy = {}
    for state in env.states:
        if env.is_terminal(state):
            policy[state] = None
            continue

        q_values = {}
        for action in env.actions:
            q = 0
            for next_state in env.states:
                prob = env.get_transition_prob(state, action, next_state)
                reward = env.get_reward(state, action, next_state)
                q += prob * (reward + gamma * V[next_state])
            q_values[action] = q

        policy[state] = max(q_values, key=q_values.get)

    return policy

Policy Iteration

Policy iteration is another way to compute an optimal policy; it alternates between policy evaluation and policy improvement.

Algorithm Steps

  1. Policy evaluation: compute the value function V^\pi of the current policy \pi
  2. Policy improvement: improve the policy greedily with respect to V^\pi
  3. Repeat until the policy no longer changes

Policy Evaluation

Policy evaluation solves the Bellman expectation equation iteratively:

V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V_k(s') \right]

Policy Improvement

Given the current value function, pick the action with the largest Q value in every state:

\pi'(s) = \arg\max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

Implementation

def policy_iteration(env, gamma=0.99, theta=1e-6, max_iterations=100):
    # Start from an arbitrary deterministic policy and all-zero values
    policy = {s: env.actions[0] for s in env.states if not env.is_terminal(s)}
    V = {s: 0.0 for s in env.states}

    for iteration in range(max_iterations):
        # Policy evaluation: compute V^pi for the current policy
        V = policy_evaluation(env, policy, V, gamma, theta)

        # Policy improvement: act greedily with respect to V^pi
        policy_stable = True
        for state in env.states:
            if env.is_terminal(state):
                continue

            old_action = policy[state]

            q_values = {}
            for action in env.actions:
                q = 0
                for next_state in env.states:
                    prob = env.get_transition_prob(state, action, next_state)
                    reward = env.get_reward(state, action, next_state)
                    q += prob * (reward + gamma * V[next_state])
                q_values[action] = q

            policy[state] = max(q_values, key=q_values.get)

            if old_action != policy[state]:
                policy_stable = False

        if policy_stable:
            print(f"Policy iteration converged after {iteration + 1} iterations")
            break

    return V, policy

def policy_evaluation(env, policy, V, gamma=0.99, theta=1e-6):
    # Iterate the Bellman expectation backup until the values stop changing
    while True:
        delta = 0
        for state in env.states:
            if env.is_terminal(state):
                continue

            v = V[state]
            action = policy[state]

            q = 0
            for next_state in env.states:
                prob = env.get_transition_prob(state, action, next_state)
                reward = env.get_reward(state, action, next_state)
                q += prob * (reward + gamma * V[next_state])

            V[state] = q
            delta = max(delta, abs(v - V[state]))

        if delta < theta:
            break

    return V

Value Iteration vs. Policy Iteration

Property                 Value Iteration             Policy Iteration
Work per iteration       One value-function backup   Policy evaluation + policy improvement
Iterations to converge   More (slower)               Fewer (faster)
Cost per iteration       Low                         High (requires full policy evaluation)
Typical use              Larger state spaces         Smaller state spaces
Guarantee                Converges to the optimum    Converges to the optimum

Modified Policy Iteration

Modified policy iteration is a compromise between the two methods: the policy-evaluation step runs only a fixed number of sweeps instead of iterating to convergence:

def modified_policy_iteration(env, gamma=0.99, k=10, theta=1e-6):
    policy = {s: env.actions[0] for s in env.states if not env.is_terminal(s)}
    V = {s: 0.0 for s in env.states}

    while True:
        # Partial policy evaluation: only k sweeps of the Bellman expectation backup
        for _ in range(k):
            new_V = {}
            for state in env.states:
                if env.is_terminal(state):
                    new_V[state] = 0.0
                    continue

                action = policy[state]
                q = 0
                for next_state in env.states:
                    prob = env.get_transition_prob(state, action, next_state)
                    reward = env.get_reward(state, action, next_state)
                    q += prob * (reward + gamma * V[next_state])
                new_V[state] = q
            V = new_V

        # Policy improvement, as in standard policy iteration
        policy_stable = True
        for state in env.states:
            if env.is_terminal(state):
                continue

            old_action = policy[state]

            q_values = {}
            for action in env.actions:
                q = 0
                for next_state in env.states:
                    prob = env.get_transition_prob(state, action, next_state)
                    reward = env.get_reward(state, action, next_state)
                    q += prob * (reward + gamma * V[next_state])
                q_values[action] = q

            policy[state] = max(q_values, key=q_values.get)

            if old_action != policy[state]:
                policy_stable = False

        if policy_stable:
            break

    return V, policy

Example: Solving a Grid World

Let us use the Bellman equation to solve a complete grid-world problem:

import numpy as np
import matplotlib.pyplot as plt

class GridWorld:
    def __init__(self, size=4, goal=None, gamma=0.99):
        self.size = size
        self.states = list(range(size * size))
        self.actions = [0, 1, 2, 3]
        self.action_names = ['Up', 'Down', 'Left', 'Right']
        self.goal = goal if goal else size * size - 1
        self.gamma = gamma

    def step(self, state, action):
        # Deterministic transitions; walls keep the agent in place
        row, col = state // self.size, state % self.size

        if action == 0:
            row = max(0, row - 1)
        elif action == 1:
            row = min(self.size - 1, row + 1)
        elif action == 2:
            col = max(0, col - 1)
        elif action == 3:
            col = min(self.size - 1, col + 1)

        next_state = row * self.size + col
        reward = 1.0 if next_state == self.goal else -0.01
        done = next_state == self.goal
        return next_state, reward, done

    def get_transition_prob(self, state, action, next_state):
        actual_next, _, _ = self.step(state, action)
        return 1.0 if next_state == actual_next else 0.0

    def get_reward(self, state, action, next_state):
        return 1.0 if next_state == self.goal else -0.01

    def is_terminal(self, state):
        return state == self.goal

    def render_values(self, V):
        V_array = np.array([V[s] for s in self.states]).reshape(self.size, self.size)
        print("\nState value function:")
        print(V_array.round(3))

    def render_policy(self, policy):
        policy_grid = []
        for s in self.states:
            if self.is_terminal(s):
                policy_grid.append('G')
            else:
                policy_grid.append(self.action_names[policy[s]])

        print("\nOptimal policy:")
        for i in range(self.size):
            row = policy_grid[i * self.size:(i + 1) * self.size]
            print(' '.join(row))

env = GridWorld(size=4)

print("=" * 50)
print("Value iteration")
print("=" * 50)
V_vi, policy_vi, iters_vi = value_iteration(env)
env.render_values(V_vi)
env.render_policy(policy_vi)

print("\n" + "=" * 50)
print("Policy iteration")
print("=" * 50)
V_pi, policy_pi = policy_iteration(env)
env.render_values(V_pi)
env.render_policy(policy_pi)

Example output:

==================================================
Value iteration
==================================================
Value iteration converged after 127 iterations

State value function:
[[0.914 0.924 0.935 0.947]
 [0.924 0.935 0.947 0.959]
 [0.935 0.947 0.959 0.972]
 [0.947 0.959 0.972 1.   ]]

Optimal policy:
Down Down Down Down
Down Down Down Down
Down Down Down Down
Down Down Down G

Variants of the Bellman Equation

Asynchronous Dynamic Programming

Classical dynamic programming sweeps over every state. Asynchronous dynamic programming relaxes this requirement and can:

  • update in place, using the latest value estimates
  • use prioritized sweeping, updating first the states whose values changed the most
  • use real-time dynamic programming, updating only the states that are actually visited

The sketch below implements a simplified prioritized-sweeping loop; it assumes the environment additionally provides a get_predecessors(state) method returning the states that can transition into a given state:

import heapq

def prioritized_sweeping(env, gamma=0.99, theta=1e-6):
    V = np.zeros(len(env.states))
    priority_queue = []

    # Seed the queue with every non-terminal state
    for state in env.states:
        if not env.is_terminal(state):
            heapq.heappush(priority_queue, (0, state))

    while priority_queue:
        _, state = heapq.heappop(priority_queue)

        v = V[state]
        q_values = []

        # Bellman optimality backup for the popped state
        for action in env.actions:
            q = 0
            for next_state in env.states:
                prob = env.get_transition_prob(state, action, next_state)
                reward = env.get_reward(state, action, next_state)
                q += prob * (reward + gamma * V[next_state])
            q_values.append(q)

        V[state] = max(q_values)

        # If the value changed significantly, re-queue the predecessors
        # (priority is negated so that larger changes are popped first)
        priority = abs(v - V[state])
        if priority > theta:
            for prev_state in env.get_predecessors(state):
                heapq.heappush(priority_queue, (-priority, prev_state))

    return V

Summary

The Bellman equation is the mathematical core of reinforcement learning:

  • Bellman expectation equation: the recursion satisfied by the value functions of a given policy, V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V^\pi(s')]

  • Bellman optimality equation: the recursion satisfied by the optimal value function, V^*(s) = \max_a \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V^*(s')]

  • Value iteration: computes an optimal policy by repeatedly applying the Bellman optimality operator

  • Policy iteration: alternates between policy evaluation and policy improvement

The Bellman equation lays the theoretical groundwork for Q-Learning, DQN, and the algorithms that follow. The next chapter introduces Q-Learning, a model-free reinforcement learning method built on the Bellman equation.

References

  1. Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  2. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  3. Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.