Policy Gradient Methods

Policy gradient methods are a family of reinforcement learning algorithms that optimize the policy directly. Unlike value-based methods, they do not derive the policy from a learned value function; instead, they learn the policy parameters themselves, adjusting them so that better actions become more likely.

Why Policy Gradients?

Limitations of Value-Based Methods

Q-Learning and DQN are powerful, but they have some inherent limitations:

  1. Discrete actions only: the max operation over actions cannot be applied directly to continuous action spaces
  2. Potentially unstable policies: a small change in the value function can cause an abrupt change in the greedy policy
  3. Difficulty with stochastic policies: the optimal policy may be stochastic, but value-based methods tend toward deterministic policies

Advantages of Policy Gradients

Policy gradient methods offer the following advantages:

  1. Continuous actions handled naturally: the policy outputs action probabilities or continuous action values directly
  2. Stochastic policies: the output is a probability distribution over actions
  3. Better convergence behavior: policy updates are smoother
  4. No value-function approximation required: simpler in some settings

Parameterizing the Policy

The policy $\pi_\theta(a|s)$ is parameterized by $\theta$ and outputs the probability of selecting action $a$ in state $s$.

Discrete Action Spaces

Use a neural network that outputs a probability distribution over the actions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)  # probabilities over discrete actions
        )

    def forward(self, x):
        return self.network(x)

    def select_action(self, state):
        probs = self.forward(state)
        action = torch.multinomial(probs, 1)  # sample one action from the distribution
        return action.item(), probs
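
As a quick smoke test (a minimal sketch; the dimensions below are hypothetical and happen to match CartPole):

policy = PolicyNetwork(state_dim=4, action_dim=2)  # hypothetical dimensions
state = torch.randn(1, 4)                          # a single random state
action, probs = policy.select_action(state)
print(action, probs.shape)                         # an int in {0, 1}, torch.Size([1, 2])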

Continuous Action Spaces

For continuous actions, a Gaussian policy is commonly used:

$$\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s))$$

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean = nn.Linear(hidden_dim, action_dim)
        # State-independent log standard deviation, learned as a free parameter
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, x):
        features = self.shared(x)
        mean = self.mean(features)
        std = torch.exp(self.log_std)  # exp keeps the std positive
        return mean, std

    def select_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)  # sum over action dimensions
        return action, log_prob
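
A hypothetical usage sketch (the dimensions here match Pendulum-v1, purely for illustration):

policy = GaussianPolicy(state_dim=3, action_dim=1)
state = torch.randn(1, 3)
action, log_prob = policy.select_action(state)  # action: shape (1, 1); log_prob: shape (1,)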

The Policy Gradient Theorem

The policy gradient theorem is the theoretical foundation of these methods. It gives the gradient of the objective with respect to the policy parameters:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\right]$$

Understanding the Theorem

The formula tells us:

  1. Objective: maximize the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}[G_t]$
  2. Gradient direction: update the parameters along $\nabla_\theta \log \pi_\theta(a|s)$
  3. Weighting: $Q(s,a)$ acts as a weight, so good actions receive larger updates

Intuition

  • If action $a$ performs well in state $s$ (high Q-value), increase the probability of selecting it
  • If the action performs poorly (low Q-value), decrease its probability
  • The size of the update is proportional to the action's value
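
The theorem rests on the log-derivative trick $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$. A one-step sketch for a single state (treating $Q$ as fixed and ignoring the state distribution, which the full theorem handles):

$$\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[Q(s,a)] = \sum_a Q(s,a)\, \nabla_\theta \pi_\theta(a|s) = \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)\, Q(s,a) = \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q(s,a)\right]$$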

The REINFORCE Algorithm

REINFORCE is the most basic policy gradient algorithm, also known as Monte Carlo policy gradient.

Core Idea

Use the full episode return $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ as an unbiased estimate of $Q(s,a)$:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t$$
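
For example, with $\gamma = 0.9$ and rewards $(r_0, r_1, r_2) = (1, 1, 1)$, the returns computed backwards are

$$G_2 = 1,\qquad G_1 = 1 + 0.9 \cdot G_2 = 1.9,\qquad G_0 = 1 + 0.9 \cdot G_1 = 2.71.$$

This backward recursion $G_t = r_t + \gamma G_{t+1}$ is exactly what the update() method below computes.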

Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque

class REINFORCE:
    def __init__(self, state_dim, action_dim, hidden_dim=128, lr=1e-3, gamma=0.99):
        self.policy = PolicyNetwork(state_dim, action_dim, hidden_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma
        self.saved_log_probs = []
        self.rewards = []

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        self.saved_log_probs.append(dist.log_prob(action))
        return action.item()

    def update(self):
        # Compute discounted returns backwards through the episode
        R = 0
        returns = deque()

        for r in self.rewards[::-1]:
            R = r + self.gamma * R
            returns.appendleft(R)

        returns = torch.tensor(returns)
        # Standardize returns to reduce gradient variance
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        policy_loss = []
        for log_prob, R in zip(self.saved_log_probs, returns):
            policy_loss.append(-log_prob * R)  # negative loss turns ascent into descent

        self.optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()

        # Clear the episode buffers
        self.saved_log_probs = []
        self.rewards = []

        return policy_loss.item()

    def train(self, env, num_episodes=1000):
        rewards_history = []

        for episode in range(num_episodes):
            state, _ = env.reset()
            total_reward = 0
            done = False

            while not done:
                action = self.select_action(state)
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated

                self.rewards.append(reward)
                state = next_state
                total_reward += reward

            loss = self.update()  # one update per episode (Monte Carlo)
            rewards_history.append(total_reward)

            if (episode + 1) % 50 == 0:
                avg_reward = np.mean(rewards_history[-50:])
                print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}")

        return rewards_history

Baseline Subtraction

To reduce variance, a baseline $b(s)$ can be subtracted from the action value:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s)\,(Q(s,a) - b(s))\right]$$

The most common baseline is the state-value function $V(s)$; an implementation follows the short derivation below.
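
Subtracting a baseline leaves the gradient unbiased, because the score function has zero mean under the policy:

$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right] = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\, \nabla_\theta 1 = 0$$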

class REINFORCEWithBaseline:
    def __init__(self, state_dim, action_dim, hidden_dim=128, lr=1e-3, gamma=0.99):
        self.policy = PolicyNetwork(state_dim, action_dim, hidden_dim)
        # Value network used as the baseline b(s) = V(s)
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.value_optimizer = optim.Adam(self.value.parameters(), lr=lr)

        self.gamma = gamma
        self.saved_log_probs = []
        self.saved_values = []
        self.rewards = []

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state)
        value = self.value(state)

        dist = torch.distributions.Categorical(probs)
        action = dist.sample()

        self.saved_log_probs.append(dist.log_prob(action))
        self.saved_values.append(value)

        return action.item()

    def update(self):
        R = 0
        returns = []

        for r in self.rewards[::-1]:
            R = r + self.gamma * R
            returns.insert(0, R)

        returns = torch.tensor(returns, dtype=torch.float32)

        policy_loss = []
        value_loss = []

        for log_prob, value, R in zip(self.saved_log_probs, self.saved_values, returns):
            # .item() detaches the baseline, so the advantage does not backprop into the value net
            advantage = R - value.item()
            policy_loss.append(-log_prob * advantage)
            value_loss.append(F.mse_loss(value.squeeze(), R))

        self.policy_optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.policy_optimizer.step()

        self.value_optimizer.zero_grad()
        value_loss = torch.stack(value_loss).sum()
        value_loss.backward()
        self.value_optimizer.step()

        self.saved_log_probs = []
        self.saved_values = []
        self.rewards = []
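
The class above omits a train method; a hypothetical training loop (episode collection mirrors REINFORCE.train, with env, state_dim, and action_dim assumed to be defined as in the CartPole example below):

agent = REINFORCEWithBaseline(state_dim, action_dim)
for episode in range(500):
    state, _ = env.reset()
    done = False
    while not done:
        action = agent.select_action(state)
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.rewards.append(reward)
    agent.update()  # one policy update and one value update per episode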

Strengths and Weaknesses of Policy Gradients

Strengths

  1. Continuous action spaces: handled naturally
  2. Stochastic policies: can learn an optimal stochastic policy
  3. Smooth updates: the policy changes gradually
  4. Built-in exploration: a stochastic policy explores on its own

Weaknesses

  1. High variance: gradient estimates are noisy, requiring many samples
  2. Low sample efficiency: each sample is used only once
  3. Slow convergence: may converge to a local optimum
  4. Instability: training can be unstable

Common Tricks

Reward Normalization

def normalize_rewards(rewards):
    rewards = np.array(rewards)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

Gradient Clipping

# Call between loss.backward() and optimizer.step()
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)

Learning-Rate Scheduling

scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)
# Call scheduler.step() periodically (e.g., once per episode) to decay the learning rate

Example: CartPole

import gymnasium as gym

env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

agent = REINFORCE(state_dim, action_dim, hidden_dim=128, lr=1e-2, gamma=0.99)
rewards = agent.train(env, num_episodes=1000)

import matplotlib.pyplot as plt
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('REINFORCE on CartPole')
plt.show()

Summary

Policy gradient methods are an important branch of reinforcement learning:

  • Core idea: optimize the policy parameters directly
  • Policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s)\, Q(s,a)]$
  • REINFORCE: estimates the gradient with Monte Carlo returns
  • Baseline subtraction: a key technique for reducing variance
  • Strengths: handles continuous actions, learns stochastic policies
  • Weaknesses: high variance, low sample efficiency

The next chapter introduces Actor-Critic methods, which combine the strengths of policy gradients and value functions to learn more efficiently.