策略梯度方法
策略梯度方法是一类直接优化策略函数的强化学习算法。与基于价值的方法不同,策略梯度方法不通过价值函数间接导出策略,而是直接对策略参数做梯度上升,使能带来更高回报的动作被更频繁地选中。
为什么需要策略梯度?
基于价值方法的局限
Q-Learning 和 DQN 虽然强大,但存在一些固有限制:
- 只能处理离散动作:max 操作无法直接应用于连续动作空间
- 策略可能不稳定:价值函数的微小变化可能导致策略剧变
- 难以学习随机策略:最优策略可能是随机的,但基于价值的方法倾向于确定性策略
策略梯度的优势
策略梯度方法具有以下优势:
- 自然处理连续动作:直接输出动作分布的参数(如均值和标准差)
- 学习随机策略:输出动作概率分布
- 更好的收敛性:策略更新更平滑
- 无需价值函数近似:某些情况下更简单
策略的参数化表示
策略 $\pi_\theta(a \mid s)$ 用参数 $\theta$ 表示,输出在状态 $s$ 下选择动作 $a$ 的概率。
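优化目标是策略所获得的期望(折扣)回报,常见写法如下(记号为通用约定,具体形式视问题设定而定):

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$$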
离散动作空间
使用神经网络输出动作概率分布:
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    """离散动作空间的策略网络:输出各动作的概率分布。"""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)  # 归一化为动作概率
        )

    def forward(self, x):
        return self.network(x)

    def select_action(self, state):
        # 按概率分布采样动作,而不是贪心选取
        probs = self.forward(state)
        action = torch.multinomial(probs, 1)
        return action.item(), probs
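下面是一个最小的调用示意(state_dim=4、action_dim=2 为假设值,仅用于演示接口):

# 仅作演示的最小用法示意(维度为假设值)
policy = PolicyNetwork(state_dim=4, action_dim=2)
state = torch.randn(1, 4)             # 假设的单个状态,batch 维为 1
action, probs = policy.select_action(state)
print(action, probs)                  # 采样到的动作及对应的概率分布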
连续动作空间
对于连续动作,通常使用高斯策略:
class GaussianPolicy(nn.Module):
    """连续动作空间的高斯策略:输出均值和标准差,动作从正态分布中采样。"""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean = nn.Linear(hidden_dim, action_dim)
        # 与状态无关的可学习对数标准差,取 exp 后保证标准差为正
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, x):
        features = self.shared(x)
        mean = self.mean(features)
        std = torch.exp(self.log_std)
        return mean, std

    def select_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        # 各维动作的对数概率求和,得到联合对数概率
        log_prob = dist.log_prob(action).sum(-1)
        return action, log_prob
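同样给出一个最小的采样示意(state_dim=3、action_dim=1 为假设值,实际维度取决于环境):

# 仅作演示:从高斯策略中采样一个连续动作
policy = GaussianPolicy(state_dim=3, action_dim=1)
state = torch.randn(1, 3)
action, log_prob = policy.select_action(state)
print(action.shape, log_prob)         # action 形状为 (1, 1)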
策略梯度定理
策略梯度定理是策略梯度方法的理论基础,它给出了目标函数对策略参数的梯度:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$
理解策略梯度定理
这个公式告诉我们:
- 目标:最大化期望回报 $J(\theta)$
- 梯度方向:沿着 $\nabla_\theta \log \pi_\theta(a \mid s)$ 方向更新参数
- 权重:使用 $Q^{\pi_\theta}(s, a)$ 作为权重,好的动作获得更大的更新
直观理解
- 如果动作 $a$ 在状态 $s$ 下表现好(高 Q 值),增加选择该动作的概率
- 如果动作表现差(低 Q 值),降低选择该动作的概率
- 更新幅度与动作的价值成正比
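这一结论可以用对数导数技巧简要推导(以下为示意性推导,采用轨迹 $\tau$ 的记号,省略了折扣因子与严格条件):

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau = \mathbb{E}_{\tau \sim p_\theta}\!\left[R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

其中环境的状态转移概率不依赖于 $\theta$,因此 $\nabla_\theta \log p_\theta(\tau)$ 中只剩下策略项的梯度。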
REINFORCE 算法
REINFORCE 是最基础的策略梯度算法,也称为蒙特卡洛策略梯度。
算法思想
使用完整的回合回报 $G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k$ 作为 $Q^{\pi_\theta}(s_t, a_t)$ 的无偏估计,对应的梯度估计为:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
算法实现
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque

class REINFORCE:
    def __init__(self, state_dim, action_dim, hidden_dim=128, lr=1e-3, gamma=0.99):
        self.policy = PolicyNetwork(state_dim, action_dim, hidden_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma
        self.saved_log_probs = []  # 当前回合中每一步动作的对数概率
        self.rewards = []          # 当前回合中每一步的即时奖励

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        self.saved_log_probs.append(dist.log_prob(action))
        return action.item()

    def update(self):
        # 从回合末尾向前累积折扣回报 G_t
        R = 0
        returns = deque()
        for r in self.rewards[::-1]:
            R = r + self.gamma * R
            returns.appendleft(R)
        returns = torch.tensor(returns)
        # 标准化回报以降低梯度估计的方差
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        policy_loss = []
        for log_prob, R in zip(self.saved_log_probs, returns):
            # REINFORCE 损失:-log π(a|s) * G_t,最小化它等价于梯度上升
            policy_loss.append(-log_prob * R)
        self.optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()
        # 清空回合缓存,准备下一个回合
        self.saved_log_probs = []
        self.rewards = []
        return policy_loss.item()

    def train(self, env, num_episodes=1000):
        rewards_history = []
        for episode in range(num_episodes):
            state, _ = env.reset()
            total_reward = 0
            done = False
            while not done:
                action = self.select_action(state)
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                self.rewards.append(reward)
                state = next_state
                total_reward += reward
            # 蒙特卡洛方法:回合结束后才进行一次更新
            loss = self.update()
            rewards_history.append(total_reward)
            if (episode + 1) % 50 == 0:
                avg_reward = np.mean(rewards_history[-50:])
                print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}")
        return rewards_history
基线减法
为了减少方差,可以引入基线 $b(s)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\,\big(Q^{\pi_\theta}(s, a) - b(s)\big)\right]$$

只要基线不依赖于动作,减去它不会改变梯度的期望,却能显著降低方差。常用的基线是状态价值函数 $V^{\pi}(s)$,此时权重就是优势 $A(s, a) = Q(s, a) - V(s)$:
class REINFORCEWithBaseline:
    def __init__(self, state_dim, action_dim, hidden_dim=128, lr=1e-3, gamma=0.99):
        self.policy = PolicyNetwork(state_dim, action_dim, hidden_dim)
        # 状态价值网络,作为基线 V(s)
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.value_optimizer = optim.Adam(self.value.parameters(), lr=lr)
        self.gamma = gamma
        self.saved_log_probs = []
        self.saved_values = []
        self.rewards = []

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state)
        value = self.value(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        self.saved_log_probs.append(dist.log_prob(action))
        self.saved_values.append(value)
        return action.item()

    def update(self):
        # 计算各时间步的折扣回报
        R = 0
        returns = []
        for r in self.rewards[::-1]:
            R = r + self.gamma * R
            returns.insert(0, R)
        returns = torch.tensor(returns, dtype=torch.float32)
        policy_loss = []
        value_loss = []
        for log_prob, value, R in zip(self.saved_log_probs, self.saved_values, returns):
            # 优势 = 回报 - 基线;用 .item() 使基线不参与策略梯度
            advantage = R - value.item()
            policy_loss.append(-log_prob * advantage)
            # 价值网络回归拟合回报 G_t,作为下一次更新的基线
            value_loss.append(F.mse_loss(value.squeeze(), R))
        self.policy_optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.policy_optimizer.step()
        self.value_optimizer.zero_grad()
        value_loss = torch.stack(value_loss).sum()
        value_loss.backward()
        self.value_optimizer.step()
        self.saved_log_probs = []
        self.saved_values = []
        self.rewards = []
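上面的类没有给出训练循环;下面是一个最小的训练示意,结构与前面的 REINFORCE.train 相同(假设使用 Gymnasium 风格的环境接口,函数名 train_with_baseline 为本文示例自拟):

# 最小训练循环示意,与 REINFORCE.train 结构一致
def train_with_baseline(agent, env, num_episodes=1000):
    rewards_history = []
    for episode in range(num_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        while not done:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            agent.rewards.append(reward)
            state = next_state
            total_reward += reward
        agent.update()  # 回合结束后同时更新策略网络和价值网络
        rewards_history.append(total_reward)
    return rewards_history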
策略梯度的优缺点
优点
- 连续动作空间:自然处理连续动作
- 随机策略:可以学习最优随机策略
- 平滑更新:策略变化更平滑
- 探索性:随机策略自带探索
缺点
- 高方差:梯度估计方差大,需要大量样本
- 样本效率低:属于 on-policy 方法,每条轨迹通常只用于一次更新
- 收敛慢:可能收敛到局部最优
- 不稳定:训练过程可能不稳定
常见技巧
奖励归一化
def normalize_rewards(rewards):
    # 标准化奖励(或回报),降低梯度估计的方差
    rewards = np.array(rewards)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
梯度裁剪
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
学习率调度
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)
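下面用一个占位的小例子示意这两个技巧在更新流程中的位置(PolicyNetwork 沿用上文定义,维度和损失仅作演示):

# 示意:梯度裁剪放在 backward() 之后、step() 之前;调度器通常按回合推进
import torch
import torch.optim as optim

policy = PolicyNetwork(state_dim=4, action_dim=2)       # 假设的维度,仅作演示
optimizer = optim.Adam(policy.parameters(), lr=1e-2)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)

probs = policy(torch.randn(1, 4))
loss = -torch.log(probs[0, 0])                          # 占位损失,仅用于演示流程
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()  # 每个回合调用一次,学习率按 StepLR 规则衰减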
示例:CartPole
import gymnasium as gym
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = REINFORCE(state_dim, action_dim, hidden_dim=128, lr=1e-2, gamma=0.99)
rewards = agent.train(env, num_episodes=1000)
import matplotlib.pyplot as plt
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('REINFORCE on CartPole')
plt.show()
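REINFORCE 的回合奖励曲线往往噪声较大,绘图前可以先做滑动平均,便于观察趋势(窗口大小 50 仅为示例):

# 对奖励曲线做滑动平均后再绘制(np、plt、rewards 沿用上文)
window = 50
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')
plt.plot(smoothed)
plt.xlabel('Episode')
plt.ylabel('Smoothed Reward')
plt.title('REINFORCE on CartPole (smoothed)')
plt.show()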
小结
策略梯度方法是强化学习的重要分支:
- 核心思想:直接优化策略参数
- 策略梯度定理:$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$
- REINFORCE:使用蒙特卡洛回报估计梯度
- 基线减法:减少方差的重要技巧
- 优势:处理连续动作、学习随机策略
- 劣势:高方差、样本效率低
下一章将介绍 Actor-Critic 方法,它结合了策略梯度和价值函数的优点,能够更高效地学习。