
Reinforcement Learning Cheat Sheet

This document is a quick reference for common reinforcement learning concepts, formulas, and APIs, designed for fast lookup while studying or practicing.

Core Concepts at a Glance

The MDP 5-Tuple

| Element | Symbol | Description | Example |
| --- | --- | --- | --- |
| State space | $S$ | Set of all possible states | All positions on a game board |
| Action space | $A$ | Set of all possible actions | Move up/down/left/right |
| Transition probability | $P(s' \mid s, a)$ | Probability of reaching $s'$ from $s$ via $a$ | - |
| Reward function | $R(s, a, s')$ | Immediate reward | Points scored, penalties |
| Discount factor | $\gamma$ | Discounting of future rewards | Typically 0.99 |
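
For concreteness, here is a minimal sketch of the tuple encoded in plain Python. The two-state MDP itself is hypothetical, invented purely for illustration:

# P[s][a] is a list of (probability, next_state, reward) transitions.
states = ["s0", "s1"]
actions = ["left", "right"]
P = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.9, "s1", 1.0), (0.1, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 2.0)]},
}
gamma = 0.99  # discount factor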

Returns and Value Functions

Discounted return: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$

State-value function: $V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$

Action-value function: $Q^\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$

Advantage function: $A(s,a) = Q(s,a) - V(s)$

Bellman Equations

Bellman expectation equation (state value): $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$

Bellman expectation equation (action value): $Q^\pi(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')\right]$

Bellman optimality equation (state value): $V^*(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$

Bellman optimality equation (action value): $Q^*(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$
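
The optimality equation translates directly into value iteration. A minimal sketch, reusing the dict encoding from the MDP example above (the function name and tolerance are illustrative):

def value_iteration(states, actions, P, gamma=0.99, tol=1e-8):
    # Repeatedly apply the Bellman optimality backup
    # V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V(s')].
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V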

Algorithm Formula Reference

Tabular Methods

Q-Learning

$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$

  • Type: off-policy
  • Key property: learns the value of the optimal policy
  • Best for: discrete state and action spaces (a minimal update sketch follows)
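
As referenced above, a minimal tabular sketch. The state/action counts, alpha, gamma, and epsilon are hypothetical settings, not prescriptions:

import numpy as np

n_states, n_actions = 16, 4                # e.g. a small gridworld
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def epsilon_greedy(s):
    # Behavior policy: explore with probability epsilon, else act greedily.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())

def q_learning_update(s, a, r, s_next, done):
    # Off-policy target: bootstrap from the greedy action in s_next.
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])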

SARSA

$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma Q(s',a') - Q(s,a)\right]$

  • Type: on-policy
  • Key property: learns the value of the current behavior policy
  • Best for: settings where safe behavior during learning matters

Expected SARSA

$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \sum_{a'} \pi(a'|s') Q(s',a') - Q(s,a)\right]$

  • Type: on-policy as usually presented; it generalizes to off-policy use (Q-Learning is the special case with a greedy target policy)
  • Key property: lower variance than SARSA (a target-computation sketch follows)
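
A sketch of the expected target under an ε-greedy policy, as mentioned above; the function name and defaults are illustrative:

import numpy as np

def expected_sarsa_target(Q, s_next, r, done, gamma=0.99, epsilon=0.1):
    # Bootstrap from the expectation of Q(s', .) under the epsilon-greedy
    # policy, rather than a sampled action (SARSA) or the max (Q-Learning).
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[Q[s_next].argmax()] += 1.0 - epsilon
    return r if done else r + gamma * float(probs @ Q[s_next])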

Deep Reinforcement Learning

DQN Loss

$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\right)^2\right]$

where $\theta^-$ are the parameters of the target network.
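
A sketch of this loss in PyTorch; q_net and target_net are assumed networks mapping a state batch to (batch, n_actions) Q-values, and the other tensors come from a replay-buffer sample:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # no gradients flow through the target network
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q
    return F.mse_loss(q_sa, target)

In practice the Huber loss (F.smooth_l1_loss) is often substituted for plain MSE for robustness to outliers.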

Double DQN

$y = r + \gamma\, Q\left(s', \arg\max_{a'} Q(s',a';\theta);\ \theta^-\right)$
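
A sketch of this target, under the same assumed q_net/target_net interface as the DQN example above:

import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # The online network selects a'; the target network evaluates it.
        a_star = q_net(next_states).argmax(dim=1, keepdim=True)
        q_eval = target_net(next_states).gather(1, a_star).squeeze(1)
        return rewards + gamma * (1.0 - dones) * q_eval

Decoupling action selection from evaluation is what curbs the overestimation bias of the plain max.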

Dueling DQN

The network estimates $V(s)$ and $A(s,a)$ in separate streams and recombines them as $Q(s,a) = V(s) + A(s,a)$. Since this decomposition alone leaves $V$ and $A$ unidentifiable, the standard aggregation subtracts the mean advantage:

$Q(s,a) = V(s) + \left(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\right)$
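
A sketch of such a head as a PyTorch module; the feature extractor in front of it is omitted, and the names and sizes are illustrative:

import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)              # V(s) stream
        self.advantage = nn.Linear(feat_dim, n_actions)  # A(s, a) stream

    def forward(self, features):
        v = self.value(features)      # (batch, 1)
        a = self.advantage(features)  # (batch, n_actions)
        # Mean-subtracted aggregation keeps V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)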

Policy Gradient Methods

Policy Gradient Theorem

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\right]$

REINFORCE

$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)})\, G_t^{(i)}$
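
A sketch of the corresponding surrogate loss for one episode; log_probs would be collected during the rollout and the returns computed with compute_returns from the PyTorch section below. Return normalization is an optional, common variance-reduction trick:

import torch

def reinforce_loss(log_probs, returns):
    returns = torch.as_tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Minimizing this loss is equivalent to ascending the REINFORCE gradient.
    return -(torch.stack(log_probs) * returns).sum()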

Policy Gradient with a Baseline

$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s)\left(Q(s,a) - b(s)\right)\right]$

A common baseline is $b(s) = V(s)$.

Actor-Critic Losses

$L_{actor} = -\log \pi(a|s) \cdot A(s,a)$

$L_{critic} = \left(V(s) - R\right)^2$
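
A sketch combining the two losses into one objective over a batch of tensors; the entropy bonus and the coefficient values mirror typical A2C/PPO defaults and are assumptions, not a fixed rule:

import torch.nn.functional as F

def actor_critic_loss(log_prob, value, ret, entropy, ent_coef=0.01, vf_coef=0.5):
    advantage = (ret - value).detach()  # no actor gradients through the critic
    actor_loss = -(log_prob * advantage).mean()
    critic_loss = F.mse_loss(value, ret)
    return actor_loss + vf_coef * critic_loss - ent_coef * entropy.mean()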

PPO-Clip

$L^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the importance ratio.
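
A sketch of the clipped surrogate for one minibatch; the old log-probabilities are stored at rollout time, not recomputed:

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negate to maximize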

SAC (Soft Actor-Critic)

SAC is built on maximum-entropy reinforcement learning; its objective is:

$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(R(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]$

Q-network target (with $\tilde{a}' \sim \pi_\theta(\cdot|s')$ sampled from the current policy):

$y = r + \gamma \left(\min_{j=1,2} Q_{\phi_{targ,j}}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}'|s')\right)$

Policy loss:

$L_\pi(\theta) = \mathbb{E}\left[\alpha \log \pi_\theta(\tilde{a}_\theta(s,\xi)|s) - \min_{j=1,2} Q_{\phi_j}(s, \tilde{a}_\theta(s,\xi))\right]$

  • Type: off-policy
  • Key properties: entropy regularization, automatic exploration, high sample efficiency
  • Best for: continuous action spaces (a target-computation sketch follows)
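
As referenced above, a sketch of the Q-target; policy.sample is an assumed interface returning a sampled action and its log-probability, and q1_targ/q2_targ are the twin target critics:

import torch

def sac_q_target(policy, q1_targ, q2_targ, rewards, next_states, dones,
                 gamma=0.99, alpha=0.2):
    with torch.no_grad():
        # a' is drawn from the *current* policy, not from the replay buffer.
        next_actions, next_log_probs = policy.sample(next_states)
        q_min = torch.min(q1_targ(next_states, next_actions),
                          q2_targ(next_states, next_actions))
        return rewards + gamma * (1.0 - dones) * (q_min - alpha * next_log_probs)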

GAE (Generalized Advantage Estimation)

$\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty}(\gamma \lambda)^l \delta_{t+l}$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error (see compute_gae in the PyTorch section below).

Continuous Action Spaces

Gaussian Policy

$\pi_\theta(a|s) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(a-\mu_\theta(s))^2}{2\sigma^2}\right)$
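
A minimal sketch of a diagonal-Gaussian policy head in PyTorch; making the log-std a learned, state-independent parameter is a common simplification, not the only choice:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def get_action(self, state):
        dist = torch.distributions.Normal(self.mu(state), self.log_std.exp())
        action = dist.sample()
        # Sum per-dimension log-probs for a diagonal Gaussian.
        return action, dist.log_prob(action).sum(-1)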

SAC (Soft Actor-Critic)

SAC is the standard choice for continuous control; see the SAC subsection above for the Q-target and loss details. The policy objective in expectation form:

$J(\pi) = \mathbb{E}_{s \sim \mathcal{D}, a \sim \pi}\left[\alpha \log \pi(a|s) - Q(s,a)\right]$

Entropy-regularized objective: $\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_t \gamma^t \left(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]$

Gymnasium API Reference

Basic Usage

import gymnasium as gym

env = gym.make('CartPole-v1')
obs, info = env.reset(seed=42)          # reset returns (observation, info)
action = env.action_space.sample()      # e.g. sample a random action
obs, reward, terminated, truncated, info = env.step(action)
env.close()

step() Return Values

| Return value | Type | Description |
| --- | --- | --- |
| observation | np.ndarray | New observation |
| reward | float | Immediate reward |
| terminated | bool | Whether the episode ended naturally |
| truncated | bool | Whether the episode was cut short (e.g. by a time limit) |
| info | dict | Auxiliary information |

Space Types

from gymnasium import spaces
import numpy as np

spaces.Box(low=0, high=1, shape=(4,), dtype=np.float32)  # continuous, bounded box
spaces.Discrete(2)                                       # integers {0, 1}
spaces.MultiDiscrete([2, 3, 2])                          # several discrete dimensions
spaces.MultiBinary(4)                                    # 4 binary flags
spaces.Dict({'obs': spaces.Box(...), 'action': spaces.Discrete(2)})
spaces.Tuple((spaces.Box(...), spaces.Discrete(3)))

Common Wrappers

from gymnasium.wrappers import (
    TimeLimit, RecordVideo, NormalizeObservation,
    NormalizeReward, RecordEpisodeStatistics,
    FrameStackObservation, TransformObservation
)

env = TimeLimit(env, max_episode_steps=500)      # cap episode length
env = RecordVideo(env, video_folder='./videos')  # save rollout videos
env = NormalizeObservation(env)                  # running observation normalization
env = NormalizeReward(env, gamma=0.99)           # running reward normalization
env = RecordEpisodeStatistics(env)               # episode return/length in info
env = FrameStackObservation(env, stack_size=4)   # stack the last 4 observations

Vectorized Environments

import gymnasium as gym
from gymnasium.vector import SyncVectorEnv, AsyncVectorEnv

envs = SyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])   # single process
envs = AsyncVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])  # subprocesses

# Note: make_vec_env is a Stable Baselines3 helper (stable_baselines3.common.env_util),
# not part of gymnasium.vector. Gymnasium's own shortcut is gym.make_vec:
envs = gym.make_vec('CartPole-v1', num_envs=4, vectorization_mode='async')

Stable Baselines3 API Reference

Creating a Model

from stable_baselines3 import PPO, DQN, SAC, TD3, A2C

model = PPO('MlpPolicy', env, learning_rate=3e-4, verbose=1)
model = DQN('MlpPolicy', env, learning_rate=1e-4, buffer_size=100000)
model = SAC('MlpPolicy', env, learning_rate=3e-4)
model = TD3('MlpPolicy', env, learning_rate=3e-4)
model = A2C('MlpPolicy', env, learning_rate=7e-4)

model = PPO('CnnPolicy', env)  # CNN policy for image observations

Training and Prediction

model.learn(total_timesteps=100000)

action, _states = model.predict(observation, deterministic=True)   # greedy action

action, _states = model.predict(observation, deterministic=False)  # sample from the policy

Saving and Loading

model.save('ppo_cartpole')
model = PPO.load('ppo_cartpole', env=env)

model.save_replay_buffer('replay_buffer')  # off-policy algorithms only
model.load_replay_buffer('replay_buffer')

Callbacks

from stable_baselines3.common.callbacks import (
    EvalCallback, CheckpointCallback, CallbackList, BaseCallback
)

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path='./best/',
    log_path='./logs/',
    eval_freq=10000,
    n_eval_episodes=5,
    deterministic=True
)

checkpoint_callback = CheckpointCallback(
    save_freq=10000,
    save_path='./checkpoints/',
    name_prefix='model'
)

model.learn(total_timesteps=100000, callback=CallbackList([
    eval_callback, checkpoint_callback
]))

Custom Policy Network

import torch
import torch.nn as nn

policy_kwargs = dict(
    net_arch=[64, 64],
    activation_fn=nn.ReLU,
    optimizer_class=torch.optim.Adam,
    optimizer_kwargs=dict(lr=3e-4)
)

model = PPO('MlpPolicy', env, policy_kwargs=policy_kwargs)

Recommended Hyperparameters

PPO

| Parameter | Recommended value | Description |
| --- | --- | --- |
| learning_rate | 3e-4 | Learning rate |
| n_steps | 2048 | Steps collected per update |
| batch_size | 64 | Minibatch size |
| n_epochs | 10 | Training epochs per update |
| gamma | 0.99 | Discount factor |
| gae_lambda | 0.95 | GAE parameter |
| clip_range | 0.2 | PPO clipping parameter |
| ent_coef | 0.01 | Entropy coefficient |
| vf_coef | 0.5 | Value function coefficient |

DQN

| Parameter | Recommended value | Description |
| --- | --- | --- |
| learning_rate | 1e-4 | Learning rate |
| buffer_size | 100000 | Replay buffer size |
| batch_size | 32 or 64 | Minibatch size |
| gamma | 0.99 | Discount factor |
| target_update_interval | 1000 to 10000 | Target network update frequency |
| exploration_fraction | 0.1 | Fraction of training spent annealing ε |
| exploration_final_eps | 0.01 | Final exploration rate |
| train_freq | 4 | Training frequency |
| gradient_steps | 1 | Gradient steps per training call |

SAC

| Parameter | Recommended value | Description |
| --- | --- | --- |
| learning_rate | 3e-4 | Learning rate |
| buffer_size | 1000000 | Replay buffer size |
| batch_size | 256 | Minibatch size |
| gamma | 0.99 | Discount factor |
| tau | 0.005 | Soft-update coefficient |
| ent_coef | 'auto' | Entropy coefficient (tuned automatically) |
| target_update_interval | 1 | Target network update frequency |
| gradient_steps | 1 | Gradient steps per training call |

TD3

| Parameter | Recommended value | Description |
| --- | --- | --- |
| learning_rate | 3e-4 | Learning rate |
| buffer_size | 1000000 | Replay buffer size |
| batch_size | 256 | Minibatch size |
| gamma | 0.99 | Discount factor |
| tau | 0.005 | Soft-update coefficient |
| policy_delay | 2 | Delayed policy updates |
| target_policy_noise | 0.2 | Target policy smoothing noise |
| target_noise_clip | 0.5 | Target noise clipping |

Common PyTorch Snippets

Policy Network

import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.network(x)

    def get_action(self, x):
        probs = self.forward(x)
        dist = Categorical(probs)
        action = dist.sample()
        return action, dist.log_prob(action), dist.entropy()

Value Network

class ValueNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, x):
        return self.network(x)

Replay Buffer

from collections import deque
import random
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            np.array(states),
            np.array(actions),
            np.array(rewards, dtype=np.float32),
            np.array(next_states),
            np.array(dones, dtype=np.float32)
        )

    def __len__(self):
        return len(self.buffer)

Computing GAE

def compute_gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    advantages = []
    gae = 0.0

    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            next_val = next_value
        else:
            next_val = values[t + 1]

        delta = rewards[t] + gamma * next_val * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.insert(0, gae)

    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

Computing Discounted Returns

def compute_returns(rewards, gamma=0.99):
    returns = []
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns

Common Environments

Classic Control

| Environment | Action space | Observation space | Description |
| --- | --- | --- | --- |
| CartPole-v1 | Discrete(2) | Box(4,) | Cart-pole balancing |
| MountainCar-v0 | Discrete(3) | Box(2,) | Mountain car |
| Acrobot-v1 | Discrete(3) | Box(6,) | Underactuated pendulum |
| Pendulum-v1 | Box(1,) | Box(3,) | Pendulum (continuous actions) |

Box2D

| Environment | Action space | Observation space | Description |
| --- | --- | --- | --- |
| LunarLander-v2 | Discrete(4) | Box(8,) | Lunar lander |
| LunarLanderContinuous-v2 | Box(2,) | Box(8,) | Lunar lander (continuous) |
| BipedalWalker-v3 | Box(4,) | Box(24,) | Bipedal walker |

MuJoCo

| Environment | Action space | Observation space | Description |
| --- | --- | --- | --- |
| HalfCheetah-v4 | Box(6,) | Box(17,) | Half-cheetah |
| Humanoid-v4 | Box(17,) | Box(376,) | Humanoid robot |
| Ant-v4 | Box(8,) | Box(111,) | Ant |
| Reacher-v4 | Box(2,) | Box(11,) | Robotic reacher arm |
| Hopper-v4 | Box(3,) | Box(11,) | One-legged hopper |

Algorithm Selection Guide

| Scenario | Recommended algorithm | Rationale |
| --- | --- | --- |
| Discrete actions, simple task | DQN | Simple and efficient |
| Discrete actions, complex task | PPO | Stable and reliable |
| Continuous actions, simple task | SAC | High sample efficiency |
| Continuous actions, complex task | PPO / SAC / TD3 | Pick based on the task |
| High sample efficiency needed | SAC / TD3 | Off-policy, reuses past data |
| High training stability needed | PPO | Update size is kept in check |
| Multi-process training | PPO / A2C | Work well with vectorized environments |
| Sparse rewards | HER + DQN/SAC | Goal relabeling |

Debugging Tips

Checking the Environment

from gymnasium.utils.env_checker import check_env

check_env(env, warn=True)

Monitoring Training

from stable_baselines3.common.callbacks import BaseCallback

class TensorboardCallback(BaseCallback):
    def __init__(self, verbose=0):
        super().__init__(verbose)

    def _on_step(self):
        if self.n_calls % 100 == 0:
            self.logger.record('custom/episode_reward',
                               self.locals.get('episode_reward', 0))
        return True

Gradient Checks

# Works on any torch nn.Module; for a Stable Baselines3 model, iterate over
# model.policy.named_parameters() instead.
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_mean={param.grad.mean():.6f}, "
              f"grad_std={param.grad.std():.6f}")

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Reward not improving | Learning rate too high or too low | Tune the learning rate |
| Reward oscillates wildly | Update steps too large | Lower the learning rate or increase batch_size |
| Reward suddenly collapses | Policy collapse | Use PPO or reduce the update size |
| Too little exploration | ε decays too fast | Slow down the ε decay |
| Overfitting | Trained too long | Early stopping or regularization |

References

Official Documentation

  • Gymnasium: https://gymnasium.farama.org
  • Stable Baselines3: https://stable-baselines3.readthedocs.io

Classic Textbook

  • Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed., 2018)

Classic Papers

  • DQN: Playing Atari with Deep Reinforcement Learning (2013)
  • Double DQN: Deep Reinforcement Learning with Double Q-learning (2015)
  • Dueling DQN: Dueling Network Architectures for Deep Reinforcement Learning (2016)
  • PPO: Proximal Policy Optimization Algorithms (2017)
  • SAC: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (2018)
  • TD3: Addressing Function Approximation Error in Actor-Critic Methods (2018)