Automatic Differentiation

Automatic differentiation is one of the core techniques behind deep learning. TensorFlow provides tf.GradientTape to compute derivatives automatically.

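Gradient Computation Basics

Basic Usage

tf.GradientTape records the operations executed inside its context, then computes gradients from that record by reverse-mode differentiation (backpropagation):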

import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x ** 2

dy_dx = tape.gradient(y, x)
print(dy_dx)  # tf.Tensor(6.0, shape=(), dtype=float32)

This example computes the derivative of $y = x^2$ at $x = 3$. By the power rule, $\frac{dy}{dx} = 2x = 6$.
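One way to gain confidence in a gradient returned by the tape is to compare it with a numerical estimate. The sketch below is an illustration only: the helper f and the step size h are assumptions chosen for this check, not part of the TensorFlow API.

```python
import tensorflow as tf


def f(x):
    # Same function as in the example above: y = x^2
    return x ** 2


x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = f(x)
autodiff = float(tape.gradient(y, x))

# Central finite difference, evaluated with plain Python floats
h = 1e-4  # step size picked for illustration
numeric = (f(3.0 + h) - f(3.0 - h)) / (2 * h)

print("autodiff:", autodiff)
print("finite difference:", numeric)
```

Both values should agree closely with the analytic result $2x = 6$. The finite difference is only an approximation, so a small discrepancy is expected.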

Walking Through the Computation

x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape() as tape:
    z = x * y
    loss = z ** 2

# Gradients of loss with respect to x and y
grads = tape.gradient(loss, [x, y])
print("dloss/dx:", grads[0])  # 36.0 = 2 * z * y = 2 * 6 * 3
print("dloss/dy:", grads[1])  # 24.0 = 2 * z * x = 2 * 6 * 2
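The numbers in the comments follow from the chain rule. Written out for this example, with $z = xy$, $\mathrm{loss} = z^2$, $x = 2$ and $y = 3$ (so $z = 6$):

```latex
\begin{aligned}
\frac{\partial\,\mathrm{loss}}{\partial x}
  &= \frac{d\,\mathrm{loss}}{dz}\cdot\frac{\partial z}{\partial x}
   = 2z \cdot y = 2 \cdot 6 \cdot 3 = 36 \\
\frac{\partial\,\mathrm{loss}}{\partial y}
  &= \frac{d\,\mathrm{loss}}{dz}\cdot\frac{\partial z}{\partial y}
   = 2z \cdot x = 2 \cdot 6 \cdot 2 = 24
\end{aligned}
```

The tape applies the same rule mechanically: it walks the recorded operations in reverse order and multiplies the local derivatives along the way.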

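Controlling GradientTape

Watching Variables

By default, GradientTape automatically watches only trainable tf.Variable objects. To differentiate with respect to an ordinary tensor, call watch explicitly:

x = tf.constant(3.0)  # a constant tensor, not a Variable

with tf.GradientTape() as tape:
    tape.watch(x)  # watch x explicitly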
    y = x ** 2

dy_dx = tape.gradient(y, x)
print(dy_dx)  # tf.Tensor(6.0, ...)
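The effect of watching can be seen by leaving it out. The following sketch, with variable names chosen only for illustration, first differentiates with respect to an unwatched constant, then uses the watch_accessed_variables=False argument of tf.GradientTape to turn off automatic watching so that only the explicitly watched variable is tracked.

```python
import tensorflow as tf

# An unwatched constant: the tape records nothing for it
c = tf.constant(3.0)
with tf.GradientTape() as tape:
    y = c ** 2
g_const = tape.gradient(y, c)
print(g_const)  # None

# Disable automatic watching and pick the variables by hand
v = tf.Variable(3.0)
frozen = tf.Variable(2.0)
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(v)  # only v is tracked; frozen is ignored
    y = v ** 2 * frozen
dv, dfrozen = tape.gradient(y, [v, frozen])
print(dv)       # 12.0 = 2 * v * frozen
print(dfrozen)  # None
```

Restricting what the tape watches can be useful when only part of a model should receive gradients.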

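The persistent Argument

By default, the resources held by a GradientTape are released as soon as gradient is called, so the tape can be used only once. To compute several gradients from the same recording, set persistent=True: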

x = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    y = x ** 2
    z = x ** 3

dy_dx = tape.gradient(y, x)  # 6.0
dz_dx = tape.gradient(z, x)  # 27.0

del tape  # release the tape's resources manually
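For contrast, here is a small sketch of what happens without persistent=True. The flag second_call_failed is a name introduced only for this illustration.

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as tape:  # persistent defaults to False
    y = x ** 2
    z = x ** 3

dy_dx = tape.gradient(y, x)  # the first call works: 6.0

try:
    tape.gradient(z, x)  # second call on the same non-persistent tape
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print("RuntimeError:", err)
```

The second call raises a RuntimeError because the recording has already been released.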

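Stopping Gradients

tf.stop_gradient blocks gradients from flowing back through a tensor; during backpropagation its result is treated as a constant: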

x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    y = x ** 2
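    y_stop = tf.stop_gradient(y)  # treat y as a constant from here on
    z = y_stop * x

dz_dx = tape.gradient(z, x)
print(dz_dx)  # 4.0: only the direct factor x contributes; without stop_gradient it would be 3 * x ** 2 = 12.0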

Gradients with Multiple Variables

Computing Several Gradients

# Several variables with different shapes
x = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
w = tf.Variable([[5.0], [6.0]])
b = tf.Variable([[1.0]])

with tf.GradientTape() as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y ** 2)

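# Gradients with respect to every variable; each has the same shape as its variable
grads = tape.gradient(loss, [x, w, b])
print("dx:", grads[0])  # shape (2, 2): [[90., 108.], [200., 240.]]
print("dw:", grads[1])  # shape (2, 1): [[138.], [196.]]
print("db:", grads[2])  # shape (1, 1): [[58.]]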

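Storing Gradients in a Dictionary

tape.gradient returns its results in the same structure as the sources it is given, so passing a dictionary of variables yields a dictionary of gradients: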

x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape() as tape:
    z = x ** 2 + y ** 2

grads = tape.gradient(z, {'x': x, 'y': y})
print(grads['x'])  # 4.0
print(grads['y'])  # 6.0

Higher-Order Derivatives

Nesting GradientTape contexts makes it possible to compute higher-order derivatives:

x = tf.Variable(3.0)

with tf.GradientTape() as tape1:
    with tf.GradientTape() as tape2:
        y = x ** 3
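    dy_dx = tape2.gradient(y, x)  # first derivative: 3x^2 = 27 (computed inside tape1 so that tape1 records it)
d2y_dx2 = tape1.gradient(dy_dx, x)  # second derivative: 6x = 18

print("First derivative:", dy_dx)    # 27.0
print("Second derivative:", d2y_dx2)  # 18.0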

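A Practical Application: Linear Regression

The following example uses automatic differentiation to fit a simple linear regression model: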

import tensorflow as tf
import numpy as np

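# Generate synthetic data: y = 3x + 2 plus Gaussian noise
np.random.seed(42)
X = np.random.rand(100).astype(np.float32)
# Cast to float32 so the targets match the dtype of W and b
y = (3 * X + 2 + np.random.randn(100) * 0.1).astype(np.float32)

# Initialize the parameters
W = tf.Variable(0.0)
b = tf.Variable(0.0)

# Training hyperparameters
learning_rate = 0.1
epochs = 100

# Training loop
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        # Forward pass
        y_pred = W * X + b
        # Loss: mean squared error
        loss = tf.reduce_mean((y_pred - y) ** 2)

    # Compute the gradients
    gradients = tape.gradient(loss, [W, b])

    # Gradient descent update
    W.assign_sub(learning_rate * gradients[0])
    b.assign_sub(learning_rate * gradients[1])

    if epoch % 20 == 0:
        # .numpy() yields NumPy scalars, which accept format specifiers such as :.4f
        print(f"Epoch {epoch}: loss = {loss.numpy():.4f}, W = {W.numpy():.4f}, b = {b.numpy():.4f}")

print(f"\nFinal result: W = {W.numpy():.4f}, b = {b.numpy():.4f}")
# The parameters move toward W = 3 and b = 2. With this learning rate and
# number of epochs they may not have fully converged; training longer gets closer.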

Custom Gradients

tf.custom_gradient lets you override how the gradient of a function is computed:

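@tf.custom_gradient
def clip_gradient(x):
    # Identity in the forward pass; clips the incoming gradient in the backward pass
    def grad(upstream):
        # Clip the upstream gradient to the range [-1, 1]
        return tf.clip_by_value(upstream, -1.0, 1.0)
    return x, grad

x = tf.Variable(10.0)

with tf.GradientTape() as tape:
    # Multiplying by 100 sends an upstream gradient of 100 into clip_gradient
    y = clip_gradient(x) * 100

print(tape.gradient(y, x))  # 1.0 (the upstream gradient 100 is clipped to 1)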
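Another frequently cited use of tf.custom_gradient is numerical stability. The sketch below is an illustration, not part of the original example: the function names log1pexp and naive_log1pexp are chosen here. It computes $\log(1 + e^x)$ and supplies a gradient formula that stays finite even when $e^x$ overflows in float32.

```python
import tensorflow as tf


@tf.custom_gradient
def log1pexp(x):
    e = tf.exp(x)

    def grad(upstream):
        # d/dx log(1 + e^x) = 1 - 1 / (1 + e^x), finite even when e overflows
        return upstream * (1.0 - 1.0 / (1.0 + e))

    return tf.math.log(1.0 + e), grad


def naive_log1pexp(x):
    # Same forward computation, differentiated automatically
    return tf.math.log(1.0 + tf.exp(x))


x = tf.constant(100.0)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = naive_log1pexp(x)
naive_grad = tape.gradient(y, x)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = log1pexp(x)
stable_grad = tape.gradient(y, x)

print("naive gradient:", naive_grad.numpy())    # nan
print("custom gradient:", stable_grad.numpy())  # 1.0
```

The automatically derived gradient multiplies 0 by infinity at $x = 100$ and yields nan, while the hand-written formula returns the correct limit of 1.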

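Gradient Clipping

When training deep networks, gradients can become very large and destabilize training. Gradient clipping limits their magnitude: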

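# Some example gradients
gradients = [tf.constant([10.0, 20.0, 30.0])]

# Clip by value: each element is limited to [-1, 1]
clipped = [tf.clip_by_value(g, -1.0, 1.0) for g in gradients]
print(clipped)  # one tensor with values [1.0, 1.0, 1.0]

# Clip by global norm: returns the clipped list and the norm before clipping
clipped, global_norm = tf.clip_by_global_norm(gradients, clip_norm=1.0)
print(clipped[0])  # rescaled so that the global norm does not exceed 1.0

# Clip by average norm: clip_by_average_norm is a deprecated TF1 API that takes
# a single tensor; the TF2 equivalent scales clip_norm by the element count
clipped = [tf.clip_by_norm(g, 1.0 * tf.cast(tf.size(g), tf.float32)) for g in gradients]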
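To make the global-norm variant more concrete, the sketch below recomputes it by hand and compares the result with tf.clip_by_global_norm. The two example tensors and the names manual and reference are assumptions made for this illustration.

```python
import numpy as np
import tensorflow as tf

# Two gradient tensors, as if they belonged to two different variables
gradients = [tf.constant([3.0, 4.0]), tf.constant([12.0])]
clip_norm = 1.0

# Global norm over all tensors together: sqrt(3^2 + 4^2 + 12^2) = 13
global_norm = tf.sqrt(sum(tf.reduce_sum(g ** 2) for g in gradients))

# Scale every tensor by the same factor; no scaling if the norm is already small
scale = clip_norm / tf.maximum(global_norm, clip_norm)
manual = [g * scale for g in gradients]

reference, reference_norm = tf.clip_by_global_norm(gradients, clip_norm)

for m, r in zip(manual, reference):
    print(m.numpy(), r.numpy())
```

Because all tensors share one scaling factor, the direction of the overall gradient is preserved, which is the main difference from clipping each element by value.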

Jacobian and Hessian

The Jacobian Matrix

The Jacobian matrix collects all first-order partial derivatives of a vector-valued function of several variables:

x = tf.Variable([1.0, 2.0])

with tf.GradientTape(persistent=True) as tape:
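    # Stack the outputs into a single tensor; jacobian expects a tensor target
    y = tf.stack([x[0] ** 2 + x[1], x[0] * x[1]])

# Compute the Jacobian
jacobian = tape.jacobian(y, x)
print(jacobian)
# [[2. 1.]  (partial derivatives of y[0]: [2 * x[0], 1])
#  [2. 1.]] (partial derivatives of y[1]: [x[1], x[0]])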
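The printed matrix can be checked by hand. For the function in this example, $y_0 = x_0^2 + x_1$ and $y_1 = x_0 x_1$, evaluated at $x = (1, 2)$:

```latex
J =
\begin{pmatrix}
\dfrac{\partial y_0}{\partial x_0} & \dfrac{\partial y_0}{\partial x_1} \\[2ex]
\dfrac{\partial y_1}{\partial x_0} & \dfrac{\partial y_1}{\partial x_1}
\end{pmatrix}
=
\begin{pmatrix}
2x_0 & 1 \\
x_1 & x_0
\end{pmatrix},
\qquad
J\big|_{x = (1,\,2)} =
\begin{pmatrix}
2 & 1 \\
2 & 1
\end{pmatrix}
```

The two rows are identical only because of the chosen point; at a different $x$ they would generally differ.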

The Hessian Matrix

The Hessian matrix contains the second-order partial derivatives of a scalar-valued function:

x = tf.Variable([1.0, 2.0])

def f(x):
    return x[0] ** 2 + x[0] * x[1] + x[1] ** 2

with tf.GradientTape() as tape1:
    with tf.GradientTape() as tape2:
        y = f(x)
    grad = tape2.gradient(y, x)
hessian = tape1.jacobian(grad, x)

print("Gradient:", grad)  # [4. 5.]
print("Hessian:", hessian)
# [[2. 1.]
#  [1. 2.]]

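Common Problems

Gradient Is None

If gradient returns None, the likely causes are:

The source is not watched: use a tf.Variable, or call watch explicitly
The target is computed outside the tape: make sure the whole computation runs inside the with block
The computation leaves TensorFlow: for example, it goes through NumPy operations or converts a tensor with .numpy()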

x = tf.Variable(3.0)

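# Incorrect
with tf.GradientTape() as tape:
    y = x ** 2
    z = y + 1
loss = z ** 2  # loss is computed outside the tape, so it is not recorded
print(tape.gradient(loss, x))  # None

# Correct
with tf.GradientTape() as tape:
    y = x ** 2
    z = y + 1
    loss = z ** 2

grad = tape.gradient(loss, x)
print(grad)  # 120.0 = 2 * z * 2 * x = 20 * 6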
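The case where a computation leaves TensorFlow can be reproduced in a similar way. The sketch below also covers one further situation not listed above, a target that does not depend on the source at all, and shows the unconnected_gradients argument of tape.gradient, which returns zeros instead of None. The variable names are chosen only for illustration.

```python
import numpy as np
import tensorflow as tf

x = tf.Variable(3.0)

# The computation leaves TensorFlow: the tape cannot see work done in NumPy
with tf.GradientTape() as tape:
    y = tf.constant(np.square(x.numpy()))
    loss = y + 1.0
grad_numpy = tape.gradient(loss, x)
print(grad_numpy)  # None

# The target does not depend on the source
unused = tf.Variable(1.0)
with tf.GradientTape() as tape:
    loss = x ** 2
grad_unused = tape.gradient(loss, unused)
print(grad_unused)  # None

# Ask for zeros instead of None for unconnected sources
with tf.GradientTape() as tape:
    loss = x ** 2
grad_zero = tape.gradient(
    loss, unused, unconnected_gradients=tf.UnconnectedGradients.ZERO
)
print(grad_zero)  # tf.Tensor(0.0, shape=(), dtype=float32)
```

Returning zeros can simplify code that sums or applies gradients, but it also hides the fact that a variable is disconnected, so None is often the more useful signal while debugging.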

Gradients of Integer Types

Gradients cannot be computed for integer types:

x = tf.Variable(3)  # int32

with tf.GradientTape() as tape:
    y = x ** 2

print(tape.gradient(y, x))  # None

# Fix: use a floating-point type
x = tf.Variable(3.0)  # float32

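Summary

This chapter introduced TensorFlow's automatic differentiation mechanism:

Computing gradients with tf.GradientTape
Controlling what the tape watches
Computing gradients for multiple variables and higher-order derivatives
Defining custom gradient functions
Gradient clipping techniques
Jacobian and Hessian matrices

Automatic differentiation is a core feature of deep learning frameworks, and understanding how it works is important for debugging and optimizing models. The next chapter covers how to build neural network models with Keras.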
