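Convolutional Neural Networks (CNN)

The convolutional neural network (CNN) is one of the most successful architectures in deep learning and is especially well suited to visual data such as images and video. Unlike a traditional fully connected network, a CNN exploits the spatial structure of an image, extracting features efficiently through local connectivity and weight sharing.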

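Basic Concepts of Convolution

What Is Convolution?

Convolution is a mathematical operation used in image processing to extract features. Picture a small "window" sliding across the image: at each position it sees only a small patch and computes a weighted sum of the pixels inside it. That window is the convolution kernel, also called a filter. (Strictly speaking, deep learning frameworks compute cross-correlation, meaning the kernel is not flipped, but the operation is conventionally called convolution.)

For example, take a 5×5 image and a 3×3 kernel. The kernel starts at the top-left corner and covers the first 3×3 region; the sum of the 9 element-wise products of pixels and kernel weights gives one value of the output matrix. The kernel then moves one position to the right (this distance is the stride) and repeats the computation until it has covered the whole image, producing a 3×3 output.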

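Top-left 3×3 window of the input      Kernel (3×3)

1 1 1                                 1 0 1
0 1 1                ×                1 1 1
0 0 1                                 0 1 0

First output value (window pixel × kernel weight, summed over all 9 positions):

(1×1 + 1×0 + 1×1 +
 0×1 + 1×1 + 1×1 +
 0×0 + 0×1 + 1×0) = 4

The remaining eight values of the 3×3 output are obtained the same way as the kernel slides across the image.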
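The sliding-window computation can be sketched in plain Python. This is only an illustration, not how PyTorch implements convolution: the function name `conv2d_valid` and the 5×5 `image` are assumptions, chosen so that the top-left window reproduces the value 4 from the worked example; the kernel is the one shown in the example.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (lists of lists), no padding, no kernel flip."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the k×k window whose top-left corner is (i*stride, j*stride)
            total = sum(
                image[i * stride + di][j * stride + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(total)
        output.append(row)
    return output

# Hypothetical 5×5 input (an assumption for illustration)
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]

result = conv2d_valid(image, kernel)
print(result[0][0])                  # 4
print(len(result), len(result[0]))   # 3 3
```

A 5×5 input with a 3×3 kernel and stride 1 leaves three valid positions per axis, which is why the output is 3×3.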

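Three Core Ideas of CNNs

CNN design rests on three key ideas that make it particularly suitable for image data:

Local connectivity: each neuron connects only to a local region of the previous layer rather than to all of it. This matches how images behave: nearby pixels are more strongly correlated. To detect an eye, for instance, the pixels around the eye matter; the rest of the image does not.

Weight sharing: the same kernel slides over the entire image with the same weights, so one kernel can detect an eye whether it appears in the top-left or the bottom-right corner. This greatly reduces the number of parameters and makes the layer translation-equivariant: shifting the input shifts the feature map accordingly. Combined with pooling, this gives the model approximate translation invariance.

Hierarchical feature extraction: by stacking convolutional layers, a CNN extracts features layer by layer, from simple to complex. Shallow layers tend to detect edges and colors, deeper layers respond to shapes and object parts, and the final layers represent whole objects.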
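To make the parameter savings concrete, the following sketch compares weight counts for mapping a 3×32×32 input to a 64×32×32 output. The layer sizes are illustrative assumptions, and biases are ignored.

```python
# Illustrative sizes (assumptions): 3-channel 32×32 input, 64-channel 32×32 output
in_ch, out_ch, height, width, k = 3, 64, 32, 32, 3

# Fully connected: every output value has its own weight for every input value
fc_weights = (in_ch * height * width) * (out_ch * height * width)

# Local connectivity only: each output value sees a k×k patch, weights not shared
local_weights = (in_ch * k * k) * (out_ch * height * width)

# Convolution: local connectivity plus one shared kernel per output channel
conv_weights = in_ch * k * k * out_ch

print(f"fully connected: {fc_weights:,}")     # 201,326,592
print(f"local, unshared: {local_weights:,}")  # 1,769,472
print(f"convolution:     {conv_weights:,}")   # 1,728
```

Local connectivity removes most of the weights, and sharing them across positions removes the dependence on image size altogether.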

Convolutional Layers in Detail

nn.Conv2d Parameters Explained

In PyTorch, nn.Conv2d is the most commonly used 2D convolutional layer:

import torch
import torch.nn as nn

conv = nn.Conv2d(
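    in_channels=3,         # number of input channels (3 for RGB images, 1 for grayscale)
    out_channels=64,       # number of output channels, equal to the number of kernels
    kernel_size=3,         # kernel size: an int or a (height, width) tuple
    stride=1,              # stride: how far the kernel moves at each step
    padding=0,             # padding: number of pixels added to each side of the input
    dilation=1,            # dilation: spacing between kernel elements
    groups=1,              # number of groups, used for grouped convolution
    bias=True,             # whether to add a learnable bias
    padding_mode='zeros'   # padding mode: 'zeros', 'reflect', 'replicate', 'circular'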
)

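Understanding what these parameters mean and how they interact is key to mastering CNNs.

Output Size Calculation

After a convolution, the height of the output feature map is given by the following formula (the width is computed the same way):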

$H_{out} = \left\lfloor \frac{H_{in} + 2 \times padding - dilation \times (kernel\_size - 1) - 1}{stride} \right\rfloor + 1$

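The formula looks complicated, but the intuition is simple: subtract the effective kernel extent, dilation × (kernel_size - 1) + 1, from the padded input size, divide by the stride and round down, then add 1 for the starting position. Padding compensates for the border that the kernel would otherwise "eat".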
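The formula translates directly into a small helper. This is a sketch for one spatial dimension; the function name `conv_output_size` is an assumption, not a PyTorch API.

```python
def conv_output_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension, following the formula above."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (size_in + 2 * padding - effective_kernel) // stride + 1

print(conv_output_size(32, 3))                       # 30
print(conv_output_size(32, 3, padding=1))            # 32
print(conv_output_size(32, 3, stride=2, padding=1))  # 16
print(conv_output_size(32, 3, dilation=2))           # 28
```

Checking a planned architecture with a helper like this before building the layers catches shape mismatches early.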

# Example: computing output sizes
x = torch.randn(1, 3, 32, 32)  # input: one 3-channel 32×32 image

# Case 1: no padding, 3×3 kernel
conv1 = nn.Conv2d(3, 64, kernel_size=3)
out1 = conv1(x)
# output size = floor((32 + 0 - 2 - 1) / 1) + 1 = 30
print(f"no padding: {out1.shape}")  # [1, 64, 30, 30]

# Case 2: padding=1 keeps the size unchanged
conv2 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
out2 = conv2(x)
# output size = floor((32 + 2 - 2 - 1) / 1) + 1 = 32
print(f"padding=1: {out2.shape}")  # [1, 64, 32, 32]

# Case 3: stride=2 halves the size
conv3 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
out3 = conv3(x)
# output size = floor((32 + 2 - 2 - 1) / 2) + 1 = 16
print(f"stride=2: {out3.shape}")  # [1, 64, 16, 16]

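PyTorch also offers a convenient padding='same' option that chooses the padding automatically so that the output size equals the input size. It is only supported with stride=1; a strided convolution with padding='same' raises an error: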

conv = nn.Conv2d(3, 64, kernel_size=3, padding='same')
out = conv(x)
print(f"same padding: {out.shape}")  # [1, 64, 32, 32]

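The Role of Padding

Padding adds pixels around the border of the input. It serves two main purposes:

Controlling the output size: appropriate padding keeps the output feature map the same size as the input.
Preserving edge information: without padding, border pixels take part in fewer windows and are easily underrepresented.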

x = torch.randn(1, 3, 32, 32)

# Comparing padding modes
# zeros: pad with 0 (default)
conv_zeros = nn.Conv2d(3, 64, kernel_size=3, padding=1, padding_mode='zeros')

# reflect: pad by mirroring the pixels next to the border
conv_reflect = nn.Conv2d(3, 64, kernel_size=3, padding=1, padding_mode='reflect')

# replicate: pad by repeating the border pixels
conv_replicate = nn.Conv2d(3, 64, kernel_size=3, padding=1, padding_mode='replicate')
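What each padding mode produces is easiest to see on a single row of pixels. The sketch below is a plain-Python illustration of the four modes on a 1D list; `pad_row` is a hypothetical helper, not part of PyTorch.

```python
def pad_row(row, p, mode="zeros"):
    """Pad a 1D list by p elements on each side (requires 0 < p < len(row))."""
    if mode == "zeros":
        return [0] * p + row + [0] * p
    if mode == "reflect":
        # Mirror around the edge element without repeating it
        return row[1:p + 1][::-1] + row + row[-p - 1:-1][::-1]
    if mode == "replicate":
        # Repeat the edge element
        return [row[0]] * p + row + [row[-1]] * p
    if mode == "circular":
        # Wrap around to the opposite side
        return row[-p:] + row + row[:p]
    raise ValueError(f"unknown mode: {mode}")

row = [1, 2, 3, 4]
for mode in ("zeros", "reflect", "replicate", "circular"):
    print(f"{mode:9s} {pad_row(row, 2, mode)}")
# zeros     [0, 0, 1, 2, 3, 4, 0, 0]
# reflect   [3, 2, 1, 2, 3, 4, 3, 2]
# replicate [1, 1, 1, 2, 3, 4, 4, 4]
# circular  [3, 4, 1, 2, 3, 4, 1, 2]
```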

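The Effect of Stride

The stride controls how far the kernel moves at each step. The larger the stride, the smaller the output feature map. With a stride of 2 the output size is roughly halved (the exact value depends on the kernel size and padding):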

x = torch.randn(1, 3, 32, 32)

# stride=1: dense sampling
conv_s1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
print(f"stride=1: {conv_s1(x).shape}")  # [1, 64, 32, 32]

# stride=2: downsampling
conv_s2 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
print(f"stride=2: {conv_s2(x).shape}")  # [1, 64, 16, 16]

# stride=3: more aggressive downsampling
conv_s3 = nn.Conv2d(3, 64, kernel_size=3, stride=3, padding=1)
print(f"stride=3: {conv_s3(x).shape}")  # [1, 64, 11, 11]

In practice, both stride-1 convolutions followed by a pooling layer and stride-2 convolutions in place of pooling are common design choices.

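Advanced Convolution Types

Beyond standard convolution there are several variants, each with its own strengths.

Dilated (Atrous) Convolution

Dilated convolution inserts gaps between kernel elements, enlarging the receptive field without adding parameters. This is especially useful in semantic segmentation, where a larger receptive field is needed to capture context.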

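# Standard convolution: dilation=1, effective kernel 3×3
conv_std = nn.Conv2d(3, 64, kernel_size=3, dilation=1)
# actual coverage: 3×3

# Dilated convolution: dilation=2, effective kernel 5×5
conv_dil2 = nn.Conv2d(3, 64, kernel_size=3, dilation=2)
# actual coverage: 5×5 (kernel elements are spaced 2 apart)

# Dilated convolution: dilation=3, effective kernel 7×7
conv_dil3 = nn.Conv2d(3, 64, kernel_size=3, dilation=3)
# actual coverage: 7×7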

x = torch.randn(1, 3, 32, 32)
print(f"dilation=1: {conv_std(x).shape}")   # [1, 64, 30, 30]
print(f"dilation=2: {conv_dil2(x).shape}")  # [1, 64, 28, 28]
print(f"dilation=3: {conv_dil3(x).shape}")  # [1, 64, 26, 26]

Effective kernel size (the receptive field of a single dilated layer):

$RF = 1 + (kernel\_size - 1) \times dilation$

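Advantages of dilated convolution:

No additional parameters
No loss of resolution (with stride 1 and matching padding)
The receptive field can grow exponentially with depth when the dilation rate is doubled from layer to layer (for example 1, 2, 4, 8)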
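The effect of stacking dilated layers can be checked with a few lines of arithmetic. The sketch assumes stride-1 layers with the same kernel size; `stacked_receptive_field` is a hypothetical helper name.

```python
def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 conv layers stacked with the given dilation rates."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer adds (k - 1) * dilation
    return rf

print(stacked_receptive_field(3, [1, 1, 1, 1]))  # 9
print(stacked_receptive_field(3, [1, 2, 4, 8]))  # 31
```

Four standard 3×3 layers see 9 pixels per axis, while the same four layers with doubling dilation rates see 31, with an identical number of weights.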

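Grouped Convolution

Grouped convolution splits the input channels into several groups, convolves each group independently, and concatenates the results. When the number of groups equals the number of input channels, each channel is filtered on its own; this is depthwise convolution, the first step of a depthwise separable convolution.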

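# Standard convolution: groups=1, every input channel contributes to every output channel
conv_std = nn.Conv2d(3, 64, kernel_size=3, groups=1)
# weights: 3 × 64 × 3 × 3 = 1728 (bias excluded)

# Grouped convolution: groups=2, channels are split into 2 groups
# both in_channels and out_channels must be divisible by groups
conv_grp2 = nn.Conv2d(4, 64, kernel_size=3, groups=2)
# weights: (4/2) × (64/2) × 3 × 3 × 2 = 1152 (bias excluded; 2304 with groups=1)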

x = torch.randn(1, 4, 32, 32)
print(f"groups=2: {conv_grp2(x).shape}")  # [1, 64, 30, 30]

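Advantages of grouped convolution:

Fewer parameters and less computation
Each group learns a different feature representation
A core building block of architectures such as ResNeXt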
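The parameter counts above follow from the shape of the weight tensor, (out_channels, in_channels / groups, k, k). A minimal sketch, assuming a square kernel; `conv2d_param_count` is a hypothetical helper, not a PyTorch function.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, groups=1, bias=True):
    """Learnable parameters of a 2D convolution with a square kernel."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    weights = (in_channels // groups) * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

print(conv2d_param_count(4, 64, 3, groups=1, bias=False))    # 2304
print(conv2d_param_count(4, 64, 3, groups=2, bias=False))    # 1152
print(conv2d_param_count(64, 64, 3, groups=64, bias=False))  # 576 (depthwise)
```

With g groups the weight count drops by a factor of g, while the bias count stays the same.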

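Depthwise Separable Convolution

Depthwise separable convolution factorizes a standard convolution into two steps: a depthwise convolution (a grouped convolution whose number of groups equals the number of input channels) followed by a 1×1 pointwise convolution. This greatly reduces parameters and computation.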

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: depthwise conv followed by a 1×1 pointwise conv."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # Depthwise convolution: each input channel is convolved on its own
        self.depthwise = nn.Conv2d(
            in_channels, in_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            groups=in_channels,  # key point: groups equals the number of input channels
            bias=False
        )
        # Pointwise convolution: a 1×1 conv mixes information across channels
        self.pointwise = nn.Conv2d(
            in_channels, out_channels,
            kernel_size=1,
            bias=True
        )

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x

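# Compare parameter counts of a standard and a depthwise separable convolution (biases ignored)
in_ch, out_ch = 64, 128

# Standard convolution
std_params = in_ch * out_ch * 3 * 3  # = 73728

# Depthwise separable convolution
dw_params = in_ch * 3 * 3 + in_ch * out_ch * 1 * 1  # = 576 + 8192 = 8768

print(f"standard conv parameters: {std_params}")
print(f"depthwise separable conv parameters: {dw_params}")
print(f"parameter reduction: {(1 - dw_params/std_params)*100:.1f}%")  # about 88%

Depthwise separable convolution is used throughout the MobileNet family and is a core technique of lightweight networks.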

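Transposed Convolution

Transposed convolution, often (somewhat inaccurately) called deconvolution, is used for upsampling: it enlarges a low-resolution feature map to a higher resolution. It is common in semantic segmentation and image generation.

# Upsampling with a transposed convolution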
conv_trans = nn.ConvTranspose2d(
    in_channels=64,
    out_channels=32,
    kernel_size=4,
    stride=2,
    padding=1
)

x = torch.randn(1, 64, 16, 16)
out = conv_trans(x)
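print(f"transposed conv output: {out.shape}")  # [1, 32, 32, 32], spatial size doubled

# Output size formula (with the defaults dilation=1 and output_padding=0):
# H_out = (H_in - 1) × stride - 2 × padding + kernel_size

Note that transposed convolution can produce checkerboard artifacts; in practice it is often replaced by bilinear interpolation followed by a regular convolution.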
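For non-default settings the output size also depends on `dilation` and `output_padding`. The helper below sketches the general formula from the `nn.ConvTranspose2d` documentation for one spatial dimension; the function name is an assumption.

```python
def conv_transpose_output_size(size_in, kernel_size, stride=1, padding=0,
                               output_padding=0, dilation=1):
    """Output length along one spatial dimension of a transposed convolution."""
    return ((size_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

print(conv_transpose_output_size(16, kernel_size=4, stride=2, padding=1))  # 32
print(conv_transpose_output_size(16, kernel_size=3, stride=2, padding=1,
                                 output_padding=1))                        # 32
```

Both settings double a 16-pixel input; the second needs `output_padding=1` because a 3×3 kernel with stride 2 would otherwise produce 31.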

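Pooling Layers

Pooling layers reduce the spatial resolution of feature maps. Pooling itself has no learnable parameters; it lowers the computation (and the parameter count of any fully connected layers that follow) while providing a degree of translation invariance.

Max Pooling and Average Pooling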

x = torch.randn(1, 64, 32, 32)

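# Max pooling: takes the maximum within each window
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
print(f"max pooling: {maxpool(x).shape}")  # [1, 64, 16, 16]

# Average pooling: takes the mean within each window
avgpool = nn.AvgPool2d(kernel_size=2, stride=2)
print(f"average pooling: {avgpool(x).shape}")  # [1, 64, 16, 16]

# Max pooling tends to keep salient features (such as edges)
# Average pooling is smoother and retains background information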
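The difference between the two pooling operations is easiest to see on a tiny feature map. The following plain-Python sketch uses made-up values and a hypothetical helper `pool2d`; it handles only non-overlapping windows.

```python
def pool2d(feature_map, size=2, reduce=max):
    """Non-overlapping size×size pooling on a list-of-lists feature map."""
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

fmap = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 8, 4],
    [2, 0, 4, 0],
]
print(pool2d(fmap, reduce=max))   # [[7, 2], [2, 8]]
print(pool2d(fmap, reduce=mean))  # [[4.0, 1.0], [1.0, 4.0]]
```

Max pooling keeps the strongest response in each window, while average pooling blends all four values.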

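Adaptive Pooling

Adaptive pooling computes the kernel size and stride automatically so that the output has a specified size, which is very convenient when inputs vary in size: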

x = torch.randn(1, 64, 32, 32)

# Adaptive average pooling: output is always 7×7
adaptive_avg = nn.AdaptiveAvgPool2d((7, 7))
print(f"adaptive average pooling: {adaptive_avg(x).shape}")  # [1, 64, 7, 7]

# Global average pooling: 1×1 output, often used in place of fully connected layers
global_avg = nn.AdaptiveAvgPool2d(1)
print(f"global average pooling: {global_avg(x).shape}")  # [1, 64, 1, 1]

# Adaptive max pooling
adaptive_max = nn.AdaptiveMaxPool2d((7, 7))
print(f"adaptive max pooling: {adaptive_max(x).shape}")  # [1, 64, 7, 7]

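Global average pooling is a standard component of modern CNNs: it compresses each channel to a single value that feeds the classifier directly, avoiding the large parameter count of fully connected layers applied to the full feature map.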

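Receptive Field Calculation

The receptive field is the region of the input image that one pixel of an output feature map depends on. Understanding it is essential for designing architectures: if the receptive field is too small, the network may not be able to "see" a complete object.

Receptive Field Formula

For a network of L convolution/pooling layers, the receptive field size is: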

$r_0 = \sum_{l=1}^{L} \left((k_l - 1) \times \prod_{i=1}^{l-1} s_i \right) + 1$

where $k_l$ is the kernel size of layer l and $s_i$ is the stride of layer i.

def calculate_receptive_field(layers):
    """
    Compute the receptive field of a stack of layers.
    layers: list of (kernel_size, stride) tuples, ordered from input to output
    """
    rf = 1
    cumulative_stride = 1
    
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * cumulative_stride
        cumulative_stride *= stride
    
    return rf

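# Example: VGG-style network
# four 3×3 convolutions (stride=1) and two 2×2 poolings (stride=2)
layers = [
    (3, 1),  # conv1: rf = 3
    (3, 1),  # conv2: rf = 5
    (2, 2),  # pool1: rf = 6
    (3, 1),  # conv3: rf = 10
    (3, 1),  # conv4: rf = 14
    (2, 2),  # pool2: rf = 16
]
rf = calculate_receptive_field(layers)
print(f"receptive field: {rf}×{rf}")  # receptive field: 16×16

# Dilated convolution enlarges the receptive field.
# A 3×3 kernel with dilation=2 covers 1 + (3 - 1) × 2 = 5 pixels,
# so it is modeled here by its effective kernel size of 5.
layers_dilated = [
    (5, 1),  # 3×3 conv, dilation=2: rf = 5
    (5, 1),  # 3×3 conv, dilation=2: rf = 9
]
layers_standard = [
    (3, 1),  # standard conv: rf = 3
    (3, 1),  # standard conv: rf = 5
]
print(f"standard conv receptive field: {calculate_receptive_field(layers_standard)}")  # 5
print(f"dilated conv receptive field: {calculate_receptive_field(layers_dilated)}")    # 9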

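Practical Significance of the Receptive Field

In practice, a network should be designed so that its receptive field is large enough to cover the objects to be recognized:

Object detection: the receptive field should be at least as large as the objects to detect
Semantic segmentation: each pixel's prediction needs enough context
Image classification: the receptive field should cover the main subject of the image

Stacking several small kernels (such as 3×3) is more efficient than using one large kernel:

Two 3×3 convolutions have a 5×5 receptive field
Three 3×3 convolutions have a 7×7 receptive field
Yet they use fewer weights per input/output channel pair: 3×3×2=18 vs 5×5=25, and 3×3×3=27 vs 7×7=49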
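With real channel counts the gap becomes substantial. The sketch below scales the per-pair numbers by the channel count; it assumes the same number of input and output channels in every layer and ignores biases, and `stacked_vs_single` is a hypothetical helper.

```python
def stacked_vs_single(channels, num_3x3_layers):
    """Compare n stacked 3×3 convs with one large conv of equal receptive field.

    Assumes `channels` input and output channels in every layer, and no bias.
    """
    large_kernel = 2 * num_3x3_layers + 1          # 2 layers -> 5, 3 layers -> 7
    stacked = num_3x3_layers * 3 * 3 * channels * channels
    single = large_kernel * large_kernel * channels * channels
    return large_kernel, stacked, single

for n in (2, 3):
    k, stacked, single = stacked_vs_single(channels=64, num_3x3_layers=n)
    print(f"{n} x 3x3: {stacked:,} weights vs one {k}x{k}: {single:,} weights")
# 2 x 3x3: 73,728 weights vs one 5x5: 102,400 weights
# 3 x 3x3: 110,592 weights vs one 7x7: 200,704 weights
```

The stacked version also applies a nonlinearity after every layer, which a single large kernel cannot do.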

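Implementing Classic CNN Architectures

LeNet-5

LeNet-5 is one of the earliest convolutional neural networks, proposed by Yann LeCun and colleagues in 1998 for handwritten digit recognition. Its structure is simple, but it already contains the core components of a CNN. The implementation below is a common modern adaptation: it takes 28×28 MNIST images and uses ReLU with max pooling.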

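class LeNet5(nn.Module):
    """LeNet-5: a classic CNN architecture

    Structure:
    - two convolutional layers + two pooling layers
    - three fully connected layers
    - the original uses sigmoid/tanh activations; modern implementations often use ReLU
    """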
    def __init__(self, num_classes=10):
        super(LeNet5, self).__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)   # 28×28 -> 24×24
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)  # 12×12 -> 8×8

        # Pooling layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Fully connected layers
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)
        
        self.relu = nn.ReLU()
    
    def forward(self, x):
        # Conv block 1: conv -> relu -> pool
        x = self.pool(self.relu(self.conv1(x)))  # 28->24->12

        # Conv block 2: conv -> relu -> pool
        x = self.pool(self.relu(self.conv2(x)))  # 12->8->4

        # Flatten
        x = x.view(x.size(0), -1)

        # Fully connected layers
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        
        return x

# Test
model = LeNet5(num_classes=10)
x = torch.randn(1, 1, 28, 28)  # MNIST image size
output = model(x)
print(f"LeNet-5 output: {output.shape}")  # [1, 10]

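VGGNet

The core idea of VGGNet is to replace large kernels with stacks of small 3×3 kernels, which keeps the same receptive field while reducing parameters and adding more nonlinearities.

class VGGBlock(nn.Module):
    """VGG block: several 3×3 convolutions followed by pooling"""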
    def __init__(self, in_channels, out_channels, num_convs):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers.append(nn.Conv2d(
                in_channels if i == 0 else out_channels,
                out_channels,
                kernel_size=3, padding=1
            ))
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        self.block = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.block(x)

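class VGG16(nn.Module):
    """VGG-16: 16 weight layers (13 convolutional + 3 fully connected)

    Structure:
    - 3×3 kernels throughout
    - the channel count doubles after each pooling layer until it reaches 512
    - about 138 million parameters (with 1000 classes)
    """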
    def __init__(self, num_classes=1000):
        super().__init__()
        
        # Feature extractor
        self.features = nn.Sequential(
            VGGBlock(3, 64, num_convs=2),      # 224->112
            VGGBlock(64, 128, num_convs=2),    # 112->56
            VGGBlock(128, 256, num_convs=3),   # 56->28
            VGGBlock(256, 512, num_convs=3),   # 28->14
            VGGBlock(512, 512, num_convs=3),   # 14->7
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

# Test
model = VGG16(num_classes=10)
x = torch.randn(1, 3, 224, 224)
output = model(x)
print(f"VGG-16 output: {output.shape}")  # [1, 10]

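ResNet: Residual Networks

ResNet is a milestone in the history of deep learning. Before ResNet, it was hard to train networks much deeper than about 20 layers, because deep networks suffer from vanishing gradients and from the degradation problem (accuracy gets worse as layers are added). ResNet addresses this with residual (skip) connections, making it possible to train networks with hundreds of layers.

How Residual Connections Work

The core idea is this: instead of having the network learn the target mapping H(x) directly, let it learn the residual mapping F(x) = H(x) - x, so that H(x) = F(x) + x.

The benefit: if the identity mapping is optimal, the residual network only has to learn F(x) = 0, which is much easier than learning H(x) = x from scratch. The skip connection also gives gradients a direct "highway" back to earlier layers, which mitigates vanishing gradients.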
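The gradient "highway" can be illustrated with a toy scalar model. This is a deliberately simplified sketch, not a description of real network dynamics: each layer is assumed to have the same small local derivative, and `gradient_through_stack` is a hypothetical helper.

```python
def gradient_through_stack(num_layers, local_slope, residual):
    """d(output)/d(input) for a stack of identical scalar layers.

    Plain layer:    y = f(x),     with f'(x) = local_slope
    Residual layer: y = x + f(x), so dy/dx = 1 + local_slope
    """
    per_layer = (1.0 + local_slope) if residual else local_slope
    return per_layer ** num_layers

plain = gradient_through_stack(20, local_slope=0.01, residual=False)
resid = gradient_through_stack(20, local_slope=0.01, residual=True)
print(f"plain:    {plain:.3e}")
print(f"residual: {resid:.3f}")
```

In the plain stack the gradient is a product of small factors and collapses toward zero; with the identity term each factor stays near 1, so the gradient survives all 20 layers.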

class BasicBlock(nn.Module):
    """ResNet basic block, used in ResNet-18 and ResNet-34

    Structure: two 3×3 convolutions + residual connection
    """
    expansion = 1  # ratio of the block's output channels to out_channels
    
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(
            in_channels, out_channels,
            kernel_size=3, stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels,
            kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample
        self.relu = nn.ReLU(inplace=True)
    
    def forward(self, x):
        identity = x
        
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        
        # If the dimensions do not match, adjust the shortcut path
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # residual connection
        out = self.relu(out)
        
        return out


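class Bottleneck(nn.Module):
    """ResNet bottleneck block, used in ResNet-50/101/152

    Structure: 1×1 reduce -> 3×3 conv -> 1×1 expand
    Benefit: fewer parameters and less computation
    """
    expansion = 4  # output channels are 4× the intermediate channels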
    
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        # 1×1 conv: reduce channels
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        # 3×3 conv
        self.conv2 = nn.Conv2d(
            out_channels, out_channels,
            kernel_size=3, stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1×1 conv: expand channels
        self.conv3 = nn.Conv2d(
            out_channels, out_channels * self.expansion,
            kernel_size=1, bias=False
        )
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.downsample = downsample
        self.relu = nn.ReLU(inplace=True)
    
    def forward(self, x):
        identity = x
        
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        
        if self.downsample is not None:
            identity = self.downsample(x)
        
        out += identity
        out = self.relu(out)
        
        return out


class ResNet(nn.Module):
    """Generic ResNet implementation

    Args:
        block: BasicBlock or Bottleneck
        layers: number of blocks per stage, e.g. [2, 2, 2, 2] for ResNet-18
        num_classes: number of output classes
    """
    def __init__(self, block, layers, num_classes=1000):
        super().__init__()
        self.in_channels = 64
        
        # Stem: initial convolution
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        
        # Four stages of residual blocks
        self.layer1 = self._make_layer(block, 64, layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        
        # Classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)
    
    def _make_layer(self, block, out_channels, num_blocks, stride):
        """Build all residual blocks of one stage."""
        downsample = None

        # Create a downsampling shortcut if the stride or the channel count changes
        if stride != 1 or self.in_channels != out_channels * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(
                    self.in_channels, out_channels * block.expansion,
                    kernel_size=1, stride=stride, bias=False
                ),
                nn.BatchNorm2d(out_channels * block.expansion)
            )
        
        layers = []
        # The first block may need downsampling
        layers.append(block(self.in_channels, out_channels, stride, downsample))

        # Subsequent blocks take out_channels * expansion input channels
        self.in_channels = out_channels * block.expansion
        for _ in range(1, num_blocks):
            layers.append(block(self.in_channels, out_channels))
        
        return nn.Sequential(*layers)
    
    def forward(self, x):
        # Stem
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)
        
        # Four stages
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        
        # Classification head
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        
        return x


def resnet18(num_classes=1000):
    """ResNet-18: 18 layers"""
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

def resnet34(num_classes=1000):
    """ResNet-34: 34 layers"""
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes)

def resnet50(num_classes=1000):
    """ResNet-50: 50 layers"""
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)

def resnet101(num_classes=1000):
    """ResNet-101: 101 layers"""
    return ResNet(Bottleneck, [3, 4, 23, 3], num_classes)

def resnet152(num_classes=1000):
    """ResNet-152: 152 layers"""
    return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)

# Test ResNets of different depths
for name, model_fn in [('ResNet-18', resnet18), ('ResNet-50', resnet50)]:
    model = model_fn(num_classes=10)
    x = torch.randn(1, 3, 224, 224)
    output = model(x)
    num_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: output {output.shape}, parameters {num_params/1e6:.1f}M")

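Using Pretrained Models

In practice, one usually starts from a model pretrained on ImageNet and fine-tunes it:

import torchvision.models as models

# Load a pretrained ResNet-50
# (torchvision >= 0.13 uses the weights argument; older versions use pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the feature extractor (optional)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer; the new layer is trainable by default
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)  # 10-class classification

# Train only the classifier
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)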

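Feature Map Visualization

An important way to understand a CNN is to visualize the feature maps of its layers and see what the network has learned.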

def visualize_feature_maps(model, image, layer_names):
    """Visualize the feature maps of the given layers.

    Args:
        model: CNN model
        image: input image tensor [1, C, H, W]
        layer_names: names of the layers to visualize
    """
    import matplotlib.pyplot as plt
    
    activations = {}
    
    def get_activation(name):
        def hook(model, input, output):
            activations[name] = output.detach()
        return hook
    
    # Register hooks
    handles = []
    for name, module in model.named_modules():
        if name in layer_names:
            handle = module.register_forward_hook(get_activation(name))
            handles.append(handle)

    # Forward pass
    with torch.no_grad():
        model(image)

    # Remove hooks
    for handle in handles:
        handle.remove()

    # Plot
    for name in layer_names:
        act = activations[name]
        num_channels = min(act.shape[1], 16)  # show at most 16 channels

        fig, axes = plt.subplots(4, 4, figsize=(12, 12))
        fig.suptitle(f'Feature Maps: {name}', fontsize=16)

        for ax in axes.flat:
            ax.axis('off')  # also hides unused cells when a layer has fewer than 16 channels

        for i in range(num_channels):
            ax = axes[i // 4, i % 4]
            ax.imshow(act[0, i].cpu().numpy(), cmap='viridis')
            ax.set_title(f'Channel {i}')

        plt.tight_layout()
        plt.show()

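# Usage example
# model = LeNet5()
# image = torch.randn(1, 1, 28, 28)
# visualize_feature_maps(model, image, ['conv1', 'conv2'])

Feature maps in shallow layers typically respond to low-level features such as edges and colors, while those in deeper layers are more abstract and task-specific.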

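Grad-CAM: CNN Interpretability

Grad-CAM (Gradient-weighted Class Activation Mapping) visualizes the evidence behind a CNN's decision: it uses gradient information to produce a heatmap showing which image regions matter most for the prediction.

How Grad-CAM Works

The core formula of Grad-CAM is: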

$L_{Grad-CAM}^c = ReLU\left( \sum_{k} \alpha_k^c A^k \right)$

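where:

$A^k$ is the k-th feature map of the last convolutional layer
$\alpha_k^c$ is the importance weight of feature map k for class c
$\alpha_k^c$ is obtained by global average pooling of the gradient of the class score $y^c$ with respect to $A^k$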
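The formula can be traced by hand on a tiny example. The sketch below works on nested lists instead of tensors; the two 2×2 feature maps and their gradients are made-up numbers, and `grad_cam_map` is a hypothetical helper.

```python
def grad_cam_map(activations, gradients):
    """Grad-CAM on nested lists shaped [K][H][W] (feature maps and their gradients)."""
    num_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    # alpha_k: global average of the gradient over each feature map
    alphas = [sum(sum(r) for r in gradients[k]) / (height * width)
              for k in range(num_maps)]
    # Weighted sum over feature maps, followed by ReLU
    return [[max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(num_maps)))
             for j in range(width)]
            for i in range(height)]

# Two hypothetical 2×2 feature maps and gradients (made-up numbers)
A = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 3.0], [1.0, 0.0]]]
dY_dA = [[[0.5, 0.5], [0.5, 0.5]],      # alpha_0 = 0.5
         [[-1.0, -1.0], [-1.0, -1.0]]]  # alpha_1 = -1.0
print(grad_cam_map(A, dY_dA))  # [[0.5, 0.0], [0.0, 1.0]]
```

Locations dominated by the negatively weighted map are clipped to zero by the ReLU, so only regions that support the class remain in the heatmap.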

Grad-CAM Implementation

import torch.nn.functional as F

class GradCAM:
    """Grad-CAM implementation.

    Uses PyTorch hooks to capture the feature maps and their gradients.
    """
    def __init__(self, model, target_layer):
        self.model = model
        self.target_layer = target_layer
        self.gradients = None
        self.activations = None

        # Register hooks
        target_layer.register_forward_hook(self.save_activation)
        target_layer.register_full_backward_hook(self.save_gradient)

    def save_activation(self, module, input, output):
        """Store the feature maps from the forward pass."""
        self.activations = output.detach()

    def save_gradient(self, module, grad_input, grad_output):
        """Store the gradients from the backward pass."""
        self.gradients = grad_output[0].detach()

    def __call__(self, x, class_idx=None):
        """
        Args:
            x: input image [1, C, H, W]
            class_idx: target class index; if None, the predicted class is used
        Returns:
            cam: heatmap [H, W]
        """
        # Forward pass
        output = self.model(x)

        if class_idx is None:
            class_idx = output.argmax(dim=1).item()

        # Backward pass for the target class only
        self.model.zero_grad()
        output[0, class_idx].backward(retain_graph=True)

        # Weights: global average pooling of the gradients
        # gradients shape: [1, C, H, W]
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)  # [1, C, 1, 1]

        # Weighted sum over channels
        cam = (weights * self.activations).sum(dim=1, keepdim=True)  # [1, 1, H, W]
        cam = F.relu(cam)  # keep only positive contributions

        # Normalize to [0, 1]; the epsilon avoids division by zero for an all-zero map
        cam = cam - cam.min()
        cam = cam / (cam.max() + 1e-8)

        # Upsample to the input size
        cam = F.interpolate(cam, size=x.shape[2:], mode='bilinear', align_corners=False)

        return cam.squeeze().cpu().numpy()

# Usage example
def apply_gradcam(model, image, target_class=None):
    """Apply Grad-CAM and plot the result.

    Args:
        model: CNN model
        image: input image tensor [C, H, W]
        target_class: target class index (None uses the predicted class)
    """
    import matplotlib.pyplot as plt

    model.eval()

    # Find the target layer (usually the last convolutional layer)
    target_layer = None
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            target_layer = module

    if target_layer is None:
        print("No convolutional layer found")
        return

    # Create Grad-CAM
    gradcam = GradCAM(model, target_layer)

    # Add the batch dimension
    image_batch = image.unsqueeze(0)
    if image_batch.dim() == 3:
        image_batch = image_batch.unsqueeze(0)

    # Generate the heatmap
    cam = gradcam(image_batch, target_class)
    
    # Plot
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Original image
    img_np = image.cpu().numpy().transpose(1, 2, 0)
    if img_np.shape[2] == 1:
        img_np = img_np.squeeze(-1)
        axes[0].imshow(img_np, cmap='gray')
    else:
        axes[0].imshow(img_np)
    axes[0].set_title('Original Image')
    axes[0].axis('off')

    # Heatmap
    axes[1].imshow(cam, cmap='jet')
    axes[1].set_title('Grad-CAM Heatmap')
    axes[1].axis('off')

    # Overlay
    if img_np.max() > 1:
        img_np = img_np / 255.0
    axes[2].imshow(img_np)
    axes[2].imshow(cam, cmap='jet', alpha=0.5)
    axes[2].set_title('Overlay')
    axes[2].axis('off')
    
    plt.tight_layout()
    plt.show()

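Grad-CAM helps reveal what a prediction is based on. For example, if the model classifies a picture of a dog as "dog", Grad-CAM should highlight the dog rather than the background. If the highlighted region is wrong, the model may have learned spurious features and needs further adjustment.

Training a CNN: The Complete Workflow

The following example uses the CIFAR-10 dataset to walk through the full CNN training workflow: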

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# 1. Define the model
class SimpleCNN(nn.Module):
    """A CNN for CIFAR-10"""
    def __init__(self, num_classes=10):
        super().__init__()
        
        self.features = nn.Sequential(
            # Block 1: 32×32 -> 16×16
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),
            
            # Block 2: 16×16 -> 8×8
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),
            
            # Block 3: 8×8 -> 4×4
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),
        )
        
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# 2. Data preparation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
])

train_dataset = datasets.CIFAR10('./data', train=True, download=True, transform=train_transform)
test_dataset = datasets.CIFAR10('./data', train=False, download=True, transform=test_transform)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=2)

# 3. Training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# 4. Training and evaluation functions
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    
    return total_loss / len(loader), 100. * correct / total

def evaluate(model, loader, device):
    model.eval()
    correct = 0
    total = 0
    
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    return 100. * correct / total

# 5. Training loop
num_epochs = 50
best_acc = 0

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_acc = evaluate(model, test_loader, device)
    scheduler.step()

    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"  Test Acc: {test_acc:.2f}%")

    if test_acc > best_acc:
        best_acc = test_acc
        torch.save(model.state_dict(), 'best_model.pth')

print(f"\nBest test accuracy: {best_acc:.2f}%")

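CNN Design Tips and Best Practices

Network Design Principles

Stack small kernels: several 3×3 convolutions are more efficient than one large kernel
Increase channels with depth: as the spatial resolution shrinks, the number of channels should grow
Use batch normalization: it speeds up training and improves stability
Use residual connections: standard in deep networks
Apply Dropout where appropriate: it helps prevent overfitting

Data Augmentation

Data augmentation effectively enlarges the training set and improves generalization: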

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),           # random crop, resized to 224×224
    transforms.RandomHorizontalFlip(),           # random horizontal flip
    transforms.RandomRotation(15),               # random rotation within ±15 degrees
    transforms.ColorJitter(                      # color jitter
        brightness=0.2, contrast=0.2, 
        saturation=0.2, hue=0.1
    ),
    transforms.RandomGrayscale(p=0.1),           # random grayscale conversion
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                        std=[0.229, 0.224, 0.225])
])

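Learning Rate Schedules

A sensible learning rate schedule is crucial for training. The schedulers below are alternatives; pick one:

# Cosine annealing
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Cosine annealing with warm restarts
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

# ReduceLROnPlateau: lowers the learning rate when the validation loss plateaus
# (call scheduler.step(val_loss) with the monitored metric)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

# OneCycleLR: the learning rate rises and then falls over the run
# (call scheduler.step() after every batch; total_steps is the total number of batches)
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=1000)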
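To see what cosine annealing does to the learning rate, the closed-form schedule given in the `CosineAnnealingLR` documentation can be evaluated directly. This is a sketch without restarts; the function name `cosine_annealing_lr` is an assumption.

```python
import math

def cosine_annealing_lr(epoch, base_lr, t_max, eta_min=0.0):
    """Learning rate after `epoch` scheduler steps (no restarts)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 25, 50):
    print(epoch, f"{cosine_annealing_lr(epoch, base_lr=0.001, t_max=50):.6f}")
# 0 0.001000
# 25 0.000500
# 50 0.000000
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches `eta_min` at `t_max`, decaying slowly at both ends and fastest in the middle.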

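Next Steps

Once you are comfortable with the basics of CNNs, you can move on to:

Saving and loading models
Transfer learning
Object detection and semantic segmentation
Generative adversarial networks (GANs)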
