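HTTP Basics

Understanding HTTP is the foundation for writing a good web crawler. This chapter introduces the core HTTP concepts so that you can reason clearly about network requests and responses.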

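What Is HTTP?

HTTP (HyperText Transfer Protocol) is the protocol that clients and servers use to communicate. When you open a web page in a browser, the browser sends an HTTP request to the server, and the server returns an HTTP response.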

URL Explained

A URL (Uniform Resource Locator) is the address of a web resource. A complete URL consists of several parts:

https://www.example.com:443/path/to/page?id=123#section
│       │              │   │              │         │
│       │              │   │              │         └─ Fragment (not sent to the server)
│       │              │   │              └─────────── Query string
│       │              │   └────────────────────────── Path
│       │              └────────────────────────────── Port
│       └──────────────────────────────────────────── Host (domain name)
└──────────────────────────────────────────────────── Scheme (protocol)

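The parts of a URL:

Part	Description	Example
Scheme	Communication protocol	http, https, ftp
Host	Server address	www.example.com
Port	Service port (optional; defaults to 80 for http and 443 for https)	:443, :8080
Path	Location of the resource	/path/to/page
Query string	Parameters passed to the server	?id=123&page=1
Fragment	Position within the page	#section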
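
The breakdown above can be reproduced with Python's standard library. The following is a minimal sketch that splits the sample URL with `urllib.parse`; the variable names are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.example.com:443/path/to/page?id=123#section"
parts = urlsplit(url)

print(parts.scheme)    # scheme (protocol)
print(parts.hostname)  # host
print(parts.port)      # port as an integer, or None when absent
print(parts.path)      # path
print(parts.query)     # raw query string, without the leading "?"
print(parts.fragment)  # fragment, without the leading "#"

# parse_qs maps each parameter name to a list of values,
# because a name may appear more than once in a query string.
params = parse_qs(parts.query)
print(params)
```

The fragment is available to the client only; an HTTP library does not send it to the server.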

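HTTP Requests

Request Methods

HTTP defines several request methods; GET and POST are by far the most common:

Method	Description	Typical use
GET	Retrieve a resource	Browsing pages, fetching data
POST	Submit data	Logging in, submitting forms
PUT	Replace or update a resource	Updating data
DELETE	Delete a resource	Removing data
HEAD	Retrieve only the response headers	Checking whether a resource exists

GET Requests

A GET request carries its parameters in the URL and is used to retrieve data: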

GET /search?q=python&page=1 HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0
Accept: text/html

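Characteristics:

Parameters are visible in the URL
Well suited to queries
Length is limited in practice: the HTTP specification sets no maximum, but browsers and servers do, and about 2,000 characters is a commonly cited safe limit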
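
To make the "parameters in the URL" point concrete, here is a small sketch that builds a GET URL with the standard library. The search endpoint and parameter values are placeholders, and `MAX_URL_LENGTH` is an assumed threshold rather than a protocol constant.

```python
from urllib.parse import urlencode

base_url = "https://www.example.com/search"
params = {"q": "python café", "page": 1}

# urlencode percent-encodes each value (as UTF-8) and joins the pairs with "&"
query = urlencode(params)
full_url = f"{base_url}?{query}"
print(full_url)

# A rough guard against the practical URL length limit
MAX_URL_LENGTH = 2000  # assumed threshold, not part of HTTP
if len(full_url) > MAX_URL_LENGTH:
    print("URL is long; consider sending the data in a POST body instead")
```

Libraries such as requests perform the same encoding when you pass a `params` dictionary, so you rarely need to build the query string by hand.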

POST Requests

A POST request carries its data in the request body and is used to submit data:

POST /login HTTP/1.1
Host: www.example.com
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0

username=admin&password=123456

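Characteristics:

Data travels in the request body, so it stays out of the URL, the browser history, and most access logs; it is still sent in plain text unless the connection uses HTTPS
No size limit in the protocol itself, although servers usually cap the body size
Well suited to form submissions and file uploads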
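
The following sketch assembles the same kind of POST message by hand, to show how the form body and the `Content-Length` header relate. It only builds the bytes and sends nothing; the login path and credentials are the sample values used in this section.

```python
from urllib.parse import urlencode

form = {"username": "admin", "password": "123456"}

# The body uses the same key=value&key=value encoding as a query string
body = urlencode(form).encode("ascii")

request_lines = [
    "POST /login HTTP/1.1",
    "Host: www.example.com",
    "Content-Type: application/x-www-form-urlencoded",
    f"Content-Length: {len(body)}",  # length of the body in bytes
    "",
]
# Headers end with an empty line (CRLF CRLF); the body follows immediately
raw_request = "\r\n".join(request_lines).encode("ascii") + b"\r\n" + body

print(raw_request.decode("ascii"))
```

In real code an HTTP library computes `Content-Length` for you; the point here is only that the body is plain bytes appended after the headers.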

Request Headers in Detail

Request headers carry metadata about the request and are an essential part of every HTTP request. In crawler development, setting them correctly lets your requests resemble those of a real browser and get past simple anti-crawler checks.

Common Request Headers Example

# Common request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Referer': 'https://www.example.com/',  # page the request came from
    'Cookie': 'session_id=abc123',          # cookie data
}

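Core Request Headers

Header	Description	Example value
User-Agent	Identifies the client software; servers use it to infer where a request comes from	`Mozilla/5.0 (Windows NT 10.0; Win64; x64)...`
Accept	Media types the client can handle in the response	`text/html,application/xhtml+xml,application/xml`
Accept-Language	Languages the client prefers	`zh-CN,zh;q=0.9,en;q=0.8`
Accept-Encoding	Content encodings the client can decode	`gzip, deflate, br`
Accept-Charset	Character sets the client accepts (deprecated; modern browsers no longer send it)	`utf-8, iso-8859-1`
Connection	How the connection is managed	`keep-alive` or `close`
Referer	URL of the page that issued the request	`https://www.google.com/`
Cookie	Cookies sent with the request	`session_id=abc123; user_token=xyz`
Host	Host name and port of the target server	`www.example.com:8080`
Content-Type	Media type of the request body (used with POST/PUT)	`application/json` or `application/x-www-form-urlencoded`
Content-Length	Length of the request body in bytes	`1024`
Authorization	Authentication credentials	`Bearer eyJhbGciOiJIUzI1NiIs...`
Cache-Control	Caching directives	`no-cache`
If-Modified-Since	Conditional request: the resource is returned only if it changed after the given time	`Mon, 01 Jan 2024 00:00:00 GMT`
If-None-Match	Conditional request used together with ETag	`"abc123"`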

User-Agent in Detail

User-Agent is one of the most important request headers; servers commonly use it to identify the type of client:

# User-Agent strings for common browsers
USER_AGENTS = {
    # Chrome
    'chrome_windows': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'chrome_mac': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'chrome_linux': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    
    # Firefox
    'firefox_windows': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    'firefox_mac': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
    
    # Safari
    'safari_mac': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
    
    # Edge
    'edge': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
    
    # Mobile
    'iphone': 'Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1',
    'android': 'Mozilla/5.0 (Linux; Android 13; SM-S918B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.43 Mobile Safari/537.36',
    
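    # Search-engine crawlers (for reference only: sending these from your own crawler impersonates them)
    'googlebot': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'bingbot': 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
}

# Usage example: pick a random browser User-Agent, skipping the crawler entries
import random

browser_agents = [ua for name, ua in USER_AGENTS.items() if not name.endswith('bot')]
headers = {
    'User-Agent': random.choice(browser_agents)
}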

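Structure of a User-Agent string:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

Token	Meaning
Mozilla/5.0	Compatibility token, kept for historical reasons
(Windows NT 10.0; Win64; x64)	Operating system and platform
AppleWebKit/537.36	Rendering engine and its version
(KHTML, like Gecko)	Engine compatibility note
Chrome/120.0.0.0	Browser name and version
Safari/537.36	Compatibility token, kept for sites that look for Safari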
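
As an illustration of that structure, the sketch below pulls the product tokens and the platform comment out of a User-Agent string with regular expressions. It is a simplification that assumes the common browser layout; real-world strings vary, and dedicated parsing libraries cope with far more cases.

```python
import re

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

# The first parenthesised comment holds the operating system and platform
platform = re.search(r"\(([^)]*)\)", ua).group(1)

# Remove all parenthesised comments, then read each name/version token
without_comments = re.sub(r"\([^)]*\)", "", ua)
products = dict(re.findall(r"(\S+)/(\S+)", without_comments))

print(platform)
print(products)
```

The `Chrome` entry is the one that identifies the actual browser here; `Mozilla` and `Safari` are present only for compatibility.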

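Accept Headers

The Accept family of headers drives content negotiation by telling the server what the client prefers:

# Accept - media types
# Format: type/subtype;q=quality factor
# q ranges from 0 to 1 and defaults to 1; a higher value means a stronger preference
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'

# Breakdown:
# text/html                          - HTML, q=1 (most preferred)
# application/xhtml+xml              - XHTML, q=1 (same preference as HTML)
# application/xml;q=0.9              - XML, preference 0.9
# image/webp                         - WebP images, q=1
# */*;q=0.8                          - anything else, preference 0.8

# Accept-Language - language preference
headers['Accept-Language'] = 'zh-CN,zh;q=0.9,en;q=0.8,en-US;q=0.7'

# Breakdown:
# zh-CN       - Simplified Chinese (most preferred)
# zh;q=0.9    - any Chinese (preference 0.9)
# en;q=0.8    - English (preference 0.8)
# en-US;q=0.7 - US English (preference 0.7)

# Accept-Encoding - content encoding
headers['Accept-Encoding'] = 'gzip, deflate, br'

# Common encodings:
# gzip     - gzip compression
# deflate  - zlib compression
# br       - Brotli (newer, usually a better compression ratio)
# identity - no compression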
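
To see how a server might rank these preferences, here is a minimal sketch of a q-value parser. The function name `parse_accept` is hypothetical, and the parser ignores parameters other than `q`; it is meant to illustrate the rule, not to be a complete implementation of content negotiation.

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Return (value, q) pairs sorted from most to least preferred."""
    items = []
    for part in header.split(","):
        value, *params = [p.strip() for p in part.split(";")]
        q = 1.0  # q defaults to 1 when it is omitted
        for param in params:
            name, _, raw = param.partition("=")
            if name.strip() == "q":
                q = float(raw)
        items.append((value, q))
    # sorted() is stable, so entries with equal q keep their original order
    return sorted(items, key=lambda item: item[1], reverse=True)


accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
for media_type, q in parse_accept(accept):
    print(f"{q:.1f}  {media_type}")
```

The same function works for `Accept-Language` and `Accept-Encoding`, since they share the comma-separated, `;q=` syntax.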

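Referer and Origin

These two headers indicate where a request comes from:

# Referer (the misspelling is not a typo here: the standard itself spells it this way)
# The page from which the current request was issued
headers['Referer'] = 'https://www.google.com/search?q=python'

# Origin
# Used in CORS cross-origin requests; contains only scheme + host + port
headers['Origin'] = 'https://example.com'

# Common uses in crawlers:
# 1. Appear to arrive from a search engine
headers['Referer'] = 'https://www.google.com/'

# 2. Appear to navigate from another page on the same site
headers['Referer'] = 'https://example.com/index.html'

# 3. Set Origin for API requests
headers['Origin'] = 'https://example.com'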
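
Because Origin is just the scheme, host, and port of a URL, it can be derived from the page URL. The helper below is a hypothetical sketch (`origin_of` is not part of any library) that drops the port when it is the default for the scheme, which is how browsers serialize an origin.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> str:
    """Return scheme://host[:port] for a URL, omitting default ports."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.hostname}"
    if parts.port and parts.port != DEFAULT_PORTS.get(parts.scheme):
        origin += f":{parts.port}"
    return origin


page = "https://example.com/index.html?from=home"
headers = {
    "Referer": page,            # full URL of the referring page
    "Origin": origin_of(page),  # scheme + host (+ non-default port) only
}
print(headers)
```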

Content-Type in Detail

Content-Type declares the format of the request body and matters most in POST and PUT requests:

import requests
import json

# 1. application/x-www-form-urlencoded
# Default format for HTML forms: key-value pairs
data = {'username': 'admin', 'password': '123456'}
response = requests.post(
    'https://example.com/login',
    data=data,  # requests sets the Content-Type automatically
)

# 2. application/json
# JSON, common for APIs
response = requests.post(
    'https://example.com/api/users',
    json=data,  # the json parameter sets Content-Type: application/json automatically
)

# Or set it manually
response = requests.post(
    'https://example.com/api/users',
    data=json.dumps(data),
    headers={'Content-Type': 'application/json'}
)

# 3. multipart/form-data
# Used for file uploads
with open('document.pdf', 'rb') as f:  # the with block closes the file afterwards
    response = requests.post(
        'https://example.com/upload',
        files={'file': f},  # requests sets the correct Content-Type and boundary
    )

# 4. text/plain
# Plain text
response = requests.post(
    'https://example.com/api',
    data='plain text content',
    headers={'Content-Type': 'text/plain'}
)

# 5. application/xml
# XML
xml_data = '<?xml version="1.0"?><root><name>test</name></root>'
response = requests.post(
    'https://example.com/api',
    data=xml_data,
    headers={'Content-Type': 'application/xml'}
)

The Authorization Header

import requests
import base64

# 1. Basic authentication
credentials = base64.b64encode(b'username:password').decode()
headers = {
    'Authorization': f'Basic {credentials}'
}

# Or use the auth parameter of requests
response = requests.get('https://example.com/api', auth=('username', 'password'))

# 2. Bearer token (OAuth 2.0)
token = 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'
headers = {
    'Authorization': f'Bearer {token}'
}

# 3. API key
headers = {
    'X-API-Key': 'your-api-key-here'
}

# Or place it in Authorization (the scheme name is defined by each API, not by a standard)
headers = {
    'Authorization': 'ApiKey your-api-key-here'
}
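
One point worth making explicit is that Basic authentication is an encoding, not encryption: anyone who sees the header can recover the password. The sketch below demonstrates the round trip with two hypothetical helpers, `basic_auth_header` and `decode_basic_auth`.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an Authorization header for Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header: str) -> tuple[str, str]:
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, token = header.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


header = basic_auth_header("username", "password")
print(header)
print(decode_basic_auth(header))
```

For this reason Basic authentication should only be sent over HTTPS.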

Custom Request Headers

Many sites use custom request headers to carry site-specific information:

# Common custom request headers
headers = {
    # API-related
    'X-Requested-With': 'XMLHttpRequest',  # marks an AJAX request
    'X-CSRF-Token': 'token-value',          # CSRF protection token
    'X-API-Version': 'v1',                  # API version

    # Client information
    'X-Client-ID': 'mobile-app',
    'X-Device-ID': 'device-uuid',
    'X-Platform': 'android',

    # Authentication
    'X-Auth-Token': 'auth-token-value',
    'X-Session-ID': 'session-id',

    # Content negotiation
    'X-Format': 'json',
}

Complete Request Header Example

import requests
import random

def create_headers(referer=None):
    """Build a complete, browser-like set of request headers."""
    
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    ]
    
    headers = {
        'User-Agent': random.choice(user_agents),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
    }
    
    if referer:
        headers['Referer'] = referer
    
    return headers

# Usage example
session = requests.Session()
session.headers.update(create_headers('https://www.google.com/'))
response = session.get('https://example.com')

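HTTP Responses

Status Codes

Every HTTP response carries a status code that indicates the outcome of the request:

Common status codes:

Code	Meaning	Description
200	OK	Request succeeded
201	Created	Resource created
301	Moved Permanently	Permanent redirect
302	Found	Temporary redirect
304	Not Modified	Resource unchanged; use the cached copy
400	Bad Request	Malformed request
401	Unauthorized	Authentication required
403	Forbidden	Access denied
404	Not Found	Resource does not exist
429	Too Many Requests	Rate limit exceeded
500	Internal Server Error	Error inside the server
502	Bad Gateway	Invalid response from an upstream server
503	Service Unavailable	Service temporarily unavailable
504	Gateway Timeout	Upstream server timed out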
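
A crawler usually turns these codes into a decision about what to do next. The sketch below shows one possible policy; the function name `next_action`, the action labels, and the choice of which codes count as retryable are assumptions for illustration, not a standard.

```python
from http import HTTPStatus

# Codes that often succeed if the request is repeated after a pause (an assumed policy)
RETRYABLE = {429, 500, 502, 503, 504}


def next_action(status_code: int) -> str:
    """Map a status code to a crawler action."""
    if 200 <= status_code < 300:
        return "parse"
    if status_code in (301, 302):
        return "follow-redirect"
    if status_code == 304:
        return "use-cache"
    if status_code in RETRYABLE:
        return "retry-later"
    return "give-up"


for code in (200, 302, 404, 429, 503):
    print(code, HTTPStatus(code).phrase, "->", next_action(code))
```

For 429 and 503 responses, servers may also send a `Retry-After` header that tells the client how long to wait.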

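Response Headers

Response headers carry metadata about the response:

# Typical response headers (illustrative values, as exposed by response.headers)
response_headers = {
    'Content-Type': 'text/html; charset=utf-8',  # media type and charset
    'Content-Length': '12345',                    # body length in bytes
    'Server': 'nginx/1.18.0',                     # server software
    'Date': 'Mon, 25 Mar 2024 00:00:00 GMT',     # time the response was generated
    'Set-Cookie': 'session=abc123; HttpOnly',    # sets a cookie
    'Cache-Control': 'no-cache',                  # caching directives
    'Expires': 'Mon, 25 Mar 2024 00:00:00 GMT',  # expiry time
}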
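
The Content-Type response header is the one a crawler consults most often, because its charset decides how the body should be decoded. Here is a minimal sketch that splits it using the standard library's email header parser; the helper name `parse_content_type` is hypothetical.

```python
from email.message import Message
from typing import Optional, Tuple


def parse_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=utf-8")
print(media_type, charset)

# Fall back to UTF-8 when the server does not declare a charset (an assumed default)
body = "café".encode("utf-8")
text = body.decode(charset or "utf-8")
print(text)
```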

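Response Body

The response body is the actual content returned by the server. Common formats:

HTML - web pages
JSON - the usual choice for APIs
XML - used by some older APIs
Images, video, files - binary data
Plain text - simple text content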

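Cookies and Sessions

How Cookies Work

A cookie is a small piece of data that the server asks the client to store and send back on later requests; it is what keeps a session alive across requests.

Using Cookies in a Crawler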

import requests

# Option 1: set cookies manually
cookies = {
    'session_id': 'abc123',
    'user_token': 'xyz789'
}
response = requests.get('https://example.com/profile', cookies=cookies)

# Option 2: let a Session manage cookies automatically
session = requests.Session()

# Log in (the server sets cookies in its response)
session.post('https://example.com/login', data={
    'username': 'user',
    'password': 'pass'
})

# Later requests carry the cookies automatically
response = session.get('https://example.com/profile')
print(response.text)
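
Under the hood, the server sets a cookie with a `Set-Cookie` response header and the client echoes it back in a `Cookie` request header. The sketch below walks through that exchange with the standard library; the header value is a made-up example.

```python
from http.cookies import SimpleCookie

# What a server might send in a Set-Cookie response header (sample value)
set_cookie = "session=abc123; Path=/; HttpOnly; Max-Age=3600"

jar = SimpleCookie()
jar.load(set_cookie)

morsel = jar["session"]
print(morsel.value)       # the cookie value itself
print(morsel["path"])     # attribute: which paths the cookie applies to
print(morsel["max-age"])  # attribute: lifetime in seconds, kept as a string

# What the client sends back on later requests: only name=value pairs,
# never the attributes
cookie_header = "; ".join(f"{name}={m.value}" for name, m in jar.items())
print(cookie_header)
```

A `requests.Session` does the equivalent bookkeeping for you, including matching cookies to domains and paths.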

Sessions and Tokens

Session Authentication

The traditional, cookie-based way to authenticate:

import requests

# Session authentication flow
session = requests.Session()

# 1. Log in
login_data = {
    'username': 'your_username',
    'password': 'your_password'
}
session.post('https://example.com/login', data=login_data)

# 2. Access a page that requires login
response = session.get('https://example.com/member/info')

Token Authentication

The approach most modern APIs use:

import requests

# 1. Obtain a token
auth_data = {
    'username': 'your_username',
    'password': 'your_password'
}
response = requests.post('https://example.com/api/login', json=auth_data)
token = response.json()['token']

# 2. Send the token with each request
headers = {
    'Authorization': f'Bearer {token}'
}
response = requests.get('https://example.com/api/user', headers=headers)
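
Many bearer tokens are JSON Web Tokens (JWTs), whose payload can be read without the signing key. Assuming the API issues JWTs, the sketch below decodes the payload so a crawler can check the expiry time before reusing a token. The demo token is built locally and unsigned, and nothing here verifies a signature.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs omit."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(data: bytes) -> str:
    """Encode bytes as base64url without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwt_claims(token: str) -> dict:
    """Return the payload of a JWT without verifying its signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


# Build a demo token locally; the signature part is a placeholder
header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
payload = b64url_encode(json.dumps({"sub": "user-1", "exp": 1700000000}).encode())
demo_token = f"{header}.{payload}.signature-placeholder"

claims = jwt_claims(demo_token)
print(claims["sub"], claims["exp"])
```

The `exp` claim is a Unix timestamp; comparing it with `time.time()` tells you whether to request a fresh token.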

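HTTP Proxies

Routing requests through a proxy hides your real IP address and lowers the risk of being blocked:

import requests

# Use a proxy
proxies = {
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890'
}

response = requests.get('https://example.com', proxies=proxies)

# Proxy that requires authentication (user, password, and host are placeholders)
proxies_with_auth = {
    'http': 'http://user:password@proxy.example.com:7890',
}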
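
When proxy credentials contain characters such as `@` or `:`, they have to be percent-encoded or the proxy URL becomes ambiguous. The helper below is a hypothetical sketch (`build_proxy_url` is not a requests function) that builds the URL safely; host, port, and credentials are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(host, port, user=None, password=None, scheme="http"):
    """Build a proxy URL, percent-encoding the credentials if present."""
    if user is None:
        return f"{scheme}://{host}:{port}"
    # safe="" forces "@", ":" and "/" inside the credentials to be encoded too
    credentials = f"{quote(user, safe='')}:{quote(password or '', safe='')}"
    return f"{scheme}://{credentials}@{host}:{port}"


proxy_url = build_proxy_url("127.0.0.1", 7890, "user", "p@ss:word")
proxies = {"http": proxy_url, "https": proxy_url}
print(proxies)
```

The resulting dictionary has the same shape as the `proxies` argument used in this section.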

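HTTPS and SSL

HTTPS is the secure version of HTTP; it encrypts traffic with TLS (the successor to SSL):

import requests

# Skip certificate verification (not recommended; for testing only)
response = requests.get('https://example.com', verify=False)

# Verify the server against a custom CA bundle
response = requests.get('https://example.com', verify='/path/to/ca-bundle.pem')

# Present a client certificate (for servers that require one)
response = requests.get('https://example.com', cert='/path/to/cert.pem')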
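
To show what certificate verification means at a lower level, the sketch below compares a default TLS context with one that has verification switched off, using Python's `ssl` module. It opens no connection; it only inspects the settings. This is roughly the distinction behind `verify=True` and `verify=False`, although requests manages its own contexts internally.

```python
import ssl

# Default context: the server certificate must chain to a trusted CA
# and must match the host name being contacted
strict = ssl.create_default_context()
print(strict.verify_mode == ssl.CERT_REQUIRED)
print(strict.check_hostname)

# Insecure context: accepts any certificate (testing only)
insecure = ssl.create_default_context()
insecure.check_hostname = False       # must be disabled before verify_mode
insecure.verify_mode = ssl.CERT_NONE
print(insecure.verify_mode == ssl.CERT_NONE)
```

With verification off, the connection is still encrypted, but the client can no longer tell whether it is talking to the real server.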

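HTTP/2 and HTTP/3

HTTP keeps evolving along with the web. Knowing what HTTP/2 and HTTP/3 offer helps when you write high-performance crawlers.

Limitations of HTTP/1.1

HTTP/1.1 is still very widely deployed, but it has several performance bottlenecks:

Problem	Description	Impact
Head-of-line blocking	Responses on a connection must be returned in request order	One slow request holds up those behind it
Connection overhead	Clients open several connections per domain to get parallelism	Extra load on the server
Redundant headers	Every request carries a full set of headers	Wasted bandwidth
Text-based format	Messages are text that must be parsed line by line	Less efficient parsing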

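Core HTTP/2 Features

HTTP/2 improves performance substantially without changing HTTP semantics:

Multiplexing

HTTP/2 can carry many requests and responses concurrently over a single TCP connection:

# Use a library that supports HTTP/2 (install with: pip install httpx[http2])
import asyncio
import httpx

async def main():
    # With http2=True, httpx negotiates HTTP/2 when the server supports it
    async with httpx.AsyncClient(http2=True) as client:
        # These requests can share one connection and run concurrently
        responses = await asyncio.gather(
            client.get('https://example.com/api/1'),
            client.get('https://example.com/api/2'),
            client.get('https://example.com/api/3'),
        )
    return responses

responses = asyncio.run(main())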

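Advantages for crawlers:

Fewer TCP connections, so less load on the server
No head-of-line blocking at the HTTP layer, so concurrency improves
Better use of available bandwidth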

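Header Compression (HPACK)

HTTP/2 compresses headers with the HPACK algorithm to reduce the amount of data sent:

# HTTP/1.1 request headers (sent in full on every request)
GET /api/data HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept: application/json
Accept-Language: zh-CN,zh;q=0.9
Accept-Encoding: gzip, deflate, br
Connection: keep-alive

# HTTP/2 request headers (pseudo-headers replace the request line)
:method: GET
:path: /api/data
# Fields already sent on this connection are stored in an indexed table,
# so repeated fields are sent as short index references instead of full text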
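
The indexing idea can be illustrated with a toy model. The class below is emphatically not real HPACK (it has no static table, no Huffman coding, no size limits or eviction, and no binary encoding); `ToyHeaderTable` is a made-up name that only shows why a second request with the same headers is cheaper to send.

```python
class ToyHeaderTable:
    """Toy model of header indexing: repeated fields become integer indexes."""

    def __init__(self):
        self.entries = []  # (name, value) pairs already seen on this connection

    def encode(self, headers):
        encoded = []
        for field in headers:
            if field in self.entries:
                # Seen before: send only its position in the table
                encoded.append(self.entries.index(field))
            else:
                # New field: send it literally and remember it
                self.entries.append(field)
                encoded.append(field)
        return encoded


common = [("user-agent", "Mozilla/5.0"), ("accept", "application/json")]
table = ToyHeaderTable()

first = table.encode([(":path", "/api/1")] + common)
second = table.encode([(":path", "/api/2")] + common)

print(first)   # every field is sent literally
print(second)  # only the new :path is literal; the rest are indexes
```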

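Binary Framing

HTTP/2 splits each message into smaller binary frames:

Frame types:

DATA frame: carries the message body
HEADERS frame: carries header fields
SETTINGS frame: connection configuration
RST_STREAM frame: cancels a stream
PING frame: liveness and round-trip check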
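
Every HTTP/2 frame starts with a fixed 9-byte header: a 24-bit payload length, an 8-bit type, an 8-bit flags field, and a reserved bit followed by a 31-bit stream identifier. The sketch below decodes that header with `struct`; the function name `parse_frame_header` is hypothetical, and the sample bytes are hand-written rather than captured from a real connection.

```python
import struct

# Type codes for the frame types listed above
FRAME_TYPES = {0x0: "DATA", 0x1: "HEADERS", 0x3: "RST_STREAM", 0x4: "SETTINGS", 0x6: "PING"}


def parse_frame_header(data: bytes) -> dict:
    """Decode the 9-byte header that precedes every HTTP/2 frame."""
    if len(data) < 9:
        raise ValueError("an HTTP/2 frame header is 9 bytes long")
    # 24-bit length is read as 1 + 2 bytes, then type, flags, 32-bit stream field
    length_hi, length_lo, type_code, flags, stream = struct.unpack(">BHBBI", data[:9])
    return {
        "length": (length_hi << 16) | length_lo,
        "type": FRAME_TYPES.get(type_code, f"UNKNOWN(0x{type_code:x})"),
        "flags": flags,
        "stream_id": stream & 0x7FFFFFFF,  # clear the reserved top bit
    }


# A HEADERS frame: 16-byte payload, flags 0x05, on stream 1
sample = bytes([0x00, 0x00, 0x10, 0x01, 0x05, 0x00, 0x00, 0x00, 0x01])
print(parse_frame_header(sample))
```

Stream 0 is reserved for connection-level frames such as SETTINGS and PING; request and response frames carry the identifier of the stream they belong to, which is what makes multiplexing possible.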

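Server Push

A server can proactively push resources to the client:

# Note: Chrome has removed support for server push,
# although some servers still implement it

# httpx has no API for pushed resources, so this only reads the main response;
# receiving pushes requires a lower-level HTTP/2 library
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient(http2=True) as client:
        async with client.stream('GET', 'https://example.com') as response:
            # Read the main response body
            content = await response.aread()
    return content

content = asyncio.run(main())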

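Stream Priority

HTTP/2 lets a client assign priorities to requests:

# HTTP/2 lets a client give each stream a weight and a dependency
# A crawler could use this to load critical resources first
# Note: RFC 9113 deprecates this priority scheme, and many servers ignore it

# Most Python HTTP libraries handle prioritization for you
# For fine-grained control, use a lower-level library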

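HTTP/3: Built on QUIC

HTTP/3 is the latest HTTP version. It replaces TCP with QUIC, a transport protocol that runs over UDP:

Advantages of HTTP/3

Feature	HTTP/2 (TCP)	HTTP/3 (QUIC)
Connection setup	TCP handshake plus TLS handshake (2-3 RTT)	0-1 RTT
Head-of-line blocking	Still present at the TCP layer	Removed at the transport layer
Connection migration	A change of IP address forces a reconnect	Seamless migration supported
Packet-loss recovery	Stalls every stream on the connection	Affects only the stream that lost data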

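HTTP/3 Support in Python

# httpx speaks HTTP/1.1 and HTTP/2 but has no http3 option,
# so HTTP/3 itself needs a QUIC-based library such as aioquic.
# httpx can still show whether a server advertises HTTP/3 in its Alt-Svc response header.
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get('https://example.com')
        print(f'HTTP version: {response.http_version}')       # HTTP/2 or HTTP/1.1
        print(f"Alt-Svc: {response.headers.get('alt-svc')}")  # mentions h3 if HTTP/3 is offered

asyncio.run(main())

# Install HTTP/2 support for httpx
pip install httpx[http2]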

Using HTTP/2 and HTTP/3 in Crawlers

Performance Comparison

import asyncio
import httpx
import time

async def benchmark():
    urls = [f'https://httpbin.org/delay/1?id={i}' for i in range(10)]
    
    # HTTP/1.1
    start = time.time()
    async with httpx.AsyncClient(http2=False) as client:
        tasks = [client.get(url) for url in urls]
        await asyncio.gather(*tasks)
    http1_time = time.time() - start
    
    # HTTP/2
    start = time.time()
    async with httpx.AsyncClient(http2=True) as client:
        tasks = [client.get(url) for url in urls]
        await asyncio.gather(*tasks)
    http2_time = time.time() - start
    
    print(f'HTTP/1.1: {http1_time:.2f}s')
    print(f'HTTP/2:   {http2_time:.2f}s')

asyncio.run(benchmark())

Best Practices

import httpx

class ModernSpider:
    """Crawler built on a modern HTTP client."""

    def __init__(self):
        # httpx has no http3 option, so only HTTP/2 is enabled here
        self.client = httpx.AsyncClient(
            http2=True,           # enable HTTP/2 (falls back to HTTP/1.1)
            limits=httpx.Limits(
                max_keepalive_connections=20,
                max_connections=100,
                keepalive_expiry=30.0,
            ),
            timeout=httpx.Timeout(30.0),
        )

    async def fetch(self, url):
        """Fetch a page."""
        response = await self.client.get(url)

        # Show which HTTP version was negotiated
        print(f'HTTP version: {response.http_version}')

        return response

    async def close(self):
        """Close the client."""
        await self.client.aclose()

Detecting the HTTP Version a Server Supports

import asyncio
import httpx

async def check_http_version(url):
    """Report the HTTP version negotiated with the server."""

    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get(url)

        version = response.http_version
        # Servers advertise HTTP/3 in the Alt-Svc header (for example h3=":443")
        if 'h3' in response.headers.get('alt-svc', ''):
            print('Server advertises HTTP/3')
        if version == 'HTTP/2':
            print('Server supports HTTP/2')
        else:
            print('Server negotiated HTTP/1.1')

        return version

# Run the check
asyncio.run(check_http_version('https://www.google.com'))

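Caveats

Compatibility: not every server supports HTTP/2 or HTTP/3
Harder debugging: binary protocols are harder to inspect than text protocols
Library support: make sure your HTTP library supports the version you need
Fallback: modern HTTP libraries negotiate the best version both sides support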

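Summary

In this chapter we covered:

URL structure - scheme, host, port, path, query string
HTTP request methods - GET, POST, and others
Request and response headers - HTTP metadata
Status codes - the outcome of a request
Cookies and sessions - session management
Proxies - hiding your real IP address

A solid grasp of these HTTP fundamentals is essential for writing efficient crawlers.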

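Exercises

Use your browser's developer tools to inspect the request and response headers of a web page
Work out how the query parameters in a URL are constructed
Make sure you understand common status codes such as 301, 302, 404, and 429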
