# A Deep Dive into httpx

httpx is a modern Python HTTP client library. It retains the clean API of requests while adding native async support and HTTP/2 capability. For crawler projects that need high performance, httpx is an excellent alternative to requests.

> This tutorial is based on the official httpx documentation.
> Current version: v0.28.1 | Requires Python 3.9+
## Why httpx?

### httpx vs requests

| Feature | httpx | requests |
|---|---|---|
| Sync API | ✅ | ✅ |
| Async API | ✅ native | ❌ needs third-party libraries |
| HTTP/1.1 | ✅ | ✅ |
| HTTP/2 | ✅ | ❌ |
| Connection pooling | ✅ | ✅ |
| Default timeout | 5 seconds | none |
| Type annotations | ✅ complete | ❌ |
| WebSocket | ❌ (third-party httpx-ws) | ❌ |
### Use Cases

- Async crawlers: scenarios that need highly concurrent requests
- HTTP/2 support: sites that require the HTTP/2 protocol
- Modern API clients: type hints and better IDE support
- ASGI/WSGI testing: calling Python web applications directly, without a server
- Streaming scenarios: large downloads, Server-Sent Events
## Installation

```bash
# Basic installation
pip install httpx

# With HTTP/2 support
pip install 'httpx[http2]'

# With SOCKS proxy support
pip install 'httpx[socks]'

# With the command-line tool
pip install 'httpx[cli]'

# With all optional dependencies
pip install 'httpx[http2,socks,brotli,zstd,cli]'
```
### Command-Line Tool

After installing the cli extra, you can use httpx directly from the command line:

```bash
# Send a request
httpx https://httpbin.org/get

# POST request (the method is passed with -m, form fields with -d)
httpx -m POST https://httpbin.org/post -d key value

# Show help
httpx --help
```
## Quick Start

### Synchronous Requests

The synchronous API of httpx is highly compatible with requests:

```python
import httpx

# GET request
response = httpx.get('https://httpbin.org/get')
print(response.status_code)  # 200
print(response.text)         # response body as text
print(response.json())       # parsed JSON

# POST request
response = httpx.post(
    'https://httpbin.org/post',
    data={'key': 'value'}
)

# Other request methods
response = httpx.put('https://httpbin.org/put', data={'key': 'value'})
response = httpx.delete('https://httpbin.org/delete')
response = httpx.head('https://httpbin.org/get')
response = httpx.options('https://httpbin.org/get')
response = httpx.patch('https://httpbin.org/patch', data={'key': 'value'})
```
### Asynchronous Requests

httpx supports async natively; this is its main advantage over requests:

```python
import httpx
import asyncio

async def fetch():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.status_code)
        print(response.json())

asyncio.run(fetch())
```
## Request Parameters

### URL Parameters

Use the params argument to add query parameters:

```python
import httpx

# As a dict
params = {'key1': 'value1', 'key2': 'value2'}
response = httpx.get('https://httpbin.org/get', params=params)
print(response.url)  # https://httpbin.org/get?key1=value1&key2=value2

# As a dict of lists (multi-value parameters)
params = {'key': ['value1', 'value2']}
response = httpx.get('https://httpbin.org/get', params=params)
print(response.url)  # https://httpbin.org/get?key=value1&key=value2

# As a list of tuples
params = [('key1', 'value1'), ('key2', 'value2')]
response = httpx.get('https://httpbin.org/get', params=params)
```
### Request Headers

Use the headers argument to set request headers:

```python
import httpx

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json',
    'Authorization': 'Bearer token123',
}
response = httpx.get('https://httpbin.org/headers', headers=headers)
print(response.json())
```
### Cookies

Use the cookies argument to send cookies:

```python
import httpx

# As a dict
cookies = {'session_id': 'abc123', 'user_token': 'xyz789'}
response = httpx.get('https://httpbin.org/cookies', cookies=cookies)
print(response.json())

# As an httpx.Cookies object
cookies = httpx.Cookies()
cookies.set('cookie_name', 'cookie_value', domain='httpbin.org')
response = httpx.get('https://httpbin.org/cookies', cookies=cookies)

# Read cookies from the response
response = httpx.get('https://httpbin.org/cookies/set?name=value')
print(response.cookies['name'])  # value
```
## Request Body

### Form Data

```python
import httpx

# Form-encoded data
data = {'username': 'admin', 'password': '123456'}
response = httpx.post('https://httpbin.org/post', data=data)
print(response.json()['form'])

# Multi-value form fields
data = {'items': ['item1', 'item2', 'item3']}
response = httpx.post('https://httpbin.org/post', data=data)
```

### JSON Data

```python
import httpx

# JSON body (Content-Type is set automatically)
json_data = {'name': 'Zhang San', 'age': 25, 'hobbies': ['Python', 'crawling']}
response = httpx.post('https://httpbin.org/post', json=json_data)
print(response.json()['json'])
```

### File Uploads

```python
import httpx

# Upload a file
with open('report.pdf', 'rb') as f:
    files = {'file': f}
    response = httpx.post('https://httpbin.org/post', files=files)

# Specify the filename and content type
with open('report.pdf', 'rb') as f:
    files = {'file': ('report.pdf', f, 'application/pdf')}
    response = httpx.post('https://httpbin.org/post', files=files)

# Upload multiple files (remember to close these handles in real code)
files = [
    ('files', ('file1.txt', open('file1.txt', 'rb'))),
    ('files', ('file2.txt', open('file2.txt', 'rb'))),
]
response = httpx.post('https://httpbin.org/post', files=files)

# Mix files and form fields
data = {'description': 'file description'}
with open('report.pdf', 'rb') as f:
    files = {'file': f}
    response = httpx.post('https://httpbin.org/post', data=data, files=files)
```

### Binary Data

```python
import httpx

# Send raw binary data
content = b'Hello, World!'
response = httpx.post(
    'https://httpbin.org/post',
    content=content,
    headers={'Content-Type': 'application/octet-stream'}
)
```
## Response Handling

### Response Attributes

```python
import httpx

response = httpx.get('https://httpbin.org/get')

# Status code
print(response.status_code)    # 200
print(response.reason_phrase)  # 'OK'

# Status checks
print(response.is_success)       # 2xx
print(response.is_redirect)      # 3xx
print(response.is_client_error)  # 4xx
print(response.is_server_error)  # 5xx

# Headers
print(response.headers)
print(response.headers['content-type'])
print(response.headers.get('Content-Type'))

# URL information
print(response.url)      # final URL
print(response.request)  # the request object

# Encoding
print(response.encoding)     # detected encoding
response.encoding = 'utf-8'  # set the encoding manually
```

### Response Content

```python
import httpx

response = httpx.get('https://httpbin.org/get')

# Text (decoded automatically)
print(response.text)

# Raw bytes
print(response.content)

# Parsed JSON
print(response.json())

# Streaming reads (large files)
with httpx.stream('GET', 'https://example.com/large-file') as response:
    for chunk in response.iter_bytes(chunk_size=8192):
        process(chunk)  # process() stands in for your own chunk handler
```

### Status Codes

```python
import httpx

# Check the status code
response = httpx.get('https://httpbin.org/get')
if response.status_code == httpx.codes.OK:
    print('request succeeded')

# Raise on non-2xx status codes
try:
    response = httpx.get('https://httpbin.org/status/404')
    response.raise_for_status()
except httpx.HTTPStatusError as e:
    print(f'HTTP error: {e.response.status_code}')
    print(f'request URL: {e.request.url}')

# Chained calls (raise_for_status() returns the response)
data = httpx.get('https://api.example.com/data').raise_for_status().json()
```
## Client Sessions

A Client gives you connection reuse, Cookie persistence, and shared default configuration.

### Synchronous Client

```python
import httpx

# As a context manager (recommended)
with httpx.Client() as client:
    # All requests share the connection pool and Cookie jar
    response = client.get('https://httpbin.org/get')
    response = client.post('https://httpbin.org/post', json={'key': 'value'})

# Manual management
client = httpx.Client()
try:
    response = client.get('https://httpbin.org/get')
finally:
    client.close()
```
### Configuring a Client

```python
import httpx

# Common configuration options
client = httpx.Client(
    base_url='https://api.example.com',   # base URL
    headers={'User-Agent': 'MyBot/1.0'},  # default headers
    cookies={'session': 'abc123'},        # default cookies
    timeout=30.0,                         # timeout in seconds
    follow_redirects=True,                # follow redirects
    max_redirects=20,                     # maximum number of redirects
    http2=True,                           # enable HTTP/2
    verify=True,                          # SSL verification
)

with client:
    # Paths are resolved against base_url
    response = client.get('/users')  # https://api.example.com/users
    response = client.post('/login', json={'user': 'admin', 'pass': '123'})
```
### Asynchronous Client

```python
import httpx
import asyncio

async def main():
    # Basic async client
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.json())

    # Async client with configuration
    async with httpx.AsyncClient(
        base_url='https://api.example.com',
        timeout=30.0,
        http2=True
    ) as client:
        # Concurrent requests
        tasks = [
            client.get(f'/users/{i}')
            for i in range(10)
        ]
        responses = await asyncio.gather(*tasks)
        for response in responses:
            print(response.json())

asyncio.run(main())
```
## Timeouts

httpx applies a 5-second timeout by default, an important difference from requests.

### Basic Timeouts

```python
import httpx

# Per-request timeout
response = httpx.get('https://httpbin.org/delay/1', timeout=10.0)

# Disable the timeout
response = httpx.get('https://httpbin.org/delay/1', timeout=None)

# Set it on the Client
with httpx.Client(timeout=30.0) as client:
    response = client.get('https://httpbin.org/get')
```

### Fine-Grained Timeouts

```python
import httpx

# Separate timeouts for each phase
timeout = httpx.Timeout(
    connect=5.0,  # connection timeout
    read=10.0,    # read timeout
    write=10.0,   # write timeout
    pool=5.0      # wait for a connection from the pool
)
response = httpx.get('https://httpbin.org/get', timeout=timeout)

# Shorthand (applies the value to connect, read, write, and pool)
timeout = httpx.Timeout(10.0)

# Use it on a Client
with httpx.Client(timeout=timeout) as client:
    response = client.get('https://httpbin.org/get')
```
## Proxies

As of httpx 0.28 the old `proxies` argument has been removed; pass `proxy=` for a single proxy, or `mounts=` for per-scheme routing.

### HTTP Proxies

```python
import httpx

# Single proxy for all traffic
response = httpx.get('https://httpbin.org/ip', proxy='http://127.0.0.1:7890')

# Proxy with authentication
response = httpx.get('https://httpbin.org/ip', proxy='http://user:[email protected]:8080')

# Per-scheme routing via mounts
mounts = {
    'http://': httpx.HTTPTransport(proxy='http://127.0.0.1:7890'),
    'https://': httpx.HTTPTransport(proxy='http://127.0.0.1:7890'),
}
with httpx.Client(mounts=mounts) as client:
    response = client.get('https://httpbin.org/ip')

# Or set a single proxy on the Client
with httpx.Client(proxy='http://127.0.0.1:7890') as client:
    response = client.get('https://httpbin.org/ip')
```
### SOCKS Proxies

Requires httpx[socks]:

```python
import httpx

# SOCKS5 proxy
response = httpx.get('https://httpbin.org/ip', proxy='socks5://127.0.0.1:1080')

# SOCKS5 proxy with authentication
response = httpx.get('https://httpbin.org/ip', proxy='socks5://user:[email protected]:1080')
```
### Proxy Pools

```python
import httpx
import random

proxies_list = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def get_random_proxy():
    return random.choice(proxies_list)

# Use a different proxy for each request
# (urls is assumed to be an iterable of target URLs)
for url in urls:
    proxy = get_random_proxy()
    response = httpx.get(url, proxy=proxy)
```
## Streaming Responses

For large downloads, stream the response to avoid loading everything into memory.

### Synchronous Streaming

```python
import httpx

# Stream a file download to disk
with httpx.stream('GET', 'https://example.com/large-file.zip') as response:
    with open('large-file.zip', 'wb') as f:
        for chunk in response.iter_bytes(chunk_size=8192):
            f.write(chunk)

# Stream text line by line
with httpx.stream('GET', 'https://example.com/large-text.txt') as response:
    for line in response.iter_lines():
        print(line)

# Conditional reading
with httpx.stream('GET', 'https://example.com/file') as response:
    content_length = int(response.headers.get('content-length', 0))
    if content_length < 10_000_000:  # smaller than 10 MB
        response.read()
        print(response.text)
    else:
        # Stream the large file instead
        for chunk in response.iter_bytes():
            process(chunk)  # process() stands in for your own chunk handler
```
### Asynchronous Streaming

```python
import httpx
import asyncio

async def download_file(url, filename):
    async with httpx.AsyncClient() as client:
        async with client.stream('GET', url) as response:
            with open(filename, 'wb') as f:
                async for chunk in response.aiter_bytes(chunk_size=8192):
                    f.write(chunk)

async def main():
    await download_file(
        'https://example.com/large-file.zip',
        'large-file.zip'
    )

asyncio.run(main())
```
## Exception Handling

httpx defines a clean exception hierarchy:

```python
import httpx

try:
    response = httpx.get('https://example.com/api', timeout=5.0)
    response.raise_for_status()
except httpx.TimeoutException as e:
    print(f'request timed out: {e}')
except httpx.ConnectError as e:
    print(f'connection failed: {e}')
except httpx.ReadError as e:
    print(f'read error: {e}')
except httpx.WriteError as e:
    print(f'write error: {e}')
except httpx.HTTPStatusError as e:
    print(f'HTTP status error: {e.response.status_code}')
    print(f'request URL: {e.request.url}')
except httpx.RequestError as e:
    # Base class for all errors raised while issuing a request
    print(f'request error: {e}')
except httpx.HTTPError as e:
    # Base class for both RequestError and HTTPStatusError
    print(f'HTTP error: {e}')
```
## Redirects

```python
import httpx

# Redirects are not followed by default
response = httpx.get('http://github.com')
print(response.status_code)   # 301
print(response.next_request)  # the request that would follow the redirect

# Enable redirect following
response = httpx.get('http://github.com', follow_redirects=True)
print(response.status_code)  # 200
print(response.url)          # https://github.com
print(response.history)      # [<Response [301 Moved Permanently]>]

# Set it on the Client
with httpx.Client(follow_redirects=True, max_redirects=10) as client:
    response = client.get('http://github.com')
```
## Authentication

### Basic Auth

```python
import httpx

# As a tuple
response = httpx.get('https://api.example.com', auth=('username', 'password'))

# As an httpx.BasicAuth object
auth = httpx.BasicAuth('username', 'password')
response = httpx.get('https://api.example.com', auth=auth)
```

### Digest Auth

```python
import httpx

auth = httpx.DigestAuth('username', 'password')
response = httpx.get('https://api.example.com', auth=auth)
```

### Bearer Token

```python
import httpx

headers = {'Authorization': 'Bearer your_token_here'}
response = httpx.get('https://api.example.com', headers=headers)

# Or pass a callable as the auth argument
class BearerAuth:
    def __init__(self, token):
        self.token = token

    def __call__(self, request):
        request.headers['Authorization'] = f'Bearer {self.token}'
        return request

response = httpx.get('https://api.example.com', auth=BearerAuth('your_token'))
```
## HTTP/2 Support

httpx supports the HTTP/2 protocol; install httpx[http2] first:

```python
import httpx
import asyncio

# Sync client with HTTP/2
with httpx.Client(http2=True) as client:
    response = client.get('https://nghttp2.org/httpbin/get')
    print(response.http_version)  # HTTP/2

# Async client with HTTP/2
async def main():
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get('https://nghttp2.org/httpbin/get')
        print(response.http_version)  # HTTP/2

asyncio.run(main())
```
## Advanced Usage

### Retries

The transport-level retries option retries failed connection attempts; it does not retry on HTTP error status codes:

```python
import httpx
import asyncio

# Connection retries via the transport
transport = httpx.HTTPTransport(retries=3)
with httpx.Client(transport=transport) as client:
    response = client.get('https://httpbin.org/get')

# Async client
async def main():
    transport = httpx.AsyncHTTPTransport(retries=3)
    async with httpx.AsyncClient(transport=transport) as client:
        response = await client.get('https://httpbin.org/get')

asyncio.run(main())
```
### Connection Limits and Custom Transports

```python
import httpx

# Tune the connection pool
limits = httpx.Limits(
    max_keepalive_connections=20,  # idle keep-alive connections
    max_connections=100,           # total connections
    keepalive_expiry=30.0          # keep-alive lifetime in seconds
)
with httpx.Client(limits=limits) as client:
    response = client.get('https://httpbin.org/get')

# Custom transport
transport = httpx.HTTPTransport(
    retries=2,
    http2=True
)
with httpx.Client(transport=transport) as client:
    response = client.get('https://httpbin.org/get')
```
### Event Hooks

```python
import httpx

def log_request(request):
    print(f'request: {request.method} {request.url}')

def log_response(response):
    print(f'response: {response.status_code}')

with httpx.Client(
    event_hooks={
        'request': [log_request],
        'response': [log_response]
    }
) as client:
    response = client.get('https://httpbin.org/get')
```
## Complete Crawler Examples

### Synchronous Crawler

```python
import httpx
from bs4 import BeautifulSoup
import time
import random

class HttpxSpider:
    def __init__(self):
        self.client = httpx.Client(
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            },
            timeout=30.0,
            follow_redirects=True
        )

    def get(self, url, **kwargs):
        """Send a GET request with a polite random delay."""
        time.sleep(random.uniform(0.5, 1.5))
        response = self.client.get(url, **kwargs)
        response.raise_for_status()
        return response

    def parse_html(self, response):
        """Parse the HTML body."""
        soup = BeautifulSoup(response.text, 'lxml')
        return soup

    def extract_links(self, soup):
        """Extract all links from the page."""
        links = []
        for a in soup.find_all('a', href=True):
            links.append(a['href'])
        return links

    def close(self):
        """Close the client."""
        self.client.close()

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()

# Usage
with HttpxSpider() as spider:
    response = spider.get('https://httpbin.org/links/10/0')
    soup = spider.parse_html(response)
    links = spider.extract_links(soup)
    print(f'found {len(links)} links')
```
### Asynchronous Crawler

```python
import httpx
import asyncio
import random
import time
from bs4 import BeautifulSoup

class AsyncHttpxSpider:
    def __init__(self, max_concurrent=10, timeout=30.0):
        self.max_concurrent = max_concurrent
        self.timeout = httpx.Timeout(timeout)
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.client = None

    async def __aenter__(self):
        self.client = httpx.AsyncClient(
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            },
            timeout=self.timeout,
            follow_redirects=True,
            http2=True
        )
        return self

    async def __aexit__(self, *args):
        await self.client.aclose()

    async def fetch(self, url, **kwargs):
        """Fetch a single page, bounded by the semaphore."""
        async with self.semaphore:
            await asyncio.sleep(random.uniform(0.1, 0.5))
            response = await self.client.get(url, **kwargs)
            response.raise_for_status()
            return response

    async def fetch_all(self, urls):
        """Fetch multiple pages concurrently."""
        tasks = [self.fetch(url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

    def parse_html(self, response):
        """Parse the HTML body."""
        return BeautifulSoup(response.text, 'lxml')

# Usage
async def main():
    urls = [f'https://httpbin.org/delay/1?id={i}' for i in range(20)]
    async with AsyncHttpxSpider(max_concurrent=5) as spider:
        start = time.time()
        results = await spider.fetch_all(urls)
        success = sum(1 for r in results if not isinstance(r, Exception))
        print(f'done: {success}/{len(urls)} requests')
        print(f'elapsed: {time.time() - start:.2f}s')

asyncio.run(main())
```
## Migrating from requests

The httpx API is highly compatible with requests; in most cases a simple swap is enough:

```python
# requests
import requests
response = requests.get('https://httpbin.org/get')
data = response.json()

# httpx
import httpx
response = httpx.get('https://httpbin.org/get')
data = response.json()
```

### Key Differences

| Feature | requests | httpx |
|---|---|---|
| Default timeout | none | 5 seconds |
| Async support | none | native |
| HTTP/2 | unsupported | supported |
| Response encoding | response.encoding | response.encoding |
| JSON parsing | response.json() | response.json() |
| Binary content | response.content | response.content |
| Text content | response.text | response.text |
### Migration Notes

- Timeouts: httpx defaults to a 5-second timeout, which you may need to adjust
- Redirects: httpx does not follow redirects by default; set follow_redirects=True explicitly
- Client usage: prefer a Client for connection-pool management
- Exception types: exception class names differ slightly
## Summary

In this chapter we covered:

- Introduction to httpx - a modern HTTP client with sync/async APIs and HTTP/2
- Installation - basic install and optional extras
- Basic usage - GET/POST and the other request methods
- Request parameters - params, headers, cookies, request bodies
- Response handling - status codes, content, JSON parsing
- Client sessions - connection reuse and shared configuration
- Timeouts - the default timeout and fine-grained control
- Proxies - HTTP/SOCKS proxies
- Streaming responses - large file downloads
- Exception handling - a clean exception hierarchy
- HTTP/2 support - enabling the HTTP/2 protocol
## Exercises

- Implement a crawler with retry support using httpx
- Crawl multiple pages concurrently with AsyncClient
- Download a large file with a streaming response
- Migrate a piece of requests code to httpx