The httpx Library in Depth

httpx is a modern HTTP client library for Python. It inherits the clean API of requests while adding native async support and HTTP/2 capability. For crawler projects that need high throughput, httpx is an excellent alternative to requests.

Official documentation

This tutorial is based on the official httpx documentation.

Current version: v0.28.1 | Requires Python 3.9+

Why choose httpx?

httpx vs requests

| Feature            | httpx                | requests                        |
|--------------------|----------------------|---------------------------------|
| Sync API           | ✅                   | ✅                              |
| Async API          | ✅ native            | ❌ needs a third-party library  |
| HTTP/1.1           | ✅                   | ✅                              |
| HTTP/2             | ✅ (optional extra)  | ❌                              |
| Connection pooling | ✅                   | ✅                              |
| Default timeout    | 5 seconds            | none                            |
| Type annotations   | ✅ complete          | ❌ (third-party stubs)          |
| WebSocket          | ❌                   | ❌                              |

Use cases

  • Async crawlers: scenarios that need highly concurrent requests
  • HTTP/2 support: sites that require the HTTP/2 protocol
  • Modern API clients: type hints and better IDE support
  • ASGI/WSGI testing: calling a Python web application directly, without a network socket
  • Long-lived connections: streaming responses and Server-Sent Events (note that httpx itself does not speak WebSocket)

Installation

# Basic installation
pip install httpx

# With HTTP/2 support (quote the extras so your shell does not expand the brackets)
pip install 'httpx[http2]'

# With SOCKS proxy support
pip install 'httpx[socks]'

# With the command-line tool
pip install 'httpx[cli]'

# With all the common optional dependencies
pip install 'httpx[http2,socks,brotli,zstd]'

Command-line tool

After installing the cli extra, httpx can be invoked directly from the command line:

# Send a request
httpx https://httpbin.org/get

# POST request (form fields are passed as -d NAME VALUE)
httpx -m POST https://httpbin.org/post -d key value

# Show help
httpx --help

Quick start

Synchronous requests

httpx's sync API is highly compatible with requests:

import httpx

# GET request
response = httpx.get('https://httpbin.org/get')
print(response.status_code) # 200
print(response.text) # response body as text
print(response.json()) # parsed JSON

# POST request
response = httpx.post(
    'https://httpbin.org/post',
    data={'key': 'value'}
)

# Other request methods
response = httpx.put('https://httpbin.org/put', data={'key': 'value'})
response = httpx.delete('https://httpbin.org/delete')
response = httpx.head('https://httpbin.org/get')
response = httpx.options('https://httpbin.org/get')
response = httpx.patch('https://httpbin.org/patch', data={'key': 'value'})

Asynchronous requests

httpx supports async natively, which is its main advantage over requests:

import httpx
import asyncio

async def fetch():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.status_code)
        print(response.json())

asyncio.run(fetch())

Request parameters

URL parameters

Use the params argument to add query parameters:

import httpx

# As a dict
params = {'key1': 'value1', 'key2': 'value2'}
response = httpx.get('https://httpbin.org/get', params=params)
print(response.url) # https://httpbin.org/get?key1=value1&key2=value2

# As a dict of lists (multi-value parameters)
params = {'key': ['value1', 'value2']}
response = httpx.get('https://httpbin.org/get', params=params)
print(response.url) # https://httpbin.org/get?key=value1&key=value2

# As a list of tuples
params = [('key1', 'value1'), ('key2', 'value2')]
response = httpx.get('https://httpbin.org/get', params=params)

Request headers

Use the headers argument to set request headers:

import httpx

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json',
    'Authorization': 'Bearer token123',
}

response = httpx.get('https://httpbin.org/headers', headers=headers)
print(response.json())

Cookies

Use the cookies argument to send cookies:

import httpx

# As a dict
cookies = {'session_id': 'abc123', 'user_token': 'xyz789'}
response = httpx.get('https://httpbin.org/cookies', cookies=cookies)
print(response.json())

# As an httpx.Cookies object
cookies = httpx.Cookies()
cookies.set('cookie_name', 'cookie_value', domain='httpbin.org')
response = httpx.get('https://httpbin.org/cookies', cookies=cookies)

# Reading cookies from a response
response = httpx.get('https://httpbin.org/cookies/set?name=value')
print(response.cookies['name']) # value

Request body

Form data

import httpx

# Form data
data = {'username': 'admin', 'password': '123456'}
response = httpx.post('https://httpbin.org/post', data=data)
print(response.json()['form'])

# Multi-value form fields
data = {'items': ['item1', 'item2', 'item3']}
response = httpx.post('https://httpbin.org/post', data=data)

JSON data

import httpx

# JSON data (Content-Type is set automatically)
json_data = {'name': '张三', 'age': 25, 'hobbies': ['Python', 'crawlers']}
response = httpx.post('https://httpbin.org/post', json=json_data)
print(response.json()['json'])

File upload

import httpx

# Upload a file
with open('report.pdf', 'rb') as f:
    files = {'file': f}
    response = httpx.post('https://httpbin.org/post', files=files)

# With an explicit filename and content type
with open('report.pdf', 'rb') as f:
    files = {'file': ('report.pdf', f, 'application/pdf')}
    response = httpx.post('https://httpbin.org/post', files=files)

# Multiple files (remember to close these handles when done)
files = [
    ('files', ('file1.txt', open('file1.txt', 'rb'))),
    ('files', ('file2.txt', open('file2.txt', 'rb'))),
]
response = httpx.post('https://httpbin.org/post', files=files)

# Files mixed with form data
data = {'description': 'file description'}
with open('report.pdf', 'rb') as f:
    files = {'file': f}
    response = httpx.post('https://httpbin.org/post', data=data, files=files)

Binary data

import httpx

# Send raw bytes
content = b'Hello, World!'
response = httpx.post(
    'https://httpbin.org/post',
    content=content,
    headers={'Content-Type': 'application/octet-stream'}
)

Response handling

Response attributes

import httpx

response = httpx.get('https://httpbin.org/get')

# Status code
print(response.status_code) # 200
print(response.reason_phrase) # 'OK'

# Status-code checks
print(response.is_success) # 2xx
print(response.is_redirect) # 3xx
print(response.is_client_error) # 4xx
print(response.is_server_error) # 5xx

# Response headers
print(response.headers)
print(response.headers['content-type'])
print(response.headers.get('Content-Type'))

# URL information
print(response.url) # final URL
print(response.request) # the request object

# Encoding
print(response.encoding) # detected encoding
response.encoding = 'utf-8' # set the encoding manually (before accessing .text)

Response content

import httpx

response = httpx.get('https://httpbin.org/get')

# Text content (decoded automatically)
print(response.text)

# Raw bytes
print(response.content)

# Parsed JSON
print(response.json())

# Streaming reads (large files)
with httpx.stream('GET', 'https://example.com/large-file') as response:
    for chunk in response.iter_bytes(chunk_size=8192):
        process(chunk)  # handle the chunk (process() is a placeholder)

Status codes

import httpx

# Check the status code
response = httpx.get('https://httpbin.org/get')
if response.status_code == httpx.codes.OK:
    print('request succeeded')

# Raise an exception on non-2xx status codes
try:
    response = httpx.get('https://httpbin.org/status/404')
    response.raise_for_status()
except httpx.HTTPStatusError as e:
    print(f'HTTP error: {e.response.status_code}')
    print(f'request URL: {e.request.url}')

# Chained calls (raise_for_status() returns the response)
data = httpx.get('https://api.example.com/data').raise_for_status().json()

Client sessions

A Client gives you connection reuse, cookie persistence, and shared default configuration.

Sync Client

import httpx

# As a context manager (recommended)
with httpx.Client() as client:
    # All requests share the connection pool and cookies
    response = client.get('https://httpbin.org/get')
    response = client.post('https://httpbin.org/post', json={'key': 'value'})

# Manual lifecycle management
client = httpx.Client()
try:
    response = client.get('https://httpbin.org/get')
finally:
    client.close()

Configuring a Client

import httpx

# Common options
client = httpx.Client(
    base_url='https://api.example.com', # base URL
    headers={'User-Agent': 'MyBot/1.0'}, # default headers
    cookies={'session': 'abc123'}, # default cookies
    timeout=30.0, # timeout in seconds
    follow_redirects=True, # follow redirects
    max_redirects=20, # maximum number of redirects
    http2=True, # enable HTTP/2
    verify=True, # SSL verification
)

with client:
    # Requests are resolved relative to base_url
    response = client.get('/users') # https://api.example.com/users
    response = client.post('/login', json={'user': 'admin', 'pass': '123'})

Async Client

import httpx
import asyncio

async def main():
    # Basic async client
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.json())

    # Async client with configuration
    async with httpx.AsyncClient(
        base_url='https://api.example.com',
        timeout=30.0,
        http2=True
    ) as client:
        # Concurrent requests
        tasks = [
            client.get(f'/users/{i}')
            for i in range(10)
        ]
        responses = await asyncio.gather(*tasks)

        for response in responses:
            print(response.json())

asyncio.run(main())

Timeouts

httpx applies a 5-second timeout by default, which is an important difference from requests.

Basic timeouts

import httpx

# A single timeout for the whole request
response = httpx.get('https://httpbin.org/delay/1', timeout=10.0)

# Disable the timeout
response = httpx.get('https://httpbin.org/delay/1', timeout=None)

# Set on a Client
with httpx.Client(timeout=30.0) as client:
    response = client.get('https://httpbin.org/get')

Fine-grained timeout control

import httpx

# Separate timeouts for each phase
timeout = httpx.Timeout(
    connect=5.0, # connect timeout
    read=10.0, # read timeout
    write=10.0, # write timeout
    pool=5.0 # time to wait for a connection from the pool
)

response = httpx.get('https://httpbin.org/get', timeout=timeout)

# Shorthand (applies 10.0 to all four phases)
timeout = httpx.Timeout(10.0)

# On a Client
with httpx.Client(timeout=timeout) as client:
    response = client.get('https://httpbin.org/get')

Proxies

HTTP proxies

Note: the proxies argument from older releases was removed in httpx 0.28; use proxy for a single proxy, or mounts for per-scheme routing.

import httpx

# A single proxy for all traffic
response = httpx.get('https://httpbin.org/ip', proxy='http://127.0.0.1:7890')

# An authenticated proxy
response = httpx.get('https://httpbin.org/ip', proxy='http://user:[email protected]:8080')

# Set on a Client
with httpx.Client(proxy='http://127.0.0.1:7890') as client:
    response = client.get('https://httpbin.org/ip')

# Per-scheme routing with mounts
mounts = {
    'http://': httpx.HTTPTransport(proxy='http://127.0.0.1:7890'),
    'https://': httpx.HTTPTransport(proxy='http://127.0.0.1:7890'),
}
with httpx.Client(mounts=mounts) as client:
    response = client.get('https://httpbin.org/ip')

SOCKS proxies

Requires httpx[socks].

import httpx

# SOCKS5 proxy
response = httpx.get('https://httpbin.org/ip', proxy='socks5://127.0.0.1:1080')

# Authenticated SOCKS proxy
with httpx.Client(proxy='socks5://user:[email protected]:1080') as client:
    response = client.get('https://httpbin.org/ip')

Proxy pools

import httpx
import random

proxies_list = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def get_random_proxy():
    return random.choice(proxies_list)

# Use a different proxy for each request
urls = ['https://httpbin.org/ip']  # your URL list
for url in urls:
    proxy = get_random_proxy()
    response = httpx.get(url, proxy=proxy)

Streaming responses

For large downloads, streaming avoids loading the whole body into memory:

Sync streaming

import httpx

# Stream a file download
with httpx.stream('GET', 'https://example.com/large-file.zip') as response:
    with open('large-file.zip', 'wb') as f:
        for chunk in response.iter_bytes(chunk_size=8192):
            f.write(chunk)

# Stream text line by line
with httpx.stream('GET', 'https://example.com/large-text.txt') as response:
    for line in response.iter_lines():
        print(line)

# Conditional reads
with httpx.stream('GET', 'https://example.com/file') as response:
    content_length = int(response.headers.get('content-length', 0))
    if content_length < 10_000_000: # under 10 MB
        response.read()
        print(response.text)
    else:
        # Stream the large file
        for chunk in response.iter_bytes():
            process(chunk)  # handle the chunk (process() is a placeholder)

Async streaming

import httpx
import asyncio

async def download_file(url, filename):
    async with httpx.AsyncClient() as client:
        async with client.stream('GET', url) as response:
            with open(filename, 'wb') as f:
                async for chunk in response.aiter_bytes(chunk_size=8192):
                    f.write(chunk)

async def main():
    await download_file(
        'https://example.com/large-file.zip',
        'large-file.zip'
    )

asyncio.run(main())

Exception handling

httpx defines a clear exception hierarchy:

import httpx

try:
    response = httpx.get('https://example.com/api', timeout=5.0)
    response.raise_for_status()

except httpx.TimeoutException as e:
    print(f'request timed out: {e}')

except httpx.ConnectError as e:
    print(f'connection failed: {e}')

except httpx.ReadError as e:
    print(f'read error: {e}')

except httpx.WriteError as e:
    print(f'write error: {e}')

except httpx.HTTPStatusError as e:
    print(f'HTTP status error: {e.response.status_code}')
    print(f'request URL: {e.request.url}')

except httpx.RequestError as e:
    # Base class for all errors raised while making a request
    print(f'request error: {e}')

except httpx.HTTPError as e:
    # Base class for both RequestError and HTTPStatusError
    print(f'HTTP error: {e}')

Redirects

import httpx

# Redirects are not followed by default
response = httpx.get('http://github.com')
print(response.status_code) # 301
print(response.next_request) # the next request that would be sent

# Enable redirect following
response = httpx.get('http://github.com', follow_redirects=True)
print(response.status_code) # 200
print(response.url) # https://github.com
print(response.history) # [Response(301)]

# Set on a Client
with httpx.Client(follow_redirects=True, max_redirects=10) as client:
    response = client.get('http://github.com')

Authentication

Basic auth

import httpx

# As a tuple
response = httpx.get('https://api.example.com', auth=('username', 'password'))

# As an httpx.BasicAuth object
auth = httpx.BasicAuth('username', 'password')
response = httpx.get('https://api.example.com', auth=auth)

Digest auth

import httpx

auth = httpx.DigestAuth('username', 'password')
response = httpx.get('https://api.example.com', auth=auth)

Bearer tokens

import httpx

headers = {'Authorization': 'Bearer your_token_here'}
response = httpx.get('https://api.example.com', headers=headers)

# Or pass a callable as the auth argument
class BearerAuth:
    def __init__(self, token):
        self.token = token

    def __call__(self, request):
        request.headers['Authorization'] = f'Bearer {self.token}'
        return request

response = httpx.get('https://api.example.com', auth=BearerAuth('your_token'))

HTTP/2 support

httpx can speak HTTP/2; this requires installing httpx[http2].

import httpx

# Enable HTTP/2
with httpx.Client(http2=True) as client:
    response = client.get('https://nghttp2.org/httpbin/get')
    print(response.http_version) # HTTP/2

# Async client (inside an async function)
async with httpx.AsyncClient(http2=True) as client:
    response = await client.get('https://nghttp2.org/httpbin/get')
    print(response.http_version) # HTTP/2

Advanced usage

Retries

import httpx

# Connection retries via a Transport
transport = httpx.HTTPTransport(retries=3)

with httpx.Client(transport=transport) as client:
    response = client.get('https://httpbin.org/get')

# Async client (inside an async function)
transport = httpx.AsyncHTTPTransport(retries=3)

async with httpx.AsyncClient(transport=transport) as client:
    response = await client.get('https://httpbin.org/get')

Custom Transports

import httpx

# Custom connection-pool limits
limits = httpx.Limits(
    max_keepalive_connections=20, # number of keep-alive connections
    max_connections=100, # maximum total connections
    keepalive_expiry=30.0 # keep-alive lifetime in seconds
)

with httpx.Client(limits=limits) as client:
    response = client.get('https://httpbin.org/get')

# Custom Transport
transport = httpx.HTTPTransport(
    retries=2,
    http2=True
)

with httpx.Client(transport=transport) as client:
    response = client.get('https://httpbin.org/get')

Event hooks

import httpx

def log_request(request):
    print(f'request: {request.method} {request.url}')

def log_response(response):
    print(f'response: {response.status_code}')

with httpx.Client(
    event_hooks={
        'request': [log_request],
        'response': [log_response]
    }
) as client:
    response = client.get('https://httpbin.org/get')

A complete crawler example

Sync crawler

import httpx
from bs4 import BeautifulSoup
import time
import random

class HttpxSpider:
    def __init__(self):
        self.client = httpx.Client(
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            },
            timeout=30.0,
            follow_redirects=True
        )

    def get(self, url, **kwargs):
        """Send a GET request, with a polite random delay."""
        time.sleep(random.uniform(0.5, 1.5))
        response = self.client.get(url, **kwargs)
        response.raise_for_status()
        return response

    def parse_html(self, response):
        """Parse HTML into a BeautifulSoup tree."""
        soup = BeautifulSoup(response.text, 'lxml')
        return soup

    def extract_links(self, soup):
        """Extract all links from the page."""
        links = []
        for a in soup.find_all('a', href=True):
            links.append(a['href'])
        return links

    def close(self):
        """Close the client."""
        self.client.close()

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()


# Usage
with HttpxSpider() as spider:
    response = spider.get('https://httpbin.org/links/10/0')
    soup = spider.parse_html(response)
    links = spider.extract_links(soup)
    print(f'found {len(links)} links')

Async crawler

import httpx
import asyncio
from bs4 import BeautifulSoup
import random

class AsyncHttpxSpider:
    def __init__(self, max_concurrent=10, timeout=30.0):
        self.max_concurrent = max_concurrent
        self.timeout = httpx.Timeout(timeout)
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.client = None

    async def __aenter__(self):
        self.client = httpx.AsyncClient(
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            },
            timeout=self.timeout,
            follow_redirects=True,
            http2=True
        )
        return self

    async def __aexit__(self, *args):
        await self.client.aclose()

    async def fetch(self, url, **kwargs):
        """Fetch a single page, bounded by the semaphore."""
        async with self.semaphore:
            await asyncio.sleep(random.uniform(0.1, 0.5))
            response = await self.client.get(url, **kwargs)
            response.raise_for_status()
            return response

    async def fetch_all(self, urls):
        """Fetch several pages concurrently."""
        tasks = [self.fetch(url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

    def parse_html(self, response):
        """Parse HTML into a BeautifulSoup tree."""
        return BeautifulSoup(response.text, 'lxml')


# Usage
async def main():
    urls = [f'https://httpbin.org/delay/1?id={i}' for i in range(20)]

    async with AsyncHttpxSpider(max_concurrent=5) as spider:
        import time
        start = time.time()

        results = await spider.fetch_all(urls)

        success = sum(1 for r in results if not isinstance(r, Exception))
        print(f'done: {success}/{len(urls)} requests')
        print(f'elapsed: {time.time() - start:.2f}s')

asyncio.run(main())

Migrating from requests

httpx's API is highly compatible with requests; in most cases a simple swap is enough:

# requests
import requests
response = requests.get('https://httpbin.org/get')
data = response.json()

# httpx
import httpx
response = httpx.get('https://httpbin.org/get')
data = response.json()

Key differences

| Feature           | requests          | httpx             |
|-------------------|-------------------|-------------------|
| Default timeout   | none              | 5 seconds         |
| Async support     | ❌                | ✅ native         |
| HTTP/2            | ❌                | ✅                |
| Response encoding | response.encoding | response.encoding |
| JSON parsing      | response.json()   | response.json()   |
| Binary content    | response.content  | response.content  |
| Text content      | response.text     | response.text     |

Migration notes

  1. Timeouts: httpx defaults to a 5-second timeout, which you may need to adjust
  2. Redirects: httpx does not follow redirects by default; set follow_redirects=True explicitly
  3. Client usage: prefer a Client for connection-pool management
  4. Exception types: the exception class names differ slightly

Summary

In this chapter we covered:

  1. Introduction to httpx - a modern HTTP client with sync/async and HTTP/2 support
  2. Installation - the base package and optional extras
  3. Basic usage - GET/POST and the other request methods
  4. Request parameters - params, headers, cookies, and request bodies
  5. Response handling - status codes, content, JSON parsing
  6. Client sessions - connection reuse and shared configuration
  7. Timeouts - the default timeout and fine-grained control
  8. Proxies - HTTP and SOCKS proxies
  9. Streaming responses - large file downloads
  10. Exception handling - a clear exception hierarchy
  11. HTTP/2 support - enabling the HTTP/2 protocol

Exercises

  1. Build a crawler with retry support using httpx
  2. Crawl several pages concurrently with AsyncClient
  3. Download a large file with a streaming response
  4. Migrate a piece of requests code to httpx