# A Deep Dive into httpx

httpx is a modern Python HTTP client library. It retains the clean API of requests while adding native async support and HTTP/2 capability. For crawler projects that need high performance, httpx is an excellent alternative to requests.

> This tutorial is based on the official httpx documentation.
> Current version: v0.28.1 | Requires Python 3.9+
## Why httpx?

### httpx vs requests

| Feature | httpx | requests |
|---|---|---|
| Sync API | ✅ | ✅ |
| Async API | ✅ native | ❌ needs third-party libraries |
| HTTP/1.1 | ✅ | ✅ |
| HTTP/2 | ✅ | ❌ |
| Connection pooling | ✅ | ✅ |
| Default timeout | 5 seconds | none |
| Type annotations | ✅ complete | ❌ |
| WebSocket | ❌ (third-party httpx-ws) | ❌ |
### Use Cases

- Async crawlers: scenarios that need highly concurrent requests
- HTTP/2 support: sites that require the HTTP/2 protocol
- Modern API clients: type hints and better IDE support
- ASGI/WSGI testing: calling Python web applications directly, without a server
- Streaming scenarios: large downloads, Server-Sent Events
## Installation

```bash
# Basic installation
pip install httpx

# With HTTP/2 support
pip install 'httpx[http2]'

# With SOCKS proxy support
pip install 'httpx[socks]'

# With the command-line tool
pip install 'httpx[cli]'

# With all optional dependencies
pip install 'httpx[http2,socks,brotli,zstd,cli]'
```
### Command-Line Tool

After installing the cli extra, you can use httpx directly from the command line:

```bash
# Send a request
httpx https://httpbin.org/get

# POST request (the method is passed with -m, form fields with -d)
httpx -m POST https://httpbin.org/post -d key value

# Show help
httpx --help
```
## Quick Start

### Synchronous Requests

The synchronous API of httpx is highly compatible with requests:

```python
import httpx

# GET request
response = httpx.get('https://httpbin.org/get')
print(response.status_code)  # 200
print(response.text)         # response body as text
print(response.json())       # parsed JSON

# POST request
response = httpx.post(
    'https://httpbin.org/post',
    data={'key': 'value'}
)

# Other request methods
response = httpx.put('https://httpbin.org/put', data={'key': 'value'})
response = httpx.delete('https://httpbin.org/delete')
response = httpx.head('https://httpbin.org/get')
response = httpx.options('https://httpbin.org/get')
response = httpx.patch('https://httpbin.org/patch', data={'key': 'value'})
```
### Asynchronous Requests

httpx supports async natively; this is its main advantage over requests:

```python
import httpx
import asyncio

async def fetch():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.status_code)
        print(response.json())

asyncio.run(fetch())
```
## Request Parameters

### URL Parameters

Use the params argument to add query parameters:

```python
import httpx

# As a dict
params = {'key1': 'value1', 'key2': 'value2'}
response = httpx.get('https://httpbin.org/get', params=params)
print(response.url)  # https://httpbin.org/get?key1=value1&key2=value2

# As a dict of lists (multi-value parameters)
params = {'key': ['value1', 'value2']}
response = httpx.get('https://httpbin.org/get', params=params)
print(response.url)  # https://httpbin.org/get?key=value1&key=value2

# As a list of tuples
params = [('key1', 'value1'), ('key2', 'value2')]
response = httpx.get('https://httpbin.org/get', params=params)
```
### Request Headers

Use the headers argument to set request headers:

```python
import httpx

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json',
    'Authorization': 'Bearer token123',
}
response = httpx.get('https://httpbin.org/headers', headers=headers)
print(response.json())
```
### Cookies

Use the cookies argument to send cookies:

```python
import httpx

# As a dict
cookies = {'session_id': 'abc123', 'user_token': 'xyz789'}
response = httpx.get('https://httpbin.org/cookies', cookies=cookies)
print(response.json())

# As an httpx.Cookies object
cookies = httpx.Cookies()
cookies.set('cookie_name', 'cookie_value', domain='httpbin.org')
response = httpx.get('https://httpbin.org/cookies', cookies=cookies)

# Read cookies from the response
response = httpx.get('https://httpbin.org/cookies/set?name=value')
print(response.cookies['name'])  # value
```
## Request Body

### Form Data

```python
import httpx

# Form-encoded data
data = {'username': 'admin', 'password': '123456'}
response = httpx.post('https://httpbin.org/post', data=data)
print(response.json()['form'])

# Multi-value form fields
data = {'items': ['item1', 'item2', 'item3']}
response = httpx.post('https://httpbin.org/post', data=data)
```

### JSON Data

```python
import httpx

# JSON body (Content-Type is set automatically)
json_data = {'name': 'Zhang San', 'age': 25, 'hobbies': ['Python', 'crawling']}
response = httpx.post('https://httpbin.org/post', json=json_data)
print(response.json()['json'])
```

### File Uploads

```python
import httpx

# Upload a file
with open('report.pdf', 'rb') as f:
    files = {'file': f}
    response = httpx.post('https://httpbin.org/post', files=files)

# Specify the filename and content type
with open('report.pdf', 'rb') as f:
    files = {'file': ('report.pdf', f, 'application/pdf')}
    response = httpx.post('https://httpbin.org/post', files=files)

# Upload multiple files (remember to close these handles in real code)
files = [
    ('files', ('file1.txt', open('file1.txt', 'rb'))),
    ('files', ('file2.txt', open('file2.txt', 'rb'))),
]
response = httpx.post('https://httpbin.org/post', files=files)

# Mix files and form fields
data = {'description': 'file description'}
with open('report.pdf', 'rb') as f:
    files = {'file': f}
    response = httpx.post('https://httpbin.org/post', data=data, files=files)
```

### Binary Data

```python
import httpx

# Send raw binary data
content = b'Hello, World!'
response = httpx.post(
    'https://httpbin.org/post',
    content=content,
    headers={'Content-Type': 'application/octet-stream'}
)
```
## Response Handling

### Response Attributes

```python
import httpx

response = httpx.get('https://httpbin.org/get')

# Status code
print(response.status_code)    # 200
print(response.reason_phrase)  # 'OK'

# Status checks
print(response.is_success)       # 2xx
print(response.is_redirect)      # 3xx
print(response.is_client_error)  # 4xx
print(response.is_server_error)  # 5xx

# Headers
print(response.headers)
print(response.headers['content-type'])
print(response.headers.get('Content-Type'))

# URL information
print(response.url)      # final URL
print(response.request)  # the request object

# Encoding
print(response.encoding)     # detected encoding
response.encoding = 'utf-8'  # set the encoding manually
```

### Response Content

```python
import httpx

response = httpx.get('https://httpbin.org/get')

# Text (decoded automatically)
print(response.text)

# Raw bytes
print(response.content)

# Parsed JSON
print(response.json())

# Streaming reads (large files)
with httpx.stream('GET', 'https://example.com/large-file') as response:
    for chunk in response.iter_bytes(chunk_size=8192):
        process(chunk)  # process() stands in for your own chunk handler
```

### Status Codes

```python
import httpx

# Check the status code
response = httpx.get('https://httpbin.org/get')
if response.status_code == httpx.codes.OK:
    print('request succeeded')

# Raise on non-2xx status codes
try:
    response = httpx.get('https://httpbin.org/status/404')
    response.raise_for_status()
except httpx.HTTPStatusError as e:
    print(f'HTTP error: {e.response.status_code}')
    print(f'request URL: {e.request.url}')

# Chained calls (raise_for_status() returns the response)
data = httpx.get('https://api.example.com/data').raise_for_status().json()
```
## Client Sessions

A Client gives you connection reuse, Cookie persistence, and shared default configuration.

### Synchronous Client

```python
import httpx

# As a context manager (recommended)
with httpx.Client() as client:
    # All requests share the connection pool and Cookie jar
    response = client.get('https://httpbin.org/get')
    response = client.post('https://httpbin.org/post', json={'key': 'value'})

# Manual management
client = httpx.Client()
try:
    response = client.get('https://httpbin.org/get')
finally:
    client.close()
```
### Configuring a Client

```python
import httpx

# Common configuration options
client = httpx.Client(
    base_url='https://api.example.com',   # base URL
    headers={'User-Agent': 'MyBot/1.0'},  # default headers
    cookies={'session': 'abc123'},        # default cookies
    timeout=30.0,                         # timeout in seconds
    follow_redirects=True,                # follow redirects
    max_redirects=20,                     # maximum number of redirects
    http2=True,                           # enable HTTP/2
    verify=True,                          # SSL verification
)

with client:
    # Paths are resolved against base_url
    response = client.get('/users')  # https://api.example.com/users
    response = client.post('/login', json={'user': 'admin', 'pass': '123'})
```
### Asynchronous Client

```python
import httpx
import asyncio

async def main():
    # Basic async client
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.json())

    # Async client with configuration
    async with httpx.AsyncClient(
        base_url='https://api.example.com',
        timeout=30.0,
        http2=True
    ) as client:
        # Concurrent requests
        tasks = [
            client.get(f'/users/{i}')
            for i in range(10)
        ]
        responses = await asyncio.gather(*tasks)
        for response in responses:
            print(response.json())

asyncio.run(main())
```
## Timeouts

httpx applies a 5-second timeout by default, an important difference from requests.

### Basic Timeouts

```python
import httpx

# Per-request timeout
response = httpx.get('https://httpbin.org/delay/1', timeout=10.0)

# Disable the timeout
response = httpx.get('https://httpbin.org/delay/1', timeout=None)

# Set it on the Client
with httpx.Client(timeout=30.0) as client:
    response = client.get('https://httpbin.org/get')
```

### Fine-Grained Timeouts

```python
import httpx

# Separate timeouts for each phase
timeout = httpx.Timeout(
    connect=5.0,  # connection timeout
    read=10.0,    # read timeout
    write=10.0,   # write timeout
    pool=5.0      # wait for a connection from the pool
)
response = httpx.get('https://httpbin.org/get', timeout=timeout)

# Shorthand (applies the value to connect, read, write, and pool)
timeout = httpx.Timeout(10.0)

# Use it on a Client
with httpx.Client(timeout=timeout) as client:
    response = client.get('https://httpbin.org/get')
```
## Proxies

As of httpx 0.28 the old `proxies` argument has been removed; pass `proxy=` for a single proxy, or `mounts=` for per-scheme routing.

### HTTP Proxies

```python
import httpx

# Single proxy for all traffic
response = httpx.get('https://httpbin.org/ip', proxy='http://127.0.0.1:7890')

# Proxy with authentication
response = httpx.get('https://httpbin.org/ip', proxy='http://user:[email protected]:8080')

# Per-scheme routing via mounts
mounts = {
    'http://': httpx.HTTPTransport(proxy='http://127.0.0.1:7890'),
    'https://': httpx.HTTPTransport(proxy='http://127.0.0.1:7890'),
}
with httpx.Client(mounts=mounts) as client:
    response = client.get('https://httpbin.org/ip')

# Or set a single proxy on the Client
with httpx.Client(proxy='http://127.0.0.1:7890') as client:
    response = client.get('https://httpbin.org/ip')
```
### SOCKS Proxies

Requires httpx[socks]:

```python
import httpx

# SOCKS5 proxy
response = httpx.get('https://httpbin.org/ip', proxy='socks5://127.0.0.1:1080')

# SOCKS5 proxy with authentication
response = httpx.get('https://httpbin.org/ip', proxy='socks5://user:[email protected]:1080')
```
### Proxy Pools

```python
import httpx
import random

proxies_list = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def get_random_proxy():
    return random.choice(proxies_list)

# Use a different proxy for each request
# (urls is assumed to be an iterable of target URLs)
for url in urls:
    proxy = get_random_proxy()
    response = httpx.get(url, proxy=proxy)
```
## Streaming Responses

For large downloads, stream the response to avoid loading everything into memory.

### Synchronous Streaming

```python
import httpx

# Stream a file download to disk
with httpx.stream('GET', 'https://example.com/large-file.zip') as response:
    with open('large-file.zip', 'wb') as f:
        for chunk in response.iter_bytes(chunk_size=8192):
            f.write(chunk)

# Stream text line by line
with httpx.stream('GET', 'https://example.com/large-text.txt') as response:
    for line in response.iter_lines():
        print(line)

# Conditional reading
with httpx.stream('GET', 'https://example.com/file') as response:
    content_length = int(response.headers.get('content-length', 0))
    if content_length < 10_000_000:  # smaller than 10 MB
        response.read()
        print(response.text)
    else:
        # Stream the large file instead
        for chunk in response.iter_bytes():
            process(chunk)  # process() stands in for your own chunk handler
```
### Asynchronous Streaming

```python
import httpx
import asyncio

async def download_file(url, filename):
    async with httpx.AsyncClient() as client:
        async with client.stream('GET', url) as response:
            with open(filename, 'wb') as f:
                async for chunk in response.aiter_bytes(chunk_size=8192):
                    f.write(chunk)

async def main():
    await download_file(
        'https://example.com/large-file.zip',
        'large-file.zip'
    )

asyncio.run(main())
```
## Exception Handling

httpx defines a clean exception hierarchy:

```python
import httpx

try:
    response = httpx.get('https://example.com/api', timeout=5.0)
    response.raise_for_status()
except httpx.TimeoutException as e:
    print(f'request timed out: {e}')
except httpx.ConnectError as e:
    print(f'connection failed: {e}')
except httpx.ReadError as e:
    print(f'read error: {e}')
except httpx.WriteError as e:
    print(f'write error: {e}')
except httpx.HTTPStatusError as e:
    print(f'HTTP status error: {e.response.status_code}')
    print(f'request URL: {e.request.url}')
except httpx.RequestError as e:
    # Base class for all errors raised while issuing a request
    print(f'request error: {e}')
except httpx.HTTPError as e:
    # Base class for both RequestError and HTTPStatusError
    print(f'HTTP error: {e}')
```
## Redirects

```python
import httpx

# Redirects are not followed by default
response = httpx.get('http://github.com')
print(response.status_code)   # 301
print(response.next_request)  # the request that would follow the redirect

# Enable redirect following
response = httpx.get('http://github.com', follow_redirects=True)
print(response.status_code)  # 200
print(response.url)          # https://github.com
print(response.history)      # [<Response [301 Moved Permanently]>]

# Set it on the Client
with httpx.Client(follow_redirects=True, max_redirects=10) as client:
    response = client.get('http://github.com')
```
## Authentication

### Basic Auth

```python
import httpx

# As a tuple
response = httpx.get('https://api.example.com', auth=('username', 'password'))

# As an httpx.BasicAuth object
auth = httpx.BasicAuth('username', 'password')
response = httpx.get('https://api.example.com', auth=auth)
```

### Digest Auth

```python
import httpx

auth = httpx.DigestAuth('username', 'password')
response = httpx.get('https://api.example.com', auth=auth)
```

### Bearer Token

```python
import httpx

headers = {'Authorization': 'Bearer your_token_here'}
response = httpx.get('https://api.example.com', headers=headers)

# Or pass a callable as the auth argument
class BearerAuth:
    def __init__(self, token):
        self.token = token

    def __call__(self, request):
        request.headers['Authorization'] = f'Bearer {self.token}'
        return request

response = httpx.get('https://api.example.com', auth=BearerAuth('your_token'))
```
## HTTP/2 Support

httpx supports the HTTP/2 protocol; install httpx[http2] first:

```python
import httpx
import asyncio

# Sync client with HTTP/2
with httpx.Client(http2=True) as client:
    response = client.get('https://nghttp2.org/httpbin/get')
    print(response.http_version)  # HTTP/2

# Async client with HTTP/2
async def main():
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get('https://nghttp2.org/httpbin/get')
        print(response.http_version)  # HTTP/2

asyncio.run(main())
```
## Advanced Usage

### Retries

The transport-level retries option retries failed connection attempts; it does not retry on HTTP error status codes:

```python
import httpx
import asyncio

# Connection retries via the transport
transport = httpx.HTTPTransport(retries=3)
with httpx.Client(transport=transport) as client:
    response = client.get('https://httpbin.org/get')

# Async client
async def main():
    transport = httpx.AsyncHTTPTransport(retries=3)
    async with httpx.AsyncClient(transport=transport) as client:
        response = await client.get('https://httpbin.org/get')

asyncio.run(main())
```
### Connection Limits and Custom Transports

```python
import httpx

# Tune the connection pool
limits = httpx.Limits(
    max_keepalive_connections=20,  # idle keep-alive connections
    max_connections=100,           # total connections
    keepalive_expiry=30.0          # keep-alive lifetime in seconds
)
with httpx.Client(limits=limits) as client:
    response = client.get('https://httpbin.org/get')

# Custom transport
transport = httpx.HTTPTransport(
    retries=2,
    http2=True
)
with httpx.Client(transport=transport) as client:
    response = client.get('https://httpbin.org/get')
```
### Event Hooks

```python
import httpx

def log_request(request):
    print(f'request: {request.method} {request.url}')

def log_response(response):
    print(f'response: {response.status_code}')

with httpx.Client(
    event_hooks={
        'request': [log_request],
        'response': [log_response]
    }
) as client:
    response = client.get('https://httpbin.org/get')
```
## Complete Crawler Examples

### Synchronous Crawler

```python
import httpx
from bs4 import BeautifulSoup
import time
import random

class HttpxSpider:
    def __init__(self):
        self.client = httpx.Client(
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            },
            timeout=30.0,
            follow_redirects=True
        )

    def get(self, url, **kwargs):
        """Send a GET request with a polite random delay."""
        time.sleep(random.uniform(0.5, 1.5))
        response = self.client.get(url, **kwargs)
        response.raise_for_status()
        return response

    def parse_html(self, response):
        """Parse the HTML body."""
        soup = BeautifulSoup(response.text, 'lxml')
        return soup

    def extract_links(self, soup):
        """Extract all links from the page."""
        links = []
        for a in soup.find_all('a', href=True):
            links.append(a['href'])
        return links

    def close(self):
        """Close the client."""
        self.client.close()

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()

# Usage
with HttpxSpider() as spider:
    response = spider.get('https://httpbin.org/links/10/0')
    soup = spider.parse_html(response)
    links = spider.extract_links(soup)
    print(f'found {len(links)} links')
```
### Asynchronous Crawler

```python
import httpx
import asyncio
import random
import time
from bs4 import BeautifulSoup

class AsyncHttpxSpider:
    def __init__(self, max_concurrent=10, timeout=30.0):
        self.max_concurrent = max_concurrent
        self.timeout = httpx.Timeout(timeout)
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.client = None

    async def __aenter__(self):
        self.client = httpx.AsyncClient(
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            },
            timeout=self.timeout,
            follow_redirects=True,
            http2=True
        )
        return self

    async def __aexit__(self, *args):
        await self.client.aclose()

    async def fetch(self, url, **kwargs):
        """Fetch a single page, bounded by the semaphore."""
        async with self.semaphore:
            await asyncio.sleep(random.uniform(0.1, 0.5))
            response = await self.client.get(url, **kwargs)
            response.raise_for_status()
            return response

    async def fetch_all(self, urls):
        """Fetch multiple pages concurrently."""
        tasks = [self.fetch(url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

    def parse_html(self, response):
        """Parse the HTML body."""
        return BeautifulSoup(response.text, 'lxml')

# Usage
async def main():
    urls = [f'https://httpbin.org/delay/1?id={i}' for i in range(20)]
    async with AsyncHttpxSpider(max_concurrent=5) as spider:
        start = time.time()
        results = await spider.fetch_all(urls)
        success = sum(1 for r in results if not isinstance(r, Exception))
        print(f'done: {success}/{len(urls)} requests')
        print(f'elapsed: {time.time() - start:.2f}s')

asyncio.run(main())
```
## Migrating from requests

The httpx API is highly compatible with requests; in most cases a simple swap is enough:

```python
# requests
import requests
response = requests.get('https://httpbin.org/get')
data = response.json()

# httpx
import httpx
response = httpx.get('https://httpbin.org/get')
data = response.json()
```

### Key Differences

| Feature | requests | httpx |
|---|---|---|
| Default timeout | none | 5 seconds |
| Async support | none | native |
| HTTP/2 | unsupported | supported |
| Response encoding | response.encoding | response.encoding |
| JSON parsing | response.json() | response.json() |
| Binary content | response.content | response.content |
| Text content | response.text | response.text |
### Migration Notes

- Timeouts: httpx defaults to a 5-second timeout, which you may need to adjust
- Redirects: httpx does not follow redirects by default; set follow_redirects=True explicitly
- Client usage: prefer a Client for connection-pool management
- Exception types: exception class names differ slightly
## Summary

In this chapter we covered:

- Introduction to httpx - a modern HTTP client with sync/async APIs and HTTP/2
- Installation - basic install and optional extras
- Basic usage - GET/POST and the other request methods
- Request parameters - params, headers, cookies, request bodies
- Response handling - status codes, content, JSON parsing
- Client sessions - connection reuse and shared configuration
- Timeouts - the default timeout and fine-grained control
- Proxies - HTTP/SOCKS proxies
- Streaming responses - large file downloads
- Exception handling - a clean exception hierarchy
- HTTP/2 support - enabling the HTTP/2 protocol
## Exercises

- Implement a crawler with retry support using httpx
- Crawl multiple pages concurrently with AsyncClient
- Download a large file with a streaming response
- Migrate a piece of requests code to httpx