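Python Inference (ONNX Runtime)

ONNX Runtime is a high-performance inference engine maintained by Microsoft and arguably the most mature option for running ONNX models today. This chapter covers model inference with the ONNX Runtime Python API: basic usage, advanced features, and performance tuning.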

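Why ONNX Runtime?

Among the available inference engines, ONNX Runtime has several notable advantages:

Cross-platform compatibility: Windows, Linux, macOS, Android, iOS, and even WebAssembly. Export once, run anywhere.

Multi-hardware support: the Execution Provider mechanism lets you switch between hardware backends such as CPU, CUDA, TensorRT, OpenVINO, and DirectML.

Lightweight deployment: neither PyTorch nor TensorFlow needs to be installed, so the inference environment can be kept within a few hundred MB.

Continuous optimization: maintained by Microsoft, deeply integrated with Azure cloud services, and actively updated.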

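Installation

# CPU build
pip install onnxruntime

# GPU build (requires a CUDA environment)
pip install onnxruntime-gpu

# Note: do not install both packages in the same environment; the GPU build already includes CPU support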

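Verifying the Installation and Hardware Support

import onnxruntime as ort

print(f"ONNX Runtime version: {ort.__version__}")

# List all available execution providers
providers = ort.get_available_providers()
print(f"Available execution providers: {providers}")

# Typical output on a GPU build (ordered from highest to lowest priority):
# ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']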

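Basic Inference Workflow

A Minimal Inference Example

import onnxruntime as ort
import numpy as np

# 1. Create the inference session
#    (passing providers explicitly is recommended; some GPU-enabled builds require it)
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# 2. Look up the input and output names
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# 3. Prepare the input data
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# 4. Run inference
outputs = session.run([output_name], {input_name: input_data})

# 5. Read the result
result = outputs[0]
print(f"Output shape: {result.shape}")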

Inspecting Model Information

Before running inference, it is important to know what inputs and outputs the model expects:

session = ort.InferenceSession("model.onnx")

print("=" * 50)
print("Model inputs:")
for inp in session.get_inputs():
    print(f"  Name: {inp.name}")
    print(f"  Shape: {inp.shape}")
    print(f"  Type: {inp.type}")
    print()

print("Model outputs:")
for out in session.get_outputs():
    print(f"  Name: {out.name}")
    print(f"  Shape: {out.shape}")
    print(f"  Type: {out.type}")
print("=" * 50)

Example output:

==================================================
Model inputs:
  Name: input
  Shape: ['batch_size', 3, 224, 224]
  Type: tensor(float)

Model outputs:
  Name: output
  Shape: ['batch_size', 1000]
  Type: tensor(float)
==================================================

Note that a shape such as ['batch_size', 3, 224, 224] means batch_size is a dynamic dimension: symbolic dimensions are reported as strings (or None) instead of integers.
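The following is a minimal sketch of how a declared shape with dynamic dimensions could be compared against a concrete input shape. The helper name `shape_matches` is hypothetical (it is not part of ONNX Runtime), and it assumes that any non-integer entry in the declared shape is dynamic.

```python
from typing import Sequence, Union

Dim = Union[int, str, None]


def shape_matches(declared: Sequence[Dim], actual: Sequence[int]) -> bool:
    """Return True if a concrete shape is compatible with a declared shape.

    Integer dimensions must match exactly; string or None entries are
    treated as dynamic and accept any size.
    """
    if len(declared) != len(actual):
        return False
    for want, got in zip(declared, actual):
        if isinstance(want, int) and want != got:
            return False
    return True


declared = ["batch_size", 3, 224, 224]
print(shape_matches(declared, (8, 3, 224, 224)))  # True: the batch dimension is free
print(shape_matches(declared, (1, 4, 224, 224)))  # False: the channel count differs
```

In practice, `declared` would come from `session.get_inputs()[0].shape` and `actual` from the input array's `.shape`.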

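Choosing an Execution Provider

The Execution Provider determines which hardware inference runs on. Choosing the right one is critical for performance.

CPU Inference

# Request the CPU provider explicitly
session = ort.InferenceSession(
    "model.onnx",
    providers=['CPUExecutionProvider']
)

CPU inference is a good fit for:

Environments without a GPU
Small-batch inference
Workloads without strict latency requirements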

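CUDA GPU Inference

# Prefer CUDA, fall back to CPU
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Check which providers the session actually uses
print(f"Active providers: {session.get_providers()}")
# ['CUDAExecutionProvider', 'CPUExecutionProvider'] means CUDA is in use;
# if only 'CPUExecutionProvider' is listed, the session fell back to CPU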

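TensorRT Acceleration

On NVIDIA GPUs, TensorRT can deliver even higher performance:

session = ort.InferenceSession(
    "model.onnx",
    providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
)

TensorRT is a good fit for:

Workloads that need the lowest possible latency
Models with fixed input shapes
Models that support INT8/FP16 quantization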

Selecting an Execution Provider Automatically

import onnxruntime as ort

def create_session(model_path, prefer_gpu=True):
    """Create an inference session on the best available provider."""
    available = ort.get_available_providers()

    if prefer_gpu and 'CUDAExecutionProvider' in available:
        providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
        print("Using GPU inference")
    else:
        providers = ['CPUExecutionProvider']
        print("Using CPU inference")

    return ort.InferenceSession(model_path, providers=providers)

session = create_session("model.onnx")
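The same idea can be generalized to a preference list that also covers TensorRT. The sketch below is a pure function so that it can be tried without a GPU; `pick_providers` and `DEFAULT_PREFERENCE` are hypothetical names chosen for illustration, and the preference order is an assumption to adapt to your deployment.

```python
from typing import Iterable, List, Sequence

# Preference order used for illustration; adjust it to your deployment.
DEFAULT_PREFERENCE = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(
    available: Iterable[str],
    preference: Sequence[str] = DEFAULT_PREFERENCE,
) -> List[str]:
    """Keep the preferred providers that are available, in preference order.

    CPUExecutionProvider is always appended as the last fallback.
    """
    available_set = set(available)
    chosen = [p for p in preference if p in available_set]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
print(pick_providers(["CPUExecutionProvider"]))
# ['CPUExecutionProvider']
```

With ONNX Runtime installed, the result of `pick_providers(ort.get_available_providers())` can be passed as the `providers` argument of `InferenceSession`.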

Models with Multiple Inputs and Outputs

Many real-world models have several inputs and outputs. They are handled as follows:

Multiple Inputs

# Assume the model has two inputs: image and metadata
session = ort.InferenceSession("multi_input_model.onnx")

# Look up the input names
input_names = [inp.name for inp in session.get_inputs()]
# ['image', 'metadata']

# Prepare the input data
image_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
metadata = np.random.randn(1, 10).astype(np.float32)

# Build the input dictionary
input_dict = {
    "image": image_data,
    "metadata": metadata
}

# Run inference
outputs = session.run(None, input_dict)  # None requests all outputs

Multiple Outputs

# Fetch all outputs
outputs = session.run(None, input_dict)

# outputs is a list ordered by the model's declared output order
output1 = outputs[0]  # first output
output2 = outputs[1]  # second output

# Alternatively, fetch only specific outputs
output_names = [out.name for out in session.get_outputs()]
# ['class_probs', 'features', 'attention_map']

# Fetch only class_probs and features
outputs = session.run(['class_probs', 'features'], input_dict)
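Indexing outputs by position is easy to get wrong once a model has several of them. One possible convenience, sketched below, is to return the results keyed by output name. `run_named` is a hypothetical helper, not an ONNX Runtime API; it only assumes that `session` offers `get_outputs()` and `run(output_names, input_feed)` like `InferenceSession` does.

```python
from typing import Any, Dict, Optional, Sequence


def run_named(
    session: Any,
    feeds: Dict[str, Any],
    output_names: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Run inference and return the results keyed by output name.

    When output_names is None, every output declared by the model is fetched.
    """
    if output_names is None:
        output_names = [out.name for out in session.get_outputs()]
    names = list(output_names)
    values = session.run(names, feeds)
    # session.run returns values in the same order as the requested names
    return dict(zip(names, values))


# Usage with a real session (not executed here):
#   results = run_named(session, input_dict, ["class_probs", "features"])
#   probs = results["class_probs"]
```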

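Performance Tuning with SessionOptions

SessionOptions gives fine-grained control over inference behavior and is the main entry point for performance tuning.

Thread Configuration

options = ort.SessionOptions()

# Number of threads used to parallelize work inside a single operator
options.intra_op_num_threads = 4

# Number of threads used to run independent operators concurrently
# (only takes effect when execution_mode is ORT_PARALLEL)
options.inter_op_num_threads = 1

# For CPU inference, intra_op_num_threads is usually set to the number of CPU cores
# (note that os.cpu_count() reports logical cores)
import os
options.intra_op_num_threads = os.cpu_count()

session = ort.InferenceSession("model.onnx", sess_options=options)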

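Graph Optimization Level

options = ort.SessionOptions()

# Graph optimization level
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Available levels:
# ORT_DISABLE_ALL     - disable all optimizations
# ORT_ENABLE_BASIC    - basic optimizations (redundant node elimination, etc.)
# ORT_ENABLE_EXTENDED - extended optimizations (operator fusion, etc.)
# ORT_ENABLE_ALL      - all optimizations (including layout optimizations)

session = ort.InferenceSession("model.onnx", sess_options=options)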

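Memory Configuration

options = ort.SessionOptions()

# Enable the memory arena (reduces memory allocation overhead)
options.enable_cpu_mem_arena = True  # default: True

# Enable memory pattern optimization (plans allocations based on observed usage patterns)
options.enable_mem_pattern = True  # default: True

# Enable memory reuse
options.enable_mem_reuse = True  # default: True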

A Complete Performance Configuration Example

import os
import onnxruntime as ort

def create_optimized_session(model_path, use_gpu=True):
    """Create an inference session with tuned options."""
    options = ort.SessionOptions()

    # Graph optimization
    options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

    # Thread configuration (for CPU inference)
    if not use_gpu:
        options.intra_op_num_threads = os.cpu_count() or 4

    # Execution mode
    options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
    # ORT_SEQUENTIAL - run operators one at a time (the default; suits low-latency use)
    # ORT_PARALLEL   - run independent operators concurrently (may help models with many parallel branches)

    # Choose providers
    if use_gpu and 'CUDAExecutionProvider' in ort.get_available_providers():
        providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    else:
        providers = ['CPUExecutionProvider']

    return ort.InferenceSession(model_path, sess_options=options, providers=providers)

session = create_optimized_session("model.onnx")

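IOBinding: Efficient GPU Inference

For GPU inference, copying data between the CPU and the GPU can become a major performance bottleneck. IOBinding lets you prepare inputs and receive outputs directly in GPU memory, avoiding unnecessary copies.

The Problem with the Conventional Approach

# Conventional approach: data is copied back and forth between CPU and GPU
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# 1. CPU -> GPU: ONNX Runtime copies the numpy array to the GPU internally
outputs = session.run(None, {"input": input_data})

# 2. GPU -> CPU: the result is copied back to the CPU
result = outputs[0]  # numpy array

# For large batches or high-frequency inference, these copies add up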
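To get a feel for how much data these copies move, the payload size can be computed from the tensor shape. This is a back-of-the-envelope sketch only: `tensor_nbytes` is a hypothetical helper, and the shapes are the ones used in this chapter's examples (float32, 4 bytes per element).

```python
import math
from typing import Sequence


def tensor_nbytes(shape: Sequence[int], itemsize: int) -> int:
    """Size in bytes of a dense tensor with the given shape and element size."""
    return math.prod(shape) * itemsize


# float32 uses 4 bytes per element
input_bytes = tensor_nbytes((1, 3, 224, 224), 4)
output_bytes = tensor_nbytes((1, 1000), 4)

print(input_bytes)               # 602112 bytes copied host-to-device per image
print(output_bytes)              # 4000 bytes copied device-to-host per image
print(32 * input_bytes / 2**20)  # 18.375 MiB of input for a batch of 32
```

The actual cost also depends on the interconnect bandwidth and on whether the copies overlap with computation, which this estimate does not model.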

Using IOBinding

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx", providers=['CUDAExecutionProvider'])

# Prepare the input data
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Create an OrtValue (this copies the data to the GPU)
input_ortvalue = ort.OrtValue.ortvalue_from_numpy(input_data, "cuda", 0)

# Create the IOBinding
io_binding = session.io_binding()

# Bind the input
io_binding.bind_input(
    name="input",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=input_data.shape,
    buffer_ptr=input_ortvalue.data_ptr()
)

# Bind the output (let ONNX Runtime allocate the output memory on the GPU)
io_binding.bind_output("output", "cuda")

# Run inference
session.run_with_iobinding(io_binding)

# Fetch the output (the data stays on the GPU)
output_ortvalue = io_binding.get_outputs()[0]
print(f"Output device: {output_ortvalue.device_name()}")  # 'cuda'

# Copy to the CPU only if needed
output_numpy = output_ortvalue.numpy()

Using IOBinding Directly with PyTorch Tensors

When inference is one stage of a larger pipeline, PyTorch tensors can be bound directly without converting them to numpy:

import torch

# The PyTorch tensor is already on the GPU
input_tensor = torch.randn(1, 3, 224, 224, device="cuda:0", dtype=torch.float32)

# Make sure the tensor is contiguous in memory
input_tensor = input_tensor.contiguous()

io_binding = session.io_binding()

# Bind the PyTorch tensor directly
io_binding.bind_input(
    name="input",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(input_tensor.shape),
    buffer_ptr=input_tensor.data_ptr()
)

# Bind the output to a pre-allocated PyTorch tensor as well
# (the shape must match the model's output; (1, 1000) is assumed here)
output_tensor = torch.empty(1, 1000, device="cuda:0", dtype=torch.float32)
io_binding.bind_output(
    name="output",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(output_tensor.shape),
    buffer_ptr=output_tensor.data_ptr()
)

# Run inference
session.run_with_iobinding(io_binding)

# output_tensor now holds the result and can be used in subsequent PyTorch operations

Asynchronous Inference

In serving scenarios that handle many requests, asynchronous inference can improve throughput.

import threading

import numpy as np
import onnxruntime as ort

class AsyncInferenceSession:
    """Asynchronous wrapper around an inference session"""

    def __init__(self, model_path):
        self.session = ort.InferenceSession(model_path)
        # session.run() is itself thread-safe; this lock only serializes calls
        # so that concurrent requests do not compete for the same compute resources
        self.lock = threading.Lock()

    def run_async(self, input_data, callback, user_data=None):
        """Run inference asynchronously.

        Args:
            input_data: dictionary of input data
            callback: callback with the signature callback(outputs, user_data, error)
            user_data: user data passed through to the callback
        """
        def _run():
            try:
                with self.lock:
                    outputs = self.session.run(None, input_data)
                callback(outputs, user_data, None)
            except Exception as e:
                callback(None, user_data, str(e))

        thread = threading.Thread(target=_run)
        thread.start()
        return thread

# Usage example
def on_result(outputs, user_data, error):
    if error:
        print(f"Inference error: {error}")
    else:
        print(f"Inference finished, output shape: {outputs[0].shape}")

async_session = AsyncInferenceSession("model.onnx")
async_session.run_async(
    {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)},
    on_result,
    user_data={"request_id": 123}
)

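ONNX Runtime also provides a native run_async method (available in recent releases):

session = ort.InferenceSession("model.onnx")

def callback(outputs, user_data, error):
    if error:
        print(f"Error: {error}")
    else:
        print(f"Result: {outputs[0].shape}")

input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Native asynchronous inference; the call returns immediately,
# so keep the process alive until the callback has run
session.run_async(
    ["output"],  # output names
    {"input": input_data},  # input data
    callback,  # callback function
    None  # user data
)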
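A thread-per-request wrapper creates an unbounded number of threads under load. One common alternative, sketched here with the standard library only, is a bounded thread pool that returns futures. `submit_inference` and `EchoSession` are hypothetical names; `EchoSession` merely stands in for `onnxruntime.InferenceSession` so that the sketch runs without a model file.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Dict


def submit_inference(
    executor: ThreadPoolExecutor, session: Any, feeds: Dict[str, Any]
) -> Future:
    """Schedule session.run(None, feeds) on the pool and return a Future."""
    return executor.submit(session.run, None, feeds)


class EchoSession:
    """Stand-in for an inference session: returns its input as the only output."""

    def run(self, output_names, input_feed):
        return [input_feed["input"]]


shared_session = EchoSession()  # one session shared by all worker threads

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [submit_inference(pool, shared_session, {"input": i}) for i in range(8)]
    # Collect results in submission order; result() blocks until each is done
    results = [f.result()[0] for f in futures]

print(results)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The pool size caps concurrency, and exceptions raised inside `run` surface when `result()` is called instead of being passed to a callback.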

Batch Inference

When there is a lot of data to process, batching can significantly improve throughput.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

# Assume there are 1000 images to process
images = np.random.randn(1000, 3, 224, 224).astype(np.float32)

# Option 1: one sample at a time (slow)
# for i in range(1000):
#     output = session.run(None, {"input": images[i:i+1]})

# Option 2: batched (fast)
batch_size = 32
outputs = []

for i in range(0, len(images), batch_size):
    batch = images[i:i+batch_size]
    batch_outputs = session.run(None, {"input": batch})
    outputs.append(batch_outputs[0])

# Merge the results
all_outputs = np.concatenate(outputs, axis=0)
print(f"Total output shape: {all_outputs.shape}")  # (1000, num_classes)
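Note that 1000 is not a multiple of 32, so the final batch in the loop above is smaller than the rest. That works when the batch dimension is dynamic, but it matters for models exported with a fixed batch size. The sketch below makes the batch boundaries explicit; `batch_slices` is a hypothetical helper name.

```python
from typing import Iterator, Tuple


def batch_slices(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index pairs that cover range(total) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


slices = list(batch_slices(1000, 32))
print(len(slices))  # 32 batches
print(slices[-1])   # (992, 1000): the last batch holds only 8 samples
```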

Profiling Inference Performance

Using the Built-in Profiler

options = ort.SessionOptions()
options.enable_profiling = True
options.profile_file_prefix = "onnx_profile"

session = ort.InferenceSession("model.onnx", sess_options=options)

# Run inference
for _ in range(10):
    session.run(None, {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)})

# Stop profiling and get the path of the profile file
profile_file = session.end_profiling()
print(f"Profile file: {profile_file}")
# The JSON file can be opened with the Chrome tracing viewer (chrome://tracing)

Measuring Latency and Throughput

import time
import numpy as np
import onnxruntime as ort

def benchmark(model_path, input_shape, num_iterations=100, warmup=10):
    """Benchmark inference performance."""
    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name

    # Prepare the input
    input_data = np.random.randn(*input_shape).astype(np.float32)

    # Warm up so that one-time initialization (memory allocation,
    # kernel setup on GPU providers) does not skew the measurement
    for _ in range(warmup):
        session.run(None, {input_name: input_data})

    # Timing
    times = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        session.run(None, {input_name: input_data})
        end = time.perf_counter()
        times.append(end - start)

    # Statistics
    times = np.array(times) * 1000  # convert to milliseconds
    print(f"Mean latency: {times.mean():.2f} ms")
    print(f"P50 latency: {np.percentile(times, 50):.2f} ms")
    print(f"P99 latency: {np.percentile(times, 99):.2f} ms")
    print(f"Throughput: {1000 / times.mean() * input_shape[0]:.2f} samples/s")

benchmark("model.onnx", (1, 3, 224, 224))
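The statistics step can be separated from the measurement so that it is easy to check on known numbers. The sketch below uses only the standard library; `percentile` and `summarize_latencies` are hypothetical helpers, and the percentile here is the nearest-rank variant, which can differ slightly from the linear interpolation that `np.percentile` uses by default.

```python
import math
from typing import Dict, Sequence


def percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_latencies(times_ms: Sequence[float], batch_size: int = 1) -> Dict[str, float]:
    """Mean, P50 and P99 latency in ms, plus throughput in samples per second."""
    if not times_ms:
        raise ValueError("times_ms must not be empty")
    ordered = sorted(times_ms)
    mean = sum(ordered) / len(ordered)
    return {
        "mean_ms": mean,
        "p50_ms": percentile(ordered, 50),
        "p99_ms": percentile(ordered, 99),
        # Each timed call processes batch_size samples
        "samples_per_s": 1000.0 / mean * batch_size,
    }


stats = summarize_latencies([10.0, 20.0, 30.0, 40.0], batch_size=2)
print(stats)
# {'mean_ms': 25.0, 'p50_ms': 20.0, 'p99_ms': 40.0, 'samples_per_s': 80.0}
```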

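Error Handling

import onnxruntime as ort
import numpy as np
# ONNX Runtime's exception classes live in the pybind module and are not
# re-exported at the top level; the exact types may vary between versions
from onnxruntime.capi.onnxruntime_pybind11_state import InvalidArgument

session = ort.InferenceSession("model.onnx")

try:
    # Common error: input shape mismatch
    wrong_input = np.random.randn(1, 4, 224, 224).astype(np.float32)  # wrong channel count
    session.run(None, {"input": wrong_input})
except InvalidArgument as e:
    print(f"Shape error: {e}")

try:
    # Common error: data type mismatch
    wrong_type = np.random.randn(1, 3, 224, 224).astype(np.float64)  # should be float32
    session.run(None, {"input": wrong_type})
except InvalidArgument as e:
    print(f"Type error: {e}")

try:
    # Common error: wrong input name (a missing required input raises ValueError)
    session.run(None, {"wrong_name": np.random.randn(1, 3, 224, 224).astype(np.float32)})
except (ValueError, InvalidArgument) as e:
    print(f"Name error: {e}")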

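Best Practices

Session Management

Reuse sessions: creating an InferenceSession has a noticeable cost, so create it once at program startup and reuse it
Thread safety: run() on a single session can be called from multiple threads concurrently; create the session itself once, from a single thread, instead of creating it concurrently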
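One way to enforce session reuse is a small cache keyed by model path. The sketch below is illustrative only: `SessionCache` is a hypothetical class, and the factory is injected so the example can run with a stand-in instead of a real model.

```python
import threading
from typing import Any, Callable, Dict, List


class SessionCache:
    """Create each session once per model path and reuse it afterwards."""

    def __init__(self, factory: Callable[[str], Any]):
        self._factory = factory        # e.g. onnxruntime.InferenceSession
        self._sessions: Dict[str, Any] = {}
        self._lock = threading.Lock()  # guards creation only, not inference

    def get(self, model_path: str) -> Any:
        with self._lock:
            if model_path not in self._sessions:
                self._sessions[model_path] = self._factory(model_path)
            return self._sessions[model_path]


# Demonstration with a counting stand-in factory instead of a real model
created: List[str] = []


def fake_factory(path: str) -> str:
    created.append(path)
    return f"session:{path}"


cache = SessionCache(fake_factory)
first = cache.get("model.onnx")
second = cache.get("model.onnx")
print(first is second, len(created))  # True 1
```

The lock is held only while looking up or creating a session, so inference calls on the returned session still run concurrently.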

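Memory Management

Pre-allocate output buffers: for outputs with a fixed shape, pre-allocating numpy arrays can reduce memory allocations
Use IOBinding: for GPU inference, use IOBinding to avoid unnecessary CPU-GPU data transfers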

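Batching Strategy

Choose a suitable batch size: too large risks running out of memory, too small leaves the hardware underused
Dynamic batching: in serving scenarios, requests can be accumulated and processed as a batch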
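Dynamic batching can be sketched as a collector that drains a request queue until the batch is full or a deadline passes. This is a simplified illustration with the standard library; `collect_batch` and its parameters are hypothetical names, and a production server would additionally need per-request result routing.

```python
import queue
import time
from typing import Any, List


def collect_batch(requests: "queue.Queue[Any]", max_batch: int,
                  max_wait_s: float) -> List[Any]:
    """Gather up to max_batch items, waiting at most max_wait_s in total.

    Returns as soon as the batch is full or the deadline passes; the
    result may be empty if nothing arrived in time.
    """
    batch: List[Any] = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


pending: "queue.Queue[int]" = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

print(collect_batch(pending, max_batch=3, max_wait_s=0.05))  # [0, 1, 2]
print(collect_batch(pending, max_batch=3, max_wait_s=0.05))  # [3, 4]
```

The two parameters trade latency for throughput: a larger `max_wait_s` fills batches more often but delays the first request in each batch.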

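Error Handling

Validate inputs: check input shapes and types before running inference
Graceful degradation: fall back to CPU automatically when the GPU is unavailable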
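Input validation can be done against the metadata the session already exposes. The sketch below is one possible approach: `validate_feeds` and `ORT_TO_NUMPY` are hypothetical names, the type table covers only a few common types, and `SimpleNamespace` objects stand in for the items returned by `session.get_inputs()` and for numpy arrays so the example runs without a model.

```python
from types import SimpleNamespace
from typing import Any, Dict, List

# Subset of ONNX Runtime type strings mapped to NumPy dtype names
ORT_TO_NUMPY = {
    "tensor(float)": "float32",
    "tensor(double)": "float64",
    "tensor(float16)": "float16",
    "tensor(int64)": "int64",
    "tensor(int32)": "int32",
    "tensor(bool)": "bool",
}


def validate_feeds(input_specs: List[Any], feeds: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK.

    input_specs: objects with .name, .shape and .type
    feeds: mapping from input name to an array-like with .shape and .dtype
    """
    problems = []
    known = {spec.name for spec in input_specs}
    for name in feeds:
        if name not in known:
            problems.append(f"unexpected input '{name}'")
    for spec in input_specs:
        if spec.name not in feeds:
            problems.append(f"missing input '{spec.name}'")
            continue
        value = feeds[spec.name]
        expected = ORT_TO_NUMPY.get(spec.type)
        if expected is not None and str(value.dtype) != expected:
            problems.append(f"'{spec.name}': dtype {value.dtype}, expected {expected}")
        if len(value.shape) != len(spec.shape):
            problems.append(f"'{spec.name}': rank {len(value.shape)}, expected {len(spec.shape)}")
            continue
        for axis, (want, got) in enumerate(zip(spec.shape, value.shape)):
            # Only integer dimensions are fixed; strings/None are dynamic
            if isinstance(want, int) and want != got:
                problems.append(f"'{spec.name}': dim {axis} is {got}, expected {want}")
    return problems


spec = SimpleNamespace(name="input", shape=["batch_size", 3, 224, 224], type="tensor(float)")
bad = SimpleNamespace(shape=(1, 4, 224, 224), dtype="float64")
for problem in validate_feeds([spec], {"input": bad}):
    print(problem)
# 'input': dtype float64, expected float32
# 'input': dim 1 is 4, expected 3
```

With a real session, `input_specs` would be `session.get_inputs()` and the feed values would be numpy arrays; reporting all problems at once tends to be more helpful than failing on the first one.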

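Next Steps

This chapter covered the core features of the ONNX Runtime Python API. The next chapter shows how to use ONNX Runtime from C++, which is the more common deployment route in production environments.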
