C++ Model Deployment

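In scenarios with strict latency and resource constraints, such as autonomous driving, industrial inspection, and real-time video analytics, C++ is often the language of choice for model deployment. ONNX Runtime provides a mature C++ API that supports cross-platform deployment and high-performance inference.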

This chapter walks through model deployment with the ONNX Runtime C++ API, from environment setup to a complete inference implementation.

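Why deploy with C++?

Python is convenient for model development and prototyping, but C++ has clear advantages in production:

Performance: C++ has no interpreter overhead and offers finer control over memory, which yields lower and more predictable latency.

Dependency management: The deployment package is small and needs neither a Python interpreter nor a large set of dependency packages.

System integration: It integrates directly with existing C++ systems and suits both embedded devices and high-performance servers.

Multithreading: C++ gives finer control over threads, which makes complex concurrent inference logic easier to implement.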

Environment setup

Getting the ONNX Runtime library

ONNX Runtime ships prebuilt binary packages, which can be downloaded from the GitHub Releases page:

https://github.com/microsoft/onnxruntime/releases

Choose the package that matches the target platform:

Platform	Example package name
Windows x64 CPU	onnxruntime-win-x64-1.16.3.zip
Windows x64 GPU	onnxruntime-win-x64-gpu-1.16.3.zip
Linux x64 CPU	onnxruntime-linux-x64-1.16.3.tgz
Linux x64 GPU	onnxruntime-linux-x64-gpu-1.16.3.tgz

After downloading, extract the archive. The directory layout looks like this:

onnxruntime/
├── include/           # header files
│   └── onnxruntime_cxx_api.h
├── lib/               # library files
│   ├── onnxruntime.dll    # Windows
│   ├── onnxruntime.lib    # Windows
│   └── libonnxruntime.so  # Linux
└── ...

CMake project configuration

A complete CMake project looks like this:

cmake_minimum_required(VERSION 3.15)
project(OnnxInference)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Path to the extracted ONNX Runtime package
set(ONNXRUNTIME_DIR "${CMAKE_SOURCE_DIR}/onnxruntime")

# Build the executable
add_executable(inference_demo
    src/main.cpp
    src/inference.cpp
)

# Header and library search paths, scoped to this target
target_include_directories(inference_demo PRIVATE ${ONNXRUNTIME_DIR}/include)
target_link_directories(inference_demo PRIVATE ${ONNXRUNTIME_DIR}/lib)

# Link ONNX Runtime (the library name is the same on Windows and Linux)
target_link_libraries(inference_demo PRIVATE onnxruntime)

# On Windows, copy the DLL into the output directory
if(WIN32)
    add_custom_command(TARGET inference_demo POST_BUILD
        COMMAND ${CMAKE_COMMAND} -E copy_if_different
        "${ONNXRUNTIME_DIR}/lib/onnxruntime.dll"
        $<TARGET_FILE_DIR:inference_demo>
    )
endif()

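Core concepts of the C++ API

The ONNX Runtime C++ API manages resources with RAII. The main classes are:

Class	Purpose	Lifetime
`Ort::Env`	Environment object; manages global state	Entire program
`Ort::Session`	Inference session; loads and runs a model	Long-lived, reused
`Ort::SessionOptions`	Session configuration options	While creating the Session
`Ort::MemoryInfo`	Memory allocation info	While creating tensors
`Ort::Value`	Tensor object; represents inputs and outputs	A single inference
`Ort::IoBinding`	I/O binding, used for GPU inference	A single inference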
Basic inference flow

A complete inference example

// main.cpp
#include <onnxruntime_cxx_api.h>
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    try {
        // 1. Create the environment
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "inference_demo");

        // 2. Configure session options
        Ort::SessionOptions session_options;
        session_options.SetIntraOpNumThreads(4);
        session_options.SetGraphOptimizationLevel(
            GraphOptimizationLevel::ORT_ENABLE_ALL
        );

        // 3. Create the inference session
        // ORTCHAR_T is wchar_t on Windows and char elsewhere; ORT_TSTR yields the matching literal
        const ORTCHAR_T* model_path = ORT_TSTR("model.onnx");
        Ort::Session session(env, model_path, session_options);

        // 4. Query the model's input and output metadata
        Ort::AllocatorWithDefaultOptions allocator;

        // Input metadata. Names are copied into std::string because the buffer
        // returned by GetInputNameAllocated is freed when its smart pointer is destroyed.
        size_t num_input_nodes = session.GetInputCount();
        std::vector<std::string> input_name_storage(num_input_nodes);
        std::vector<const char*> input_names(num_input_nodes);
        std::vector<std::vector<int64_t>> input_shapes(num_input_nodes);

        for (size_t i = 0; i < num_input_nodes; i++) {
            input_name_storage[i] = session.GetInputNameAllocated(i, allocator).get();
            input_names[i] = input_name_storage[i].c_str();

            auto type_info = session.GetInputTypeInfo(i);
            auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
            input_shapes[i] = tensor_info.GetShape();

            std::cout << "Input " << i << ": " << input_names[i] << std::endl;
            std::cout << "  Shape: [";
            for (size_t j = 0; j < input_shapes[i].size(); j++) {
                std::cout << input_shapes[i][j];
                if (j + 1 < input_shapes[i].size()) std::cout << ", ";
            }
            std::cout << "]" << std::endl;
        }

        // Output metadata
        size_t num_output_nodes = session.GetOutputCount();
        std::vector<std::string> output_name_storage(num_output_nodes);
        std::vector<const char*> output_names(num_output_nodes);

        for (size_t i = 0; i < num_output_nodes; i++) {
            output_name_storage[i] = session.GetOutputNameAllocated(i, allocator).get();
            output_names[i] = output_name_storage[i].c_str();
            std::cout << "Output " << i << ": " << output_names[i] << std::endl;
        }

        // 5. Prepare input data
        // Assumes a single input of shape [1, 3, 224, 224]
        std::vector<int64_t> input_shape = {1, 3, 224, 224};
        size_t input_tensor_size = 1 * 3 * 224 * 224;
        std::vector<float> input_tensor_values(input_tensor_size, 0.5f);

        // 6. Create the memory info
        auto memory_info = Ort::MemoryInfo::CreateCpu(
            OrtArenaAllocator,
            OrtMemTypeDefault
        );

        // 7. Create the input tensor (it refers to input_tensor_values; no copy is made)
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
            memory_info,
            input_tensor_values.data(),
            input_tensor_size,
            input_shape.data(),
            input_shape.size()
        );

        // 8. Run inference (this example feeds exactly one input tensor)
        std::vector<Ort::Value> output_tensors = session.Run(
            Ort::RunOptions{nullptr},
            input_names.data(),
            &input_tensor,
            1,
            output_names.data(),
            num_output_nodes
        );

        // 9. Read the output
        float* output_data = output_tensors[0].GetTensorMutableData<float>();
        auto output_info = output_tensors[0].GetTensorTypeAndShapeInfo();
        auto output_shape = output_info.GetShape();

        std::cout << "Output shape: [";
        for (size_t i = 0; i < output_shape.size(); i++) {
            std::cout << output_shape[i];
            if (i + 1 < output_shape.size()) std::cout << ", ";
        }
        std::cout << "]" << std::endl;

        // Print up to the first 10 output values
        size_t output_count = output_info.GetElementCount();
        std::cout << "First output values: ";
        for (size_t i = 0; i < std::min<size_t>(10, output_count); i++) {
            std::cout << output_data[i] << " ";
        }
        std::cout << std::endl;

    } catch (const Ort::Exception& e) {
        std::cerr << "ONNX Runtime error: " << e.what() << std::endl;
        return -1;
    }

    return 0;
}

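Code walkthrough

Environment (Env): The root object of ONNX Runtime, responsible for the logging system and thread-pool management. A program needs only one instance, usually created at the start of main, and it must outlive every Session created from it.

Session options (SessionOptions): The configuration object that controls inference behavior. Commonly used settings include:

SetIntraOpNumThreads(): sets the number of threads used to parallelize work inside a single operator
SetGraphOptimizationLevel(): sets the graph optimization level
AddConfigEntry(): adds a custom configuration entry

Session: The core object that loads the model and runs inference. It is created from a model path and session options. A Session is thread-safe, so multiple threads can call Run on it concurrently.

Tensor (Value): Represents input and output tensor data. Creating a tensor requires the memory info, a data pointer, the element count, and the shape.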
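The element count passed when creating a tensor has to agree with the shape, which is easy to get wrong once a model reports dynamic dimensions (ONNX Runtime returns them as -1 from GetShape). The following is a minimal sketch of that bookkeeping using only the standard library; `resolve_shape` and `element_count` are hypothetical helper names, not part of the ONNX Runtime API.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: replaces every dynamic dimension (-1) with
// `dynamic_value` and returns the resulting concrete shape.
std::vector<int64_t> resolve_shape(std::vector<int64_t> shape, int64_t dynamic_value = 1) {
    for (auto& dim : shape) {
        if (dim == -1) dim = dynamic_value;
        if (dim < 0) throw std::invalid_argument("negative dimension in shape");
    }
    return shape;
}

// Hypothetical helper: number of elements a buffer must hold for a concrete shape.
// An empty shape describes a scalar, so the result is 1.
std::size_t element_count(const std::vector<int64_t>& shape) {
    std::size_t count = 1;
    for (int64_t dim : shape) {
        if (dim < 0) throw std::invalid_argument("shape still has a dynamic dimension");
        count *= static_cast<std::size_t>(dim);
    }
    return count;
}
```

A caller could compare `element_count(resolve_shape(model_shape, batch))` with the size of its input buffer before building the tensor, and fail early with a clear message instead of relying on the runtime's error.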

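GPU inference configuration

To enable CUDA acceleration, append the CUDA execution provider to the session options before creating the session: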

#include <onnxruntime_cxx_api.h>

// ORTCHAR_T is wchar_t on Windows and char elsewhere
Ort::Session create_gpu_session(
    Ort::Env& env,
    const ORTCHAR_T* model_path
) {
    Ort::SessionOptions session_options;
    session_options.SetGraphOptimizationLevel(
        GraphOptimizationLevel::ORT_ENABLE_ALL
    );

    // Configure CUDA options
    OrtCUDAProviderOptions cuda_options;
    cuda_options.device_id = 0;  // GPU device ID
    cuda_options.arena_extend_strategy = 0;  // arena extension strategy
    cuda_options.gpu_mem_limit = 2ULL * 1024 * 1024 * 1024;  // GPU memory limit (2 GB)
    cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearch::OrtCudnnConvAlgoSearchExhaustive;
    cuda_options.do_copy_in_default_stream = 1;

    // Register the CUDA execution provider. This requires the GPU build of
    // ONNX Runtime; without this call the session runs on the CPU only.
    session_options.AppendExecutionProvider_CUDA(cuda_options);

    Ort::Session session(env, model_path, session_options);
    return session;
}

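Older ONNX Runtime releases exposed the CUDA provider through a separate factory header. Code written against that legacy interface looks like the following; recent packages may no longer ship the header, in which case Ort::SessionOptions::AppendExecutionProvider_CUDA is the approach to prefer:

// Legacy CUDA provider header (requires the GPU build of ONNX Runtime)
#include "cuda_provider_factory.h"

Ort::SessionOptions session_options;
// The second argument is the GPU device ID
Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0));

Ort::Session session(env, model_path, session_options);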

Wrapping inference in a class

In real projects, the inference logic is usually wrapped in a class:

// onnx_inference.h
#pragma once

#include <onnxruntime_cxx_api.h>
#include <string>
#include <vector>
#include <memory>

class OnnxInference {
public:
    OnnxInference(
        const std::string& model_path,
        int num_threads = 4,
        bool use_gpu = false,
        int gpu_id = 0
    );
    
    // Single-input inference
    std::vector<float> run(const std::vector<float>& input_data);

    // Batch inference
    std::vector<std::vector<float>> run_batch(
        const std::vector<std::vector<float>>& batch_input
    );

    // Input and output shapes
    std::vector<int64_t> get_input_shape() const { return input_shape_; }
    std::vector<int64_t> get_output_shape() const { return output_shape_; }
    
private:
    std::unique_ptr<Ort::Env> env_;
    std::unique_ptr<Ort::Session> session_;
    std::unique_ptr<Ort::SessionOptions> session_options_;
    
    std::vector<int64_t> input_shape_;
    std::vector<int64_t> output_shape_;
    std::vector<const char*> input_names_;
    std::vector<const char*> output_names_;
    
    Ort::MemoryInfo memory_info_;
    Ort::AllocatorWithDefaultOptions allocator_;
    
    void print_model_info();
};

// onnx_inference.cpp
#include "onnx_inference.h"
#include <iostream>
#include <stdexcept>

OnnxInference::OnnxInference(
    const std::string& model_path,
    int num_threads,
    bool use_gpu,
    int gpu_id
) : memory_info_(Ort::MemoryInfo::CreateCpu(
        OrtArenaAllocator, OrtMemTypeDefault
    )) {
    
    // Create the environment
    env_ = std::make_unique<Ort::Env>(
        ORT_LOGGING_LEVEL_WARNING,
        "OnnxInference"
    );

    // Configure session options
    session_options_ = std::make_unique<Ort::SessionOptions>();
    session_options_->SetIntraOpNumThreads(num_threads);
    session_options_->SetGraphOptimizationLevel(
        GraphOptimizationLevel::ORT_ENABLE_ALL
    );

    // GPU configuration
    if (use_gpu) {
        // Append the CUDA execution provider
        // Note: this requires linking against the GPU build of ONNX Runtime
        OrtCUDAProviderOptions cuda_options;
        cuda_options.device_id = gpu_id;
        session_options_->AppendExecutionProvider_CUDA(cuda_options);
    }

    // Create the session
    // Note: on Windows the Session constructor expects a wide-character (wchar_t) path,
    // so model_path has to be converted to std::wstring there
    session_ = std::make_unique<Ort::Session>(
        *env_,
        model_path.c_str(),
        *session_options_
    );

    // Query input and output metadata (this class assumes one input and one output;
    // with several, only the shape of the last one is kept)
    size_t num_inputs = session_->GetInputCount();
    size_t num_outputs = session_->GetOutputCount();

    input_names_.resize(num_inputs);
    output_names_.resize(num_outputs);

    for (size_t i = 0; i < num_inputs; i++) {
        auto name = session_->GetInputNameAllocated(i, allocator_);
        input_names_[i] = name.get();
        // Give up ownership so the pointer stays valid after this loop;
        // the string is intentionally never freed
        name.release();

        auto type_info = session_->GetInputTypeInfo(i);
        auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
        input_shape_ = tensor_info.GetShape();
    }

    for (size_t i = 0; i < num_outputs; i++) {
        auto name = session_->GetOutputNameAllocated(i, allocator_);
        output_names_[i] = name.get();
        name.release();  // same ownership handling as for the input names

        auto type_info = session_->GetOutputTypeInfo(i);
        auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
        output_shape_ = tensor_info.GetShape();
    }
    
    print_model_info();
}

void OnnxInference::print_model_info() {
    std::cout << "Model info:" << std::endl;

    std::cout << "  Input shape: [";
    for (size_t i = 0; i < input_shape_.size(); i++) {
        std::cout << input_shape_[i];
        if (i + 1 < input_shape_.size()) std::cout << ", ";
    }
    std::cout << "]" << std::endl;

    std::cout << "  Output shape: [";
    for (size_t i = 0; i < output_shape_.size(); i++) {
        std::cout << output_shape_[i];
        if (i + 1 < output_shape_.size()) std::cout << ", ";
    }
    std::cout << "]" << std::endl;
}

std::vector<float> OnnxInference::run(const std::vector<float>& input_data) {
    // Resolve dynamic dimensions (-1) to 1, i.e. a single sample
    std::vector<int64_t> actual_input_shape = input_shape_;
    for (auto& dim : actual_input_shape) {
        if (dim == -1) dim = 1;
    }

    // Expected number of input elements
    size_t input_size = 1;
    for (auto dim : actual_input_shape) {
        input_size *= static_cast<size_t>(dim);
    }

    if (input_data.size() != input_size) {
        throw std::runtime_error(
            "Input size mismatch: expected " + std::to_string(input_size) +
            ", got " + std::to_string(input_data.size())
        );
    }

    // Create the input tensor. CreateTensor takes a non-const pointer,
    // hence the const_cast; the input buffer is not modified.
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        memory_info_,
        const_cast<float*>(input_data.data()),
        input_data.size(),
        actual_input_shape.data(),
        actual_input_shape.size()
    );

    // Run inference
    std::vector<Ort::Value> output_tensors = session_->Run(
        Ort::RunOptions{nullptr},
        input_names_.data(),
        &input_tensor,
        1,
        output_names_.data(),
        output_names_.size()
    );

    // Copy the first output into a vector owned by the caller
    float* output_data = output_tensors[0].GetTensorMutableData<float>();
    size_t output_size =
        output_tensors[0].GetTensorTypeAndShapeInfo().GetElementCount();

    return std::vector<float>(output_data, output_data + output_size);
}

std::vector<std::vector<float>> OnnxInference::run_batch(
    const std::vector<std::vector<float>>& batch_input
) {
    std::vector<std::vector<float>> results;
    results.reserve(batch_input.size());
    
    for (const auto& input : batch_input) {
        results.push_back(run(input));
    }
    
    return results;
}
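The run_batch above calls run once per sample, so it does not benefit from batched execution. If the model's leading dimension is dynamic, one alternative is to pack all samples into a single contiguous buffer, run once with shape [N, ...], and split the flat output afterwards. The sketch below shows only the packing and splitting steps with the standard library; `pack_batch` and `split_batch` are hypothetical helpers, and the actual Run call on the batched tensor is left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: concatenates equally sized samples into one contiguous
// buffer, suitable for a tensor whose leading dimension is samples.size().
std::vector<float> pack_batch(const std::vector<std::vector<float>>& samples) {
    if (samples.empty()) return {};
    const std::size_t sample_size = samples.front().size();
    std::vector<float> packed;
    packed.reserve(sample_size * samples.size());
    for (const auto& sample : samples) {
        if (sample.size() != sample_size) {
            throw std::invalid_argument("all samples in a batch must have the same size");
        }
        packed.insert(packed.end(), sample.begin(), sample.end());
    }
    return packed;
}

// Hypothetical helper: splits a flat batched output back into per-sample vectors.
std::vector<std::vector<float>> split_batch(const std::vector<float>& flat,
                                            std::size_t batch_size) {
    if (batch_size == 0 || flat.size() % batch_size != 0) {
        throw std::invalid_argument("output size is not divisible by the batch size");
    }
    const auto per_sample = static_cast<std::ptrdiff_t>(flat.size() / batch_size);
    std::vector<std::vector<float>> result;
    result.reserve(batch_size);
    for (std::size_t i = 0; i < batch_size; ++i) {
        const auto first = flat.begin() + static_cast<std::ptrdiff_t>(i) * per_sample;
        result.emplace_back(first, first + per_sample);
    }
    return result;
}
```

Whether a single batched Run is faster than N separate calls depends on the model and the hardware, so it is worth measuring both before committing to either.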

Usage example

// main.cpp
#include "onnx_inference.h"
#include <iostream>
#include <vector>

int main() {
    try {
        // Create the inference instance
        OnnxInference infer("model.onnx", 4, false);

        // Prepare input data (assumes a 3x224x224 input)
        std::vector<float> input(3 * 224 * 224, 0.5f);

        // Run inference
        std::vector<float> output = infer.run(input);

        std::cout << "Inference finished, output size: " << output.size() << std::endl;

        // Batch inference
        std::vector<std::vector<float>> batch_input(10, input);
        auto batch_output = infer.run_batch(batch_input);
        std::cout << "Batch inference finished, batch count: " << batch_output.size() << std::endl;

    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return -1;
    }
    
    return 0;
}

Image preprocessing

In practice, input data usually needs preprocessing. Below is a complete image inference example:

// image_inference.h
#pragma once

#include "onnx_inference.h"
#include <opencv2/opencv.hpp>
#include <vector>

class ImageClassifier {
public:
    ImageClassifier(
        const std::string& model_path,
        int input_height = 224,
        int input_width = 224
    );
    
    std::vector<float> classify(const std::string& image_path);
    std::vector<float> classify(const cv::Mat& image);
    
    int get_top_class(const std::vector<float>& probs);
    
private:
    std::unique_ptr<OnnxInference> inference_;
    int input_height_;
    int input_width_;
    
    std::vector<float> preprocess(const cv::Mat& image);
    std::vector<float> softmax(const std::vector<float>& logits);
};

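// image_inference.cpp
#include "image_inference.h"
#include <algorithm>
#include <cmath>      // std::exp
#include <iterator>   // std::distance
#include <numeric>
#include <stdexcept>  // std::runtime_error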

ImageClassifier::ImageClassifier(
    const std::string& model_path,
    int input_height,
    int input_width
) : input_height_(input_height), input_width_(input_width) {
    inference_ = std::make_unique<OnnxInference>(model_path, 4, false);
}

std::vector<float> ImageClassifier::preprocess(const cv::Mat& image) {
    // Resize to the model's input size (expects a 3-channel BGR image, as returned by cv::imread)
    cv::Mat resized;
    cv::resize(image, resized, cv::Size(input_width_, input_height_));

    // Convert BGR to RGB
    cv::Mat rgb;
    cv::cvtColor(resized, rgb, cv::COLOR_BGR2RGB);

    // Convert to float and scale to [0, 1]
    rgb.convertTo(rgb, CV_32F, 1.0 / 255.0);

    // Standardize with the ImageNet mean and standard deviation
    // (named stddev so the variable does not shadow namespace std)
    const std::vector<float> mean = {0.485f, 0.456f, 0.406f};
    const std::vector<float> stddev = {0.229f, 0.224f, 0.225f};

    std::vector<float> input_tensor(3 * input_height_ * input_width_);

    // HWC -> CHW, applying the standardization on the way
    for (int c = 0; c < 3; c++) {
        for (int h = 0; h < input_height_; h++) {
            for (int w = 0; w < input_width_; w++) {
                float pixel = rgb.at<cv::Vec3f>(h, w)[c];
                input_tensor[c * input_height_ * input_width_ + h * input_width_ + w] =
                    (pixel - mean[c]) / stddev[c];
            }
        }
    }

    return input_tensor;
}

std::vector<float> ImageClassifier::softmax(const std::vector<float>& logits) {
    if (logits.empty()) return {};  // max_element on an empty range must not be dereferenced

    std::vector<float> probs(logits.size());

    // Subtract the maximum for numerical stability
    float max_val = *std::max_element(logits.begin(), logits.end());

    // Exponentiate and accumulate the sum
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); i++) {
        probs[i] = std::exp(logits[i] - max_val);
        sum += probs[i];
    }

    // Normalize so the probabilities sum to 1
    for (auto& p : probs) {
        p /= sum;
    }

    return probs;
}

std::vector<float> ImageClassifier::classify(const cv::Mat& image) {
    // Preprocess
    std::vector<float> input = preprocess(image);

    // Run inference
    std::vector<float> logits = inference_->run(input);

    // Softmax
    return softmax(logits);
}

std::vector<float> ImageClassifier::classify(const std::string& image_path) {
    cv::Mat image = cv::imread(image_path);
    if (image.empty()) {
        throw std::runtime_error("Failed to load image: " + image_path);
    }
    return classify(image);
}

int ImageClassifier::get_top_class(const std::vector<float>& probs) {
    // Index of the largest probability (0 for an empty vector)
    return static_cast<int>(std::distance(probs.begin(),
        std::max_element(probs.begin(), probs.end())));
}
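get_top_class returns only the single most likely class, while classification results are often reported as the top few candidates. A minimal sketch of a top-k variant follows, using only the standard library; `top_k_indices` is a hypothetical helper name and is not part of the ImageClassifier class above.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the indices of the k largest probabilities,
// ordered from most to least likely. k is clamped to probs.size().
std::vector<std::size_t> top_k_indices(const std::vector<float>& probs, std::size_t k) {
    k = std::min(k, probs.size());
    std::vector<std::size_t> indices(probs.size());
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    // Only the first k positions need to be sorted.
    std::partial_sort(indices.begin(),
                      indices.begin() + static_cast<std::ptrdiff_t>(k),
                      indices.end(),
                      [&probs](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });
    indices.resize(k);
    return indices;
}
```

For example, calling `top_k_indices(classifier.classify(image), 5)` would give the five class indices to look up in a label file, assuming `classifier` is an ImageClassifier instance.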

Multithreaded inference

Server-side workloads need to handle concurrent inference requests. An ONNX Runtime Session is thread-safe and can be shared across threads:

#include <onnxruntime_cxx_api.h>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

class AsyncInferenceEngine {
public:
    using Callback = std::function<void(std::vector<float>, std::string)>;
    
    AsyncInferenceEngine(
        const std::string& model_path,
        int num_threads = 4
    ) : stop_(false) {
        env_ = std::make_unique<Ort::Env>(ORT_LOGGING_LEVEL_WARNING, "AsyncEngine");
        
        session_options_ = std::make_unique<Ort::SessionOptions>();
        session_options_->SetIntraOpNumThreads(1);
        session_options_->SetGraphOptimizationLevel(
            GraphOptimizationLevel::ORT_ENABLE_ALL
        );
        
        session_ = std::make_unique<Ort::Session>(
            *env_, model_path.c_str(), *session_options_
        );
        
        // Query the input and output names (copied into std::string members)
        Ort::AllocatorWithDefaultOptions allocator;
        auto input_name = session_->GetInputNameAllocated(0, allocator);
        input_name_ = input_name.get();
        auto output_name = session_->GetOutputNameAllocated(0, allocator);
        output_name_ = output_name.get();
        
        // Start the worker threads
        for (int i = 0; i < num_threads; i++) {
            workers_.emplace_back(&AsyncInferenceEngine::worker_loop, this);
        }
    }
    
    ~AsyncInferenceEngine() {
        {
            std::unique_lock<std::mutex> lock(queue_mutex_);
            stop_ = true;
        }
        queue_cv_.notify_all();
        
        for (auto& worker : workers_) {
            if (worker.joinable()) {
                worker.join();
            }
        }
    }
    
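    void submit(
        const std::vector<float>& input,
        Callback callback
    ) {
        {
            std::unique_lock<std::mutex> lock(queue_mutex_);
            // Task is an aggregate, so build it with braces; emplace(input, callback)
            // would only compile from C++20 onwards
            tasks_.push(Task{input, std::move(callback)});
        }
        queue_cv_.notify_one();
    }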
    
private:
    struct Task {
        std::vector<float> input;
        Callback callback;
    };
    
    void worker_loop() {
        auto memory_info = Ort::MemoryInfo::CreateCpu(
            OrtArenaAllocator, OrtMemTypeDefault
        );
        
        while (true) {
            Task task;
            {
                std::unique_lock<std::mutex> lock(queue_mutex_);
                queue_cv_.wait(lock, [this] {
                    return stop_ || !tasks_.empty();
                });
                
                if (stop_ && tasks_.empty()) {
                    return;
                }
                
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            
            try {
                // Run inference (the input shape is hard-coded for this example)
                std::vector<int64_t> shape = {1, 3, 224, 224};
                Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
                    memory_info,
                    const_cast<float*>(task.input.data()),
                    task.input.size(),
                    shape.data(),
                    shape.size()
                );
                
                const char* input_names[] = {input_name_.c_str()};
                const char* output_names[] = {output_name_.c_str()};
                
                auto outputs = session_->Run(
                    Ort::RunOptions{nullptr},
                    input_names,
                    &input_tensor,
                    1,
                    output_names,
                    1
                );
                
                float* data = outputs[0].GetTensorMutableData<float>();
                auto output_shape = outputs[0].GetTensorTypeAndShapeInfo().GetShape();
                size_t size = 1;
                for (auto d : output_shape) size *= d;
                
                std::vector<float> result(data, data + size);
                task.callback(result, "");
                
            } catch (const Ort::Exception& e) {
                task.callback({}, e.what());
            }
        }
    }
    
    std::unique_ptr<Ort::Env> env_;
    std::unique_ptr<Ort::Session> session_;
    std::unique_ptr<Ort::SessionOptions> session_options_;
    
    std::string input_name_;
    std::string output_name_;
    
    std::vector<std::thread> workers_;
    std::queue<Task> tasks_;
    std::mutex queue_mutex_;
    std::condition_variable queue_cv_;
    bool stop_;
};
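The engine above reports results through a callback that receives the output and an error message. Callers that prefer to block on a result can adapt that interface to std::future. The sketch below is one way to do it, assuming an engine type with a `submit(input, callback)` member shaped like the one above; `submit_as_future` is a hypothetical helper, and `EchoEngine` is a stand-in used only so the example is self-contained.

```cpp
#include <exception>
#include <functional>
#include <future>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical adapter: wraps a callback-based submit(input, callback) call,
// where the callback receives (result, error_message), in a std::future.
template <typename Engine>
std::future<std::vector<float>> submit_as_future(Engine& engine,
                                                 const std::vector<float>& input) {
    // std::function needs a copyable callable, so the promise is shared.
    auto promise = std::make_shared<std::promise<std::vector<float>>>();
    std::future<std::vector<float>> future = promise->get_future();
    engine.submit(input, [promise](std::vector<float> result, std::string error) {
        if (error.empty()) {
            promise->set_value(std::move(result));
        } else {
            promise->set_exception(std::make_exception_ptr(std::runtime_error(error)));
        }
    });
    return future;
}

// Stand-in engine for illustration only: it invokes the callback immediately,
// echoing the input, or reports an error for empty input.
struct EchoEngine {
    using Callback = std::function<void(std::vector<float>, std::string)>;
    void submit(const std::vector<float>& input, Callback callback) {
        if (input.empty()) {
            callback({}, "empty input");
        } else {
            callback(input, "");
        }
    }
};
```

With this adapter, `submit_as_future(engine, input).get()` either returns the output vector or rethrows the reported error as std::runtime_error, which keeps error handling in one place on the calling side.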

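Memory management notes

Lifetime of tensor data

The data pointer passed when creating an Ort::Value must remain valid until inference has finished, because the tensor refers to the caller's buffer and does not copy it: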

// Wrong: the data is destroyed before inference runs
Ort::Value create_tensor_wrong() {
    std::vector<float> data(100, 0.5f);  // local variable
    auto memory_info = Ort::MemoryInfo::CreateCpu(
        OrtArenaAllocator, OrtMemTypeDefault
    );
    std::vector<int64_t> shape = {1, 100};

    // The returned tensor refers to data that is about to be destroyed!
    return Ort::Value::CreateTensor<float>(
        memory_info, data.data(), data.size(), shape.data(), shape.size()
    );
}  // data is destroyed here

// Right: make sure the data lives long enough
class InferenceWrapper {
    std::vector<float> input_buffer_;  // member variable, lives as long as the wrapper

    Ort::Value create_tensor() {
        input_buffer_.resize(100, 0.5f);
        auto memory_info = Ort::MemoryInfo::CreateCpu(
            OrtArenaAllocator, OrtMemTypeDefault
        );
        std::vector<int64_t> shape = {1, 100};

        return Ort::Value::CreateTensor<float>(
            memory_info,
            input_buffer_.data(),
            input_buffer_.size(),
            shape.data(),
            shape.size()
        );
    }
};

GPU memory management

With GPU inference, pay attention to how device memory is managed:

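// GPU inference with IoBinding
// "input" and "output" are placeholder names; use the names reported by the model
void gpu_inference(Ort::Session& session) {
    Ort::IoBinding io_binding(session);

    // Input tensor backed by host memory
    std::vector<float> host_input(3 * 224 * 224, 0.5f);
    std::vector<int64_t> shape = {1, 3, 224, 224};
    auto cpu_memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        cpu_memory_info, host_input.data(), host_input.size(), shape.data(), shape.size()
    );

    // Ask ONNX Runtime to allocate the output in GPU memory on device 0
    Ort::MemoryInfo cuda_memory_info("Cuda", OrtArenaAllocator, 0, OrtMemTypeDefault);

    // Bind the input and the output
    io_binding.BindInput("input", input_tensor);
    io_binding.BindOutput("output", cuda_memory_info);

    // Run inference
    Ort::RunOptions run_options;
    session.Run(run_options, io_binding);
}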

Error handling

ONNX Runtime reports errors through exceptions, so inference code should be wrapped in try-catch:

#include <onnxruntime_cxx_api.h>
#include <iostream>

void safe_inference(const char* model_path) {
    try {
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
        Ort::SessionOptions session_options;
        Ort::Session session(env, model_path, session_options);
        
        // Inference code...

    } catch (const Ort::Exception& e) {
        std::cerr << "ONNX Runtime error: " << e.what() << std::endl;
        std::cerr << "Error code: " << e.GetOrtErrorCode() << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Standard exception: " << e.what() << std::endl;
    }
}

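Common error codes:

Error code	Meaning	Common causes
`ORT_NO_SUCHFILE`	File not found	Wrong model path
`ORT_INVALID_ARGUMENT`	Invalid argument	Input shape mismatch, wrong data type
`ORT_NOT_IMPLEMENTED`	Not implemented	The model uses an unsupported operator
`ORT_RUNTIME_EXCEPTION`	Runtime error	Out of GPU memory, CUDA errors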

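Deployment checklist

When deploying the C++ program to a target machine, make sure the following are in place:

Dependencies:

The ONNX Runtime shared library (onnxruntime.dll or libonnxruntime.so)
CUDA libraries (when using the GPU)
OpenCV libraries (when using image processing)

Environment:

CUDA driver and runtime (for GPU inference)
Correct PATH / LD_LIBRARY_PATH settings

Validation:

Outputs match the Python version
Performance meets the latency and throughput requirements
Memory usage is stable, with no leaks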
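Checking that the C++ outputs match the Python version usually means comparing floating-point vectors with a tolerance, since different builds and execution providers rarely agree bit for bit. A minimal sketch of such a comparison follows; `all_close` is a hypothetical helper, and the default tolerances are assumptions that should be tuned per model.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when both vectors have the same length and every
// pair of elements satisfies |actual - expected| <= abs_tol + rel_tol * |expected|.
bool all_close(const std::vector<float>& actual, const std::vector<float>& expected,
               float rel_tol = 1e-3f, float abs_tol = 1e-5f) {
    if (actual.size() != expected.size()) return false;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const float diff = std::fabs(actual[i] - expected[i]);
        // Written as a negated <= so that NaN values are rejected as well.
        if (!(diff <= abs_tol + rel_tol * std::fabs(expected[i]))) return false;
    }
    return true;
}
```

The reference values could, for instance, be dumped from the Python pipeline to a file and loaded in a C++ test that feeds the same input through the deployed model.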

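Next steps

With C++ deployment covered, good topics to study next are:

Model optimization techniques, to further improve inference performance
INT8 quantized deployment, to reduce latency and memory footprint
Multi-model cascaded inference, to build complete AI application pipelines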
