
API Server Deployment

This chapter covers how to deploy a production-grade API service with vLLM.

Starting the OpenAI-Compatible Server

vLLM provides an OpenAI API-compatible HTTP server, which makes it straightforward to migrate existing applications.

Basic Usage

# Simplest form
vllm serve meta-llama/Llama-2-7b-chat-hf

# With explicit options
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9

Full Parameter Example

vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --quantization awq \
    --dtype auto \
    --api-key your-api-key  # optional: enable API key authentication

API Endpoints

Completions API

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Artificial intelligence is",
        "max_tokens": 100,
        "temperature": 0.8,
        "top_p": 0.95
    }'

Chat Completions API

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "What is machine learning?"}
        ],
        "max_tokens": 200,
        "temperature": 0.7
    }'

Models API

# List available models
curl http://localhost:8000/v1/models
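
The same endpoint can also be queried with the OpenAI Python SDK; a minimal sketch, assuming the server started above is reachable on localhost:8000:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# List the models served by this vLLM instance
for model in client.models.list():
    print(model.id)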

Responses API

vLLM supports OpenAI's Responses API, which provides a more structured response format:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.responses.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    input="What is machine learning?"
)

print(response.output_text)

Embeddings API

vLLM supports an OpenAI-compatible embeddings API for embedding models:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Create embeddings
response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="Hello, world!"
)

print(response.data[0].embedding)

For multimodal embedding models, you can use chat-style messages:

# Vision embedding example
response = client.embeddings.create(
    model="TIGER-Lab/VLM2Vec-Full",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image"}
            ]
        }
    ]
)

Transcriptions API (Speech Recognition)

vLLM supports an OpenAI-compatible speech-to-text API for automatic speech recognition (ASR) models:

Install Dependencies

The Transcriptions API requires the audio extras: pip install vllm[audio]

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Upload an audio file for transcription
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        file=audio_file,
        language="zh",
        response_format="verbose_json"
    )

print(transcription.text)

Calling with curl:

curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
    -H "Authorization: Bearer token-abc123" \
    -F "file=@audio.mp3" \
    -F "model=openai/whisper-large-v3-turbo" \
    -F "language=zh"

Supported audio formats: FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, WEBM

Translation API (Speech Translation)

Whisper models can translate audio from 55 non-English languages into English:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

with open("chinese_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3",
        file=audio_file
    )

print(translation.text)  # English translation

Model Limitation

The openai/whisper-large-v3-turbo model does not support translation; use a different Whisper variant instead.

Tokenizer API

vLLM provides tokenizer endpoints for debugging and analysis:

# Tokenize
curl http://localhost:8000/tokenize \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Hello, world!"
    }'

# Detokenize
curl http://localhost:8000/detokenize \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "tokens": [1, 15043, 29892, 3186, 29991]
    }'

Re-rank API

vLLM supports a re-rank API compatible with the Jina AI and Cohere APIs:

curl http://localhost:8000/rerank \
    -H "Content-Type: application/json" \
    -d '{
        "model": "BAAI/bge-reranker-base",
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a branch of artificial intelligence",
            "The weather is nice today",
            "Deep learning is a subfield of machine learning"
        ],
        "top_n": 2
    }'
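
The OpenAI SDK has no dedicated method for /rerank, so a plain HTTP call works; a minimal sketch with the requests library, mirroring the curl example above (the result field names follow the Jina-style response and may vary across vLLM versions):

import requests

# Call the /rerank endpoint directly
resp = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-base",
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a branch of artificial intelligence",
            "The weather is nice today",
            "Deep learning is a subfield of machine learning",
        ],
        "top_n": 2,
    },
    timeout=30,
)
resp.raise_for_status()

# Each result carries the original document index and a relevance score
for item in resp.json()["results"]:
    print(item["index"], item["relevance_score"])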

Calling from Python

Using the OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM does not require an API key by default
)

# Completions
response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Artificial intelligence is",
    max_tokens=100
)
print(response.choices[0].text)

# Chat Completions
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)
print(response.choices[0].message.content)

Streaming

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Streaming output
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Async Client

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

async def generate(prompt):
    response = await client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def main():
    prompts = ["Question 1", "Question 2", "Question 3"]
    tasks = [generate(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(r)

asyncio.run(main())

Extra Parameter Support

vLLM supports several sampling parameters that the OpenAI API does not; pass them via extra_body:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Using vLLM-specific parameters
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "top_k": 50,                  # top-k sampling
        "min_p": 0.05,                # minimum probability threshold
        "repetition_penalty": 1.1,    # repetition penalty
        "stop_token_ids": [2, 3],     # stop token IDs
        "ignore_eos": False,          # whether to ignore EOS
        "skip_special_tokens": True,  # skip special tokens in the output
    }
)

Supported Extra Sampling Parameters

  • top_k (int): Top-k sampling; only the k highest-probability tokens are considered
  • min_p (float): Minimum probability threshold; tokens below it are filtered out
  • repetition_penalty (float): Repetition penalty factor; values greater than 1 penalize repeated tokens
  • length_penalty (float): Length penalty, used for beam search
  • stop_token_ids (List[int]): Stop generation when any of these token IDs is produced
  • include_stop_str_in_output (bool): Whether to include the stop string in the output
  • ignore_eos (bool): Whether to ignore the EOS token
  • min_tokens (int): Minimum number of tokens to generate
  • skip_special_tokens (bool): Whether to skip special tokens in the output
  • spaces_between_special_tokens (bool): Whether to add spaces between special tokens

Request Priority

vLLM supports priority-based request scheduling:

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "High-priority request"}],
    extra_body={
        "priority": 1,  # lower values mean higher priority
    }
)

Structured Outputs

vLLM can constrain the model to produce output in a specific format:

# JSON output
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "List three kinds of fruit"}],
    extra_body={
        "response_format": {"type": "json_object"},
    }
)

# Using a JSON Schema
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Extract the article's information"}],
    extra_body={
        "structured_outputs": {
            "json_schema": {
                "name": "article",
                "schema": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "author": {"type": "string"},
                        "summary": {"type": "string"}
                    },
                    "required": ["title", "author"]
                }
            }
        }
    }
)

Repetition Detection

LLMs sometimes generate meaningless repetitive content (such as "abcdabcdabcd..." or runs of emoji) until the maximum output length is reached. vLLM can automatically detect and stop repetitive output:

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Generate some text"}],
    extra_body={
        "repetition_detection": {
            "ngram_size": 3,   # n-gram size: number of consecutive tokens checked for repetition
            "num_repeats": 5,  # repetition threshold: stop generating once it is exceeded
        }
    }
)

Parameter notes:

  • ngram_size: size of the n-gram to check; smaller values are more sensitive
  • num_repeats: maximum number of allowed repeats; generation stops once this is reached

Request ID Tracking

vLLM lets you assign a unique ID to each request, which helps with tracing and debugging:

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "request_id": "my-request-001",  # custom request ID
    }
)

# Read the request ID from the response
print(response._request_id)

Enable request ID headers when starting the server:

vllm serve meta-llama/Llama-2-7b-chat-hf \
--enable-request-id-headers

Cache Salting (Multi-Tenant Security)

In a multi-user environment, you can add a salt to the prefix cache to prevent attackers from probing other users' prompt contents:

import secrets

# Generate a cryptographically secure random salt
cache_salt = secrets.token_urlsafe(32)  # 256 bits

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Sensitive query"}],
    extra_body={
        "cache_salt": cache_salt,  # use a different salt per user/session
    }
)

Chat Template Configuration

Some models require you to specify a chat template manually:

# Point to a chat template file
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --chat-template /path/to/chat_template.jinja

# Pass the template inline as a string
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --chat-template '{% for message in messages %}{{ message.content }}{% endfor %}'

Chat Template Content Format

vLLM supports multiple chat content formats:

# String format (legacy)
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)

# OpenAI format (multimodal)
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
            ]
        }
    ]
)

The --chat-template-content-format flag forces a specific format: string or openai.
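
For example, to force the OpenAI-style content format at startup (a minimal sketch; pass string instead if your template expects plain strings):

vllm serve meta-llama/Llama-2-7b-chat-hf \
    --chat-template-content-format openai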

Production Deployment

Docker Deployment

FROM vllm/vllm-openai:latest

# Set environment variables
ENV MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
ENV TENSOR_PARALLEL_SIZE=1

# Expose the port
EXPOSE 8000

# Start the server
CMD python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --host 0.0.0.0 \
    --port 8000

Build and run:

docker build -t my-vllm-server .

docker run -d \
--name vllm-server \
--runtime nvidia --gpus all \
-p 8000:8000 \
-e MODEL_NAME=meta-llama/Llama-2-7b-chat-hf \
my-vllm-server
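
Once the container is running, a quick smoke test confirms the server is up before putting it behind a load balancer (adjust host/port if you mapped them differently):

# Check that the server lists the loaded model
curl http://localhost:8000/v1/models

# Follow the container logs while the model loads
docker logs -f vllm-server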

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Llama-2-7b-chat-hf
        - --tensor-parallel-size
        - "1"
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer
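
To deploy, save the manifests to a file and apply them; the file name vllm.yaml below is just a placeholder:

# Apply the Deployment and Service
kubectl apply -f vllm.yaml

# Watch the pod come up and check the exposed endpoint
kubectl get pods -l app=vllm
kubectl get service vllm-service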

Load Balancing

Load balancing with Nginx:

upstream vllm_backend {
    server localhost:8000;
    server localhost:8001;
    server localhost:8002;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Required for streaming responses
        proxy_buffering off;
        proxy_cache off;
    }
}

Security Configuration

API Key Authentication

vllm serve meta-llama/Llama-2-7b-chat-hf \
--api-key sk-your-secret-key

Client usage:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key"
)
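
The key is sent as a standard Bearer token, so plain curl works as well (assuming the server was started with the --api-key shown above):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-your-secret-key" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Hello"}]
    }'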

Rate Limiting

# Rate limiting with middleware (slowapi example)
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("10/minute")
async def chat_completions(request: Request):
    # Forward the request to the vLLM backend here
    pass

Monitoring and Logging

Enabling Metrics Collection

vllm serve meta-llama/Llama-2-7b-chat-hf \
--enable-metrics

Prometheus Metrics

vLLM exposes the following Prometheus metrics (a scrape example follows the list):

  • vllm:num_requests_running: number of requests currently being processed
  • vllm:num_requests_waiting: number of requests waiting to be processed
  • vllm:gpu_cache_usage_perc: GPU KV cache usage percentage
  • vllm:time_to_first_token_seconds: time to first token
  • vllm:time_per_output_token_seconds: time per output token
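
The metrics are served on the /metrics path of the same HTTP server, so a quick check looks like this:

# Inspect the raw metrics exposed by the server
curl http://localhost:8000/metrics

A minimal Prometheus scrape configuration sketch (the job name and scrape interval are placeholders to adapt to your setup):

# prometheus.yml: scrape the vLLM server
scrape_configs:
  - job_name: vllm
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8000"]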

Logging Configuration

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# vLLM logger
vllm_logger = logging.getLogger("vllm")
vllm_logger.setLevel(logging.DEBUG)

Summary

This chapter covered how to deploy the vLLM API service:

  1. Quick start: launch the server with the vllm serve command
  2. API usage: OpenAI-compatible Completions and Chat Completions APIs
  3. Production deployment: Docker and Kubernetes options
  4. Security: API key authentication and rate limiting
  5. Monitoring and logging: Prometheus metrics and logging configuration