# API Service Deployment

This chapter covers how to deploy a production-grade API service with vLLM.
## Launching the OpenAI-Compatible Server

vLLM ships an HTTP server that is compatible with the OpenAI API, making it easy to migrate existing applications.

### Basic Usage

```bash
# Simplest form
vllm serve meta-llama/Llama-2-7b-chat-hf

# With common options
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9
```
### Full Parameter Example

```bash
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --quantization awq \
    --dtype auto \
    --api-key your-api-key  # Optional: enable API authentication
```
## API Endpoints

### Completions API

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Artificial intelligence is",
        "max_tokens": 100,
        "temperature": 0.8,
        "top_p": 0.95
    }'
```
### Chat Completions API

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "What is machine learning?"}
        ],
        "max_tokens": 200,
        "temperature": 0.7
    }'
```
### Models API

```bash
# List available models
curl http://localhost:8000/v1/models
```
### Responses API

vLLM supports OpenAI's Responses API, which offers a more structured response format:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.responses.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    input="What is machine learning?"
)
print(response.output_text)
```
### Embeddings API

vLLM supports an OpenAI-compatible embeddings API for embedding models:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Create an embedding
response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="Hello, world!"
)
print(response.data[0].embedding)
```

For multimodal embedding models, chat-style messages can be used:

```python
# Vision embedding example
response = client.embeddings.create(
    model="TIGER-Lab/VLM2Vec-Full",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image"}
            ]
        }
    ]
)
```
### Transcriptions API (Speech Recognition)

vLLM supports an OpenAI-compatible speech-to-text API for automatic speech recognition (ASR) models.

Using the Transcriptions API requires the audio dependencies: `pip install vllm[audio]`

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Upload an audio file for transcription
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        file=audio_file,
        language="zh",
        response_format="verbose_json"
    )

print(transcription.text)
```

Calling with curl:

```bash
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
    -H "Authorization: Bearer token-abc123" \
    -F "file=@audio.mp3" \
    -F "model=openai/whisper-large-v3-turbo" \
    -F "language=zh"
```

Supported audio formats: FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, WEBM
### Translation API (Speech Translation)

Whisper models can translate audio in 55 non-English languages into English:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

with open("chinese_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3",
        file=audio_file
    )

print(translation.text)  # English translation of the audio
```

Note: the `openai/whisper-large-v3-turbo` model does not support translation; use another variant.
### Tokenizer API

vLLM provides tokenizer endpoints for debugging and analysis:

```bash
# Tokenize
curl http://localhost:8000/tokenize \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Hello, world!"
    }'

# Detokenize
curl http://localhost:8000/detokenize \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "tokens": [1, 15043, 29892, 3186, 29991]
    }'
```
### Re-rank API

vLLM supports a re-ranking API compatible with the Jina AI and Cohere APIs:

```bash
curl http://localhost:8000/rerank \
    -H "Content-Type: application/json" \
    -d '{
        "model": "BAAI/bge-reranker-base",
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a branch of artificial intelligence",
            "The weather is nice today",
            "Deep learning is a subfield of machine learning"
        ],
        "top_n": 2
    }'
```
## Calling from Python

### Using the OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM does not require an API key by default
)

# Completions
response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Artificial intelligence is",
    max_tokens=100
)
print(response.choices[0].text)

# Chat Completions
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)
print(response.choices[0].message.content)
```
### Streaming

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
### Async Client

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

async def generate(prompt):
    response = await client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def main():
    prompts = ["Question 1", "Question 2", "Question 3"]
    tasks = [generate(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(r)

asyncio.run(main())
```
## Extra Parameter Support

vLLM accepts a number of parameters that the OpenAI API does not; pass them via `extra_body`:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# vLLM-specific parameters
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "top_k": 50,                  # Top-k sampling
        "min_p": 0.05,                # Minimum probability threshold
        "repetition_penalty": 1.1,    # Repetition penalty
        "stop_token_ids": [2, 3],     # Stop token IDs
        "ignore_eos": False,          # Whether to ignore EOS
        "skip_special_tokens": True,  # Skip special tokens in the output
    }
)
```
### Supported Extra Sampling Parameters

| Parameter | Type | Description |
|---|---|---|
| `top_k` | int | Top-k sampling: only the k highest-probability tokens are considered |
| `min_p` | float | Minimum probability threshold; lower-probability tokens are filtered out |
| `repetition_penalty` | float | Repetition penalty; values above 1 penalize repeated tokens |
| `length_penalty` | float | Length penalty, used in beam search |
| `stop_token_ids` | List[int] | Stop generation when any of these token IDs is produced |
| `include_stop_str_in_output` | bool | Whether to include the stop string in the output |
| `ignore_eos` | bool | Whether to ignore the EOS token |
| `min_tokens` | int | Minimum number of tokens to generate |
| `skip_special_tokens` | bool | Whether to skip special tokens in the output |
| `spaces_between_special_tokens` | bool | Whether to add spaces between special tokens |
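To build intuition for how `top_k` and `min_p` prune the candidate set, here is a toy sketch in plain Python. It is an illustration of the two filters, not vLLM's internal implementation; note that min-p is usually scaled relative to the top token's probability:

```python
# Toy illustration of top-k and min-p filtering on a token distribution.
def filter_candidates(probs, top_k=None, min_p=None):
    """Return the token ids that survive top-k and min-p filtering.

    probs: dict mapping token id -> probability.
    """
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]  # keep only the k most likely tokens
    if min_p is not None:
        p_max = items[0][1]
        # min_p is commonly scaled by the top probability
        items = [(t, p) for t, p in items if p >= min_p * p_max]
    return [t for t, _ in items]

probs = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}
print(filter_candidates(probs, top_k=3))    # -> [0, 1, 2]
print(filter_candidates(probs, min_p=0.2))  # threshold = 0.2 * 0.5 = 0.1
```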
### Request Priority

vLLM supports priority-based request scheduling:

```python
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "High-priority request"}],
    extra_body={
        "priority": 1,  # Lower values are scheduled first
    }
)
```
### Structured Outputs

vLLM can constrain the model to produce output in a specific format:

```python
# JSON output
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "List three kinds of fruit"}],
    extra_body={
        "response_format": {"type": "json_object"},
    }
)

# Using a JSON Schema
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Extract the article metadata"}],
    extra_body={
        "structured_outputs": {
            "json_schema": {
                "name": "article",
                "schema": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "author": {"type": "string"},
                        "summary": {"type": "string"}
                    },
                    "required": ["title", "author"]
                }
            }
        }
    }
)
```
### Repetition Detection

LLMs sometimes emit meaningless repetition (such as "abcdabcdabcd..." or long runs of emoji) until the maximum output length is reached. vLLM can detect repetitive output and stop generation automatically:

```python
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Generate some text"}],
    extra_body={
        "repetition_detection": {
            "ngram_size": 3,   # Size of the n-gram checked for repeats
            "num_repeats": 5,  # Stop once this many repeats are seen
        }
    }
)
```

Parameters:

- `ngram_size`: size of the n-gram to check; smaller values are more sensitive
- `num_repeats`: maximum number of repeats allowed; generation stops once it is reached
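Conceptually, the detector watches for the same n-gram repeating back-to-back at the tail of the output. A toy sketch of that idea (not vLLM's actual implementation):

```python
def ends_with_repeats(tokens, ngram_size, num_repeats):
    """True if the sequence ends with the same n-gram repeated num_repeats times."""
    span = ngram_size * num_repeats
    if len(tokens) < span:
        return False
    tail = tokens[-span:]
    ngram = tail[:ngram_size]
    # Check that every consecutive chunk of the tail equals the first n-gram
    return all(
        tail[i:i + ngram_size] == ngram
        for i in range(0, span, ngram_size)
    )

# The 2-gram (1, 2) repeats 3 times at the end -> detected
print(ends_with_repeats([9, 1, 2, 1, 2, 1, 2], ngram_size=2, num_repeats=3))  # True
print(ends_with_repeats([1, 2, 3, 4, 5, 6], ngram_size=2, num_repeats=3))     # False
```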
### Request ID Tracking

vLLM lets you attach a unique ID to each request for tracing and debugging:

```python
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "request_id": "my-request-001",  # Custom request ID
    }
)

# Read the request ID back from the response
print(response._request_id)
```

To have the server echo request IDs in response headers, start it with:

```bash
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --enable-request-id-headers
```
### Cache Salt (Multi-Tenant Security)

In multi-user deployments, an attacker could probe the prefix cache to infer other users' prompt contents. To prevent this, add a salt to the prefix cache:

```python
import secrets

# Generate a cryptographically secure random salt
cache_salt = secrets.token_urlsafe(32)  # 256 bits

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Sensitive query"}],
    extra_body={
        "cache_salt": cache_salt,  # Use a different salt per user/session
    }
)
```
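The idea is that cached prefix blocks are keyed by a hash that incorporates the salt, so identical prompts submitted under different salts never share cache entries. A toy illustration of the keying idea (not vLLM's actual hashing scheme):

```python
import hashlib

def prefix_cache_key(cache_salt: str, prompt: str) -> str:
    # Hash the salt together with the prompt so two tenants with
    # different salts can never collide on the same cache key.
    return hashlib.sha256((cache_salt + "\x00" + prompt).encode()).hexdigest()

key_a = prefix_cache_key("tenant-a-salt", "sensitive prompt")
key_b = prefix_cache_key("tenant-b-salt", "sensitive prompt")
print(key_a != key_b)  # True: same prompt, different cache keys
```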
## Chat Template Configuration

Some models require a manually specified chat template:

```bash
# Point to a chat template file
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --chat-template /path/to/chat_template.jinja

# Or pass the template inline as a string
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --chat-template '{% for message in messages %}{{ message.content }}{% endfor %}'
```

### Chat Template Content Formats

vLLM accepts chat message content in several formats:

```python
# String content (traditional)
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)

# OpenAI list-of-parts content (multimodal)
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
            ]
        }
    ]
)
```

The `--chat-template-content-format` flag forces a specific format: `string` or `openai`.
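To see what a chat template actually does, here is a Python equivalent of the one-line Jinja template shown earlier, which simply concatenates message contents. Real templates additionally insert role markers and special tokens:

```python
def render_simple_template(messages):
    """Python equivalent of the Jinja template
    {% for message in messages %}{{ message.content }}{% endfor %}"""
    return "".join(m["content"] for m in messages)

messages = [
    {"role": "system", "content": "You are a helpful assistant. "},
    {"role": "user", "content": "Hello"},
]
print(render_simple_template(messages))  # -> "You are a helpful assistant. Hello"
```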
## Production Deployment

### Docker

```dockerfile
FROM vllm/vllm-openai:latest

# Environment variables
ENV MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
ENV TENSOR_PARALLEL_SIZE=1

# Expose the API port
EXPOSE 8000

# Start the server
CMD python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --host 0.0.0.0 \
    --port 8000
```

Build and run:

```bash
docker build -t my-vllm-server .

docker run -d \
    --name vllm-server \
    --runtime nvidia --gpus all \
    -p 8000:8000 \
    -e MODEL_NAME=meta-llama/Llama-2-7b-chat-hf \
    my-vllm-server
```
### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Llama-2-7b-chat-hf
        - --tensor-parallel-size
        - "1"
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer
```
### Load Balancing

Load balancing with Nginx:

```nginx
upstream vllm_backend {
    server localhost:8000;
    server localhost:8001;
    server localhost:8002;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Required for streaming responses
        proxy_buffering off;
        proxy_cache off;
    }
}
```
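When a proxy cannot sit in front of the servers, a simple alternative is client-side round-robin over several vLLM base URLs. A minimal sketch (the URLs are placeholders for your own instances):

```python
from itertools import cycle

class RoundRobinBackends:
    """Rotate through a fixed list of vLLM base URLs, one per request."""

    def __init__(self, base_urls):
        self._pool = cycle(base_urls)

    def next_base_url(self):
        return next(self._pool)

backends = RoundRobinBackends([
    "http://localhost:8000/v1",
    "http://localhost:8001/v1",
    "http://localhost:8002/v1",
])
# Each call picks the next backend; the fourth wraps back to the first.
print([backends.next_base_url() for _ in range(4)])
```

Each request would then construct its OpenAI client with `base_url=backends.next_base_url()`. Note this naive scheme ignores backend health and load; a real deployment would add health checks.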
## Security

### API Key Authentication

```bash
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --api-key sk-your-secret-key
```

Client side:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key"
)
```
### Rate Limiting

```python
# Rate limiting via middleware (slowapi is a third-party package)
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("10/minute")
async def chat_completions(request: Request):
    # Forward the request to the vLLM backend here
    ...
```
## Monitoring and Logging

### Enabling Metrics

```bash
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --enable-metrics
```

### Prometheus Metrics

vLLM exposes the following Prometheus metrics:

- `vllm:num_requests_running`: number of requests currently being processed
- `vllm:num_requests_waiting`: number of requests waiting in the queue
- `vllm:gpu_cache_usage_perc`: GPU KV-cache usage percentage
- `vllm:time_to_first_token_seconds`: time to first token
- `vllm:time_per_output_token_seconds`: latency per output token
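A minimal Prometheus scrape job for these metrics might look like the fragment below (the job name and target are placeholders; vLLM serves metrics on its main port at `/metrics`):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]
```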
### Logging Configuration

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# vLLM's own logger
vllm_logger = logging.getLogger("vllm")
vllm_logger.setLevel(logging.DEBUG)
```
## Summary

This chapter covered how to deploy the vLLM API service:

- Quick start: launch the server with the `vllm serve` command
- API calls: OpenAI-compatible Completions and Chat Completions APIs
- Production deployment: Docker and Kubernetes recipes
- Security: API key authentication and rate limiting
- Monitoring and logging: Prometheus metrics and logging configuration