API Service Deployment
This chapter walks through deploying a production-grade API service with vLLM.
Starting the OpenAI-Compatible Server
vLLM ships an HTTP server compatible with the OpenAI API, so existing applications can usually be migrated with little more than a base URL change.
Basic usage
# The simplest invocation
vllm serve meta-llama/Llama-2-7b-chat-hf

# With explicit parameters
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9
Commonly used parameters

Note that --quantization awq only works with an AWQ-quantized checkpoint; omit it for the stock meta-llama/Llama-2-7b-chat-hf weights.
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --quantization awq \
    --dtype auto \
    --api-key your-api-key  # optional: require clients to present this API key
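Once the server is up, you can check that it is ready to accept requests via the built-in health endpoint:

# Returns HTTP 200 when the server is ready
curl http://localhost:8000/health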
API Endpoints
Completions API
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Artificial intelligence is",
        "max_tokens": 100,
        "temperature": 0.8,
        "top_p": 0.95
    }'
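The endpoint also supports token-by-token streaming over server-sent events; add "stream": true and pass -N so curl does not buffer the output:

curl -N http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Artificial intelligence is",
        "max_tokens": 100,
        "stream": true
    }'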
Chat Completions API
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "What is machine learning?"}
        ],
        "max_tokens": 200,
        "temperature": 0.7
    }'
Models API
# List the available models
curl http://localhost:8000/v1/models
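The same endpoint is reachable through the OpenAI SDK, which the next section covers in detail; for reference, a minimal sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Print the IDs of the models this vLLM instance is serving
for model in client.models.list():
    print(model.id)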
Calling the API from Python
Using the OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM requires no API key unless --api-key is set
)

# Completions
response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Artificial intelligence is",
    max_tokens=100
)
print(response.choices[0].text)

# Chat Completions
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)
print(response.choices[0].message.content)
Streaming
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Stream tokens back as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Async client
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

async def generate(prompt):
    response = await client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def main():
    # Issue the three requests concurrently
    prompts = ["Question 1", "Question 2", "Question 3"]
    tasks = [generate(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(r)

asyncio.run(main())
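When firing off many requests at once, it is usually worth capping client-side concurrency so the server's request queue does not grow without bound. A minimal sketch reusing the generate() coroutine above; the limit of 8 is an arbitrary example value:

import asyncio

# Cap the number of in-flight requests (8 is an arbitrary example)
semaphore = asyncio.Semaphore(8)

async def generate_limited(prompt):
    async with semaphore:
        return await generate(prompt)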
Production Deployment
Docker deployment
FROM vllm/vllm-openai:latest

# Environment variables consumed by the launch command
ENV MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
ENV TENSOR_PARALLEL_SIZE=1

# Expose the API port
EXPOSE 8000

# The base image sets its own ENTRYPOINT; clear it so this CMD runs as written
ENTRYPOINT []
CMD python3 -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --host 0.0.0.0 \
    --port 8000
Build and run:
docker build -t my-vllm-server .

docker run -d \
    --name vllm-server \
    --runtime nvidia --gpus all \
    -p 8000:8000 \
    -e MODEL_NAME=meta-llama/Llama-2-7b-chat-hf \
    my-vllm-server
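If you prefer Docker Compose, an equivalent service definition might look like the sketch below (the GPU device reservation syntax requires a recent Compose version):

services:
  vllm:
    image: my-vllm-server
    ports:
      - "8000:8000"
    environment:
      - MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]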
Kubernetes deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Llama-2-7b-chat-hf
        - --tensor-parallel-size
        - "1"
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer
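In practice you will also want readiness and liveness probes so Kubernetes only routes traffic to a pod whose model has finished loading. A sketch using vLLM's /health endpoint, to be added under the container spec; the delay values are illustrative:

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60   # model loading can take minutes
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30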
Load Balancing
To scale beyond a single instance, run several vLLM servers and balance across them with Nginx:
upstream vllm_backend {
    server localhost:8000;
    server localhost:8001;
    server localhost:8002;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Disable buffering so streamed responses are forwarded immediately
        proxy_buffering off;
        proxy_cache off;
    }
}
Security Configuration
API key authentication
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --api-key sk-your-secret-key
Client call:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key"
)
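With curl, the key is sent as a standard Bearer token:

curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer sk-your-secret-key"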
Rate limiting
# Rate limiting in a FastAPI gateway placed in front of vLLM
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("10/minute")  # at most 10 requests per minute per client IP
async def chat_completions(request: Request):
    # Handle the request (see the forwarding sketch below)
    pass
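The handler above is only a stub; to be useful, the gateway must forward requests to the vLLM backend. Replace the stub with something like the following sketch, which uses the third-party httpx client (the backend URL and 120-second timeout are assumptions, and streaming responses are not handled):

import httpx

VLLM_BACKEND = "http://localhost:8000"  # assumed address of the vLLM server

@app.post("/v1/chat/completions")
@limiter.limit("10/minute")
async def chat_completions(request: Request):
    body = await request.json()
    async with httpx.AsyncClient(timeout=120) as http:
        # Relay the request body to vLLM and return its JSON response
        resp = await http.post(f"{VLLM_BACKEND}/v1/chat/completions", json=body)
    return resp.json()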
Monitoring and Logging
Metrics endpoint

The OpenAI-compatible server exposes Prometheus metrics at /metrics by default; no extra flag is needed:

curl http://localhost:8000/metrics
Prometheus metrics
vLLM exposes the following Prometheus metrics, among others:
- vllm:num_requests_running: number of requests currently being processed
- vllm:num_requests_waiting: number of requests waiting in the queue
- vllm:gpu_cache_usage_perc: GPU KV-cache usage
- vllm:time_to_first_token_seconds: time-to-first-token latency
- vllm:time_per_output_token_seconds: time spent per output token
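To collect these, point Prometheus at the server's /metrics endpoint; a minimal scrape configuration sketch (the job name and target are examples):

scrape_configs:
  - job_name: "vllm"                   # example job name
    static_configs:
      - targets: ["localhost:8000"]    # metrics are served at /metrics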
Logging configuration
import logging

# Configure the root logger's format and level
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Turn up vLLM's own logger
vllm_logger = logging.getLogger("vllm")
vllm_logger.setLevel(logging.DEBUG)
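vLLM's log verbosity can also be raised at launch through an environment variable:

# Raise vLLM's internal logging to DEBUG for this run
VLLM_LOGGING_LEVEL=DEBUG vllm serve meta-llama/Llama-2-7b-chat-hf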
Summary
This chapter covered how to deploy vLLM's API service:
- Quick start: launch the server with the vllm serve command
- API calls: OpenAI-compatible Completions and Chat Completions APIs
- Production deployment: Docker and Kubernetes options
- Security: API key authentication and rate limiting
- Monitoring and logging: Prometheus metrics and logging configuration