API Server Deployment

This chapter explains how to deploy a production-grade API service with vLLM.

Starting an OpenAI-Compatible Server

vLLM provides an HTTP server compatible with the OpenAI API, which makes it easy to migrate existing applications.

Basic Usage

# Simplest invocation
vllm serve meta-llama/Llama-2-7b-chat-hf

# With explicit parameters
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9

Common Parameters

vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --quantization awq \
    --dtype auto \
    --api-key your-api-key  # optional: require an API key for authentication

Note that --quantization awq only works with a checkpoint that has actually been quantized with AWQ; the base meta-llama/Llama-2-7b-chat-hf weights are not.
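Long option lists like the one above are easier to keep in version control as structured configuration. A small, hypothetical helper (not part of vLLM) that assembles the command line from keyword options might look like:

```python
def build_vllm_command(model: str, **opts) -> list[str]:
    """Assemble a `vllm serve` command line from keyword options.

    Option names use underscores and are converted to --kebab-case flags;
    a value of True becomes a bare flag, everything else is stringified.
    """
    cmd = ["vllm", "serve", model]
    for key, value in opts.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            cmd.append(flag)
        else:
            cmd.extend([flag, str(value)])
    return cmd


cmd = build_vllm_command(
    "meta-llama/Llama-2-7b-chat-hf",
    host="0.0.0.0",
    port=8000,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
)
print(" ".join(cmd))
```

The resulting list can be passed directly to subprocess.Popen, which avoids shell-quoting issues.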

API Endpoints

Completions API

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Artificial intelligence is",
        "max_tokens": 100,
        "temperature": 0.8,
        "top_p": 0.95
    }'

Chat Completions API

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "What is machine learning?"}
        ],
        "max_tokens": 200,
        "temperature": 0.7
    }'

Models API

# List available models
curl http://localhost:8000/v1/models

Python Client Usage

Using the OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM does not require an API key unless --api-key is set
)

# Completions
response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Artificial intelligence is",
    max_tokens=100
)
print(response.choices[0].text)

# Chat Completions
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)
print(response.choices[0].message.content)

Streaming

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Stream the response token by token
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Async Client

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

async def generate(prompt):
    response = await client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def main():
    prompts = ["Question 1", "Question 2", "Question 3"]
    tasks = [generate(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(r)

asyncio.run(main())
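asyncio.gather launches every request at once; with hundreds of prompts that can flood the server's waiting queue. A generic helper (not part of vLLM or the OpenAI SDK) that caps the number of in-flight requests with a semaphore:

```python
import asyncio


async def bounded_gather(coros, limit: int):
    """Run awaitables concurrently, but keep at most `limit` in flight.

    Results come back in the same order as the input, like asyncio.gather.
    """
    sem = asyncio.Semaphore(limit)

    async def _run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(_run(c) for c in coros))


async def demo():
    # Stand-in for generate(prompt): a short async task
    async def work(i):
        await asyncio.sleep(0.01)
        return i * 2

    return await bounded_gather([work(i) for i in range(5)], limit=2)


print(asyncio.run(demo()))
```

In the example above, the calls could be wrapped the same way: `await bounded_gather([generate(p) for p in prompts], limit=8)`.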

Production Deployment

Docker Deployment

FROM vllm/vllm-openai:latest

# Environment variables
ENV MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
ENV TENSOR_PARALLEL_SIZE=1

# Expose the service port
EXPOSE 8000

# Start the server (shell form so the env vars are expanded)
CMD python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --host 0.0.0.0 \
    --port 8000

Build and run:

docker build -t my-vllm-server .

docker run -d \
    --name vllm-server \
    --runtime nvidia --gpus all \
    -p 8000:8000 \
    -e MODEL_NAME=meta-llama/Llama-2-7b-chat-hf \
    my-vllm-server
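The same container can also be described declaratively. A docker-compose sketch roughly equivalent to the docker run command above (service and image names follow the example; adjust to your setup):

```yaml
services:
  vllm:
    image: my-vllm-server
    environment:
      - MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```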

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Llama-2-7b-chat-hf
        - --tensor-parallel-size
        - "1"
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer
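Large models take time to load, so the Service should not route traffic until the container is actually ready. A probe sketch to merge into the container spec above (it assumes the server's /health endpoint; the delay and period values are illustrative and should be tuned to your model's load time):

```yaml
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60    # model loading can take minutes
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
```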

Load Balancing

Load balancing with Nginx:

upstream vllm_backend {
    server localhost:8000;
    server localhost:8001;
    server localhost:8002;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Required for streaming responses
        proxy_buffering off;
        proxy_cache off;
    }
}

Security

API Key Authentication

vllm serve meta-llama/Llama-2-7b-chat-hf \
    --api-key sk-your-secret-key

With --api-key set, every request must carry an Authorization: Bearer header; the OpenAI SDK adds it automatically from api_key:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key"
)

Rate Limiting

vLLM does not rate-limit requests itself; one option is to put a small FastAPI proxy in front of it, for example with slowapi:

# Rate limiting in a FastAPI proxy using slowapi
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("10/minute")
async def chat_completions(request: Request):
    # Forward the request to the vLLM backend here
    ...

Monitoring and Logging

Metrics Endpoint

The OpenAI-compatible server exposes Prometheus metrics at the /metrics endpoint by default; no extra flag is required:

curl http://localhost:8000/metrics

Prometheus 指标

vLLM 暴露以下 Prometheus 指标:

  • vllm:num_requests_running:正在处理的请求数
  • vllm:num_requests_waiting:等待处理的请求数
  • vllm:gpu_cache_usage_perc:GPU 缓存使用率
  • vllm:time_to_first_token_seconds:首 token 延迟
  • vllm:time_per_output_token_seconds:每个输出 token 的耗时
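/metrics returns the Prometheus plain-text exposition format. For a quick look without a full Prometheus stack, a minimal parser sketch (it drops label sets, so metrics differing only by label overwrite each other; good enough for inspection, not for production scraping):

```python
def parse_prometheus_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-format output into {metric_name: value}.

    Comment lines (# HELP / # TYPE) are skipped and label sets inside
    {...} are stripped from the metric name.
    """
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # skip malformed lines
    return metrics


sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama"} 3.0
vllm:gpu_cache_usage_perc{model_name="llama"} 0.42
"""
print(parse_prometheus_metrics(sample))
```

Against a live server, the same function can be fed `requests.get("http://localhost:8000/metrics").text`.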

Logging

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Raise verbosity for vLLM's own loggers
vllm_logger = logging.getLogger("vllm")
vllm_logger.setLevel(logging.DEBUG)

Summary

This chapter covered deploying the vLLM API server:

  1. Quick start: launching the server with vllm serve
  2. API usage: OpenAI-compatible Completions and Chat Completions APIs
  3. Production deployment: Docker and Kubernetes
  4. Security: API key authentication and rate limiting
  5. Monitoring and logging: Prometheus metrics and logging configuration