
12 - Kubernetes Deployment

Build a production-grade vLLM inference service cluster on Kubernetes.


12.1 Kubernetes Deployment Architecture

12.1.1 Overall Architecture

                     ┌──────────────────────────┐
                     │    Ingress Controller    │
                     │     (Nginx/Traefik)      │
                     └────────────┬─────────────┘
                                  │
                     ┌────────────▼─────────────┐
                     │   Service (ClusterIP)    │
                     │      load balancing      │
                     └────────────┬─────────────┘
                 ┌────────────────┼────────────────┐
                 ▼                ▼                ▼
          ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
          │  Pod (GPU 0) │ │  Pod (GPU 1) │ │  Pod (GPU 2) │
          │ ┌───────────┐│ │ ┌───────────┐│ │ ┌───────────┐│
          │ │vLLM Server││ │ │vLLM Server││ │ │vLLM Server││
          │ │  Model A  ││ │ │  Model B  ││ │ │  Model A  ││
          │ └───────────┘│ │ └───────────┘│ │ └───────────┘│
          └──────────────┘ └──────────────┘ └──────────────┘
                 │                │                │
          ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
          │    GPU x1    │ │    GPU x1    │ │    GPU x1    │
          └──────────────┘ └──────────────┘ └──────────────┘

          ┌───────────────────────────────────────────┐
          │    Monitoring: Prometheus + Grafana       │
          │    Autoscaling: HPA / KEDA                │
          │    Storage: PVC for model cache           │
          └───────────────────────────────────────────┘

12.2 GPU Operator Installation

12.2.1 Install the NVIDIA GPU Operator

# Add the Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --set driver.enabled=true \
    --set toolkit.enabled=true \
    --set dcgm.enabled=true \
    --set dcgmExporter.enabled=true

12.2.2 Verify GPU Availability

# List GPU nodes
kubectl get nodes -l nvidia.com/gpu.present=true

# Check allocatable GPU resources
kubectl describe node <gpu-node> | grep -A5 "Allocatable"
# nvidia.com/gpu: 8

# Run a throwaway GPU test Pod
# (note: kubectl v1.24+ removed the --limits flag; see the manifest below)
kubectl run gpu-test --rm -it \
    --image=nvidia/cuda:12.4.0-base-ubuntu22.04 \
    --limits=nvidia.com/gpu=1 \
    -- nvidia-smi
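
If your kubectl no longer accepts resource flags on kubectl run, an equivalent one-shot Pod manifest works everywhere (a minimal sketch; the Pod name is arbitrary):

# gpu-test.yaml (apply with: kubectl apply -f gpu-test.yaml)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"

Inspect the output with kubectl logs gpu-test, then kubectl delete pod gpu-test.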

12.3 Basic Kubernetes Deployment

12.3.1 ConfigMap

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
  namespace: llm
data:
  MODEL_NAME: "Qwen/Qwen2.5-7B-Instruct"
  MAX_MODEL_LEN: "4096"
  GPU_MEMORY_UTILIZATION: "0.9"
  DTYPE: "auto"
  TRUST_REMOTE_CODE: "true"

12.3.2 Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: llm
  labels:
    app: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      # Node selector: pin to a GPU node type
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-A100-80GB-PCIe"
      
      # Tolerate the GPU node taint
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      
      # Anti-affinity: spread replicas across different nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - vllm-server
                topologyKey: kubernetes.io/hostname
      
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          
          # GPU resource requests and limits
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "64Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "4"
          
          # Startup command
          command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "$(MODEL_NAME)"
            - "--max-model-len"
            - "$(MAX_MODEL_LEN)"
            - "--gpu-memory-utilization"
            - "$(GPU_MEMORY_UTILIZATION)"
            - "--served-model-name"
            - "qwen-7b"
            - "--trust-remote-code"
          
          # Environment variables
          envFrom:
            - configMapRef:
                name: vllm-config
          
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          
          # Ports
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300  # model loading takes time
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
            timeoutSeconds: 5
          
          # Volume mounts: shared memory and model cache
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /root/.cache/huggingface
      
      # Volumes: shared memory (RAM-backed emptyDir) and model cache (PVC)
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
      
      # Grace period for in-flight requests to drain on shutdown
      terminationGracePeriodSeconds: 60
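
The fixed initialDelaySeconds above trades startup latency for safety. A startupProbe is often a better fit for slow model loads, since liveness checking only begins once the model is up (a sketch; size failureThreshold to your worst-case load time):

          # Optional startupProbe: tolerate up to ~20 min of model loading
          # (periodSeconds * failureThreshold = 1200s) before liveness applies
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 120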

12.3.3 Service

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: llm
  labels:
    app: vllm-server
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
  selector:
    app: vllm-server
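
With the Service in place, a quick smoke test from a workstation (assumes the served model name qwen-7b configured in the Deployment above):

# Forward the Service locally, then hit the OpenAI-compatible API
kubectl port-forward -n llm svc/vllm-service 8000:8000 &

curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello"}]}'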

12.3.4 Secret (HF Token)

# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: llm
type: Opaque
data:
  token: <base64-encoded-hf-token>

# Or create the Secret directly from the command line:
kubectl create secret generic hf-token \
    --namespace llm \
    --from-literal=token=hf_xxxxxxxxxxxxx

12.3.5 PVC (Model Cache)

# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: llm
spec:
  accessModes:
    - ReadWriteOnce  # use ReadWriteMany if the storage supports it; required when replicas span nodes, as in the Deployment above
  storageClassName: fast-ssd  # an SSD-backed storage class
  resources:
    requests:
      storage: 200Gi

12.4 Helm Chart

12.4.1 Chart Structure

vllm-chart/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── hpa.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── pvc.yaml
│   └── _helpers.tpl
└── README.md

12.4.2 values.yaml

# values.yaml
replicaCount: 2

image:
  repository: vllm/vllm-openai
  tag: latest
  pullPolicy: IfNotPresent

model:
  name: "Qwen/Qwen2.5-7B-Instruct"
  servedName: "qwen-7b"
  maxModelLen: 4096
  gpuMemoryUtilization: 0.9
  trustRemoteCode: true
  dtype: "auto"

gpu:
  count: 1
  type: "NVIDIA-A100-80GB-PCIe"

resources:
  limits:
    memory: "64Gi"
    cpu: "8"
  requests:
    memory: "32Gi"
    cpu: "4"

service:
  type: ClusterIP
  port: 8000

ingress:
  enabled: true
  className: nginx
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
  hosts:
    - host: llm.example.com
      paths:
        - path: /
          pathType: Prefix

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetGPUUtilizationPercentage: 80

monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 10s

shmSize: "16Gi"

modelCache:
  enabled: true
  size: 200Gi
  storageClass: fast-ssd

hfToken:
  enabled: true
  existingSecret: ""
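
The templates consume these values with standard Helm templating. A representative excerpt of templates/deployment.yaml (illustrative; helper names and structure vary by chart):

# templates/deployment.yaml (excerpt)
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: vllm
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          args:
            - "--model"
            - "{{ .Values.model.name }}"
            - "--served-model-name"
            - "{{ .Values.model.servedName }}"
            - "--max-model-len"
            - "{{ .Values.model.maxModelLen }}"
            - "--gpu-memory-utilization"
            - "{{ .Values.model.gpuMemoryUtilization }}"
          resources:
            limits:
              nvidia.com/gpu: {{ .Values.gpu.count | quote }}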

12.4.3 Installing the Helm Chart

# Add your own chart repository
helm repo add vllm https://your-registry.com/charts
helm repo update

# Install
helm install vllm-service vllm/vllm-chart \
    --namespace llm \
    --create-namespace \
    --values values.yaml

# Upgrade
helm upgrade vllm-service vllm/vllm-chart \
    --namespace llm \
    --values values.yaml

# Check status
helm status vllm-service -n llm

12.5 Autoscaling

12.5.1 HPA (CPU and Custom Metrics)

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: vllm:num_requests_waiting
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
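
The vllm:num_requests_waiting Pods metric is not visible to the HPA by default; it must be served through the custom metrics API, typically via prometheus-adapter. A sketch of the adapter rule (assumes vLLM metrics are already scraped into Prometheus; adjust names to your setup):

# prometheus-adapter values (excerpt)
rules:
  custom:
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        as: "vllm:num_requests_waiting"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'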

12.5.2 KEDA (Custom Metrics)

KEDA provides more flexible scaling strategies:

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: llm
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 20
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    # Scale on waiting-queue length
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_num_requests_waiting
        query: avg(vllm:num_requests_waiting{model="qwen-7b"})
        threshold: "50"
    
    # Scale on GPU KV-cache usage
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_gpu_cache_usage
        query: avg(vllm:gpu_cache_usage_perc{model="qwen-7b"})
        threshold: "0.85"
    
    # Scheduled (cron-based) scaling
    - type: cron
      metadata:
        timezone: Asia/Shanghai
        start: "0 9 * * *"    # scale up at 09:00 daily
        end: "0 22 * * *"     # scale back down at 22:00
        desiredReplicas: "5"

12.5.3 Installing KEDA

helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
    --namespace keda \
    --create-namespace
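
KEDA translates each ScaledObject into a regular HPA behind the scenes, so both objects can be inspected once installed:

# Verify the scaler is active and view the HPA that KEDA generated
kubectl get scaledobject vllm-scaledobject -n llm
kubectl get hpa -n llm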

12.6 GPU Scheduling Strategies

12.6.1 GPU Allocation Methods

| Method         | Mechanism                                        | Use case                    |
|----------------|--------------------------------------------------|-----------------------------|
| Whole GPU      | nvidia.com/gpu: 1                                | one Pod per dedicated GPU   |
| MIG partitions | MIG instances on A100/H100                       | sharing among small models  |
| Time-slicing   | GPU sharing via the device plugin (sketch below) | dev/test environments       |
| vGPU           | NVIDIA vGPU software                             | enterprise-grade sharing    |
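
For the time-slicing row above, the GPU Operator's device plugin can advertise one physical GPU as several schedulable units. A minimal sketch of the sharing config (the ConfigMap name and replica count are illustrative; the Operator must reference it via its devicePlugin.config settings, so check NVIDIA's docs for the authoritative schema):

# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each GPU is exposed as 4 allocatable units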

12.6.2 MIG (Multi-Instance GPU) Configuration

# Request a MIG instance (a 1g.10gb slice of an A100 80GB)
resources:
  limits:
    nvidia.com/mig-1g.10gb: "1"

# Enable MIG mode and create instances (run on the node;
# profile 19 = 1g.10gb on an A100 80GB, seven instances per GPU)
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# Using MIG in Kubernetes requires the GPU Operator's MIG strategy
helm upgrade gpu-operator nvidia/gpu-operator \
    --set mig.strategy=mixed

12.7 Network Configuration

12.7.1 Ingress Configuration

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: llm
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
    nginx.ingress.kubernetes.io/server-snippet: |
      proxy_cache off;
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm.example.com
      secretName: llm-tls
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000

Key point: streaming responses require disabling Nginx's proxy_buffering; otherwise SSE events are buffered and arrive in bursts rather than token by token.
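
A quick end-to-end check that streaming survives the Ingress (assumes the host and served model name configured above; -N disables curl's own buffering):

curl -N https://llm.example.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen-7b", "stream": true,
         "messages": [{"role": "user", "content": "Count to 10"}]}'
# Tokens should arrive incrementally as "data: ..." SSE events,
# not in one burst after generation finishes.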


12.8 Multi-Model Deployment

12.8.1 One Deployment per Model

# Multi-model: one Deployment per model

# Model A
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen-7b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen-7b
  template:
    metadata:
      labels:
        app: vllm-qwen-7b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "Qwen/Qwen2.5-7B-Instruct", "--served-model-name", "qwen-7b"]
          resources:
            limits:
              nvidia.com/gpu: "1"

# Model B
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen-coder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen-coder
  template:
    metadata:
      labels:
        app: vllm-qwen-coder
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "Qwen/Qwen2.5-Coder-7B-Instruct", "--served-model-name", "qwen-coder"]
          resources:
            limits:
              nvidia.com/gpu: "1"
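
Each Deployment then gets its own Service, and a single Ingress can fan out by path. A routing sketch (hypothetical Service names matching the Deployments above; in practice a path rewrite, or one hostname per model, is needed so upstreams still see /v1/... paths):

# ingress rules (excerpt)
rules:
  - host: llm.example.com
    http:
      paths:
        - path: /qwen-7b
          pathType: Prefix
          backend:
            service:
              name: vllm-qwen-7b
              port:
                number: 8000
        - path: /qwen-coder
          pathType: Prefix
          backend:
            service:
              name: vllm-qwen-coder
              port:
                number: 8000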

12.9 High-Availability Configuration

12.9.1 Pod Disruption Budget

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: llm
spec:
  minAvailable: 1  # keep at least 1 Pod running
  selector:
    matchLabels:
      app: vllm-server

12.9.2 Graceful Restarts

# Ensure zero downtime during rolling updates
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 120  # generous window for in-flight requests to finish
      containers:
        - name: vllm
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 30"]  # wait for the load balancer to drain traffic

12.10 Caveats

Shared memory: vLLM needs a large /dev/shm. The Kubernetes default of 64MB is not enough; mount a Memory-backed emptyDir, as in the Deployment above.

Model load time: large models can take 5-15 minutes to load. Set a generous initialDelaySeconds (or use a startupProbe) so Pods are not killed as unhealthy during startup.

GPU scheduling: make sure the cluster has enough GPU capacity, and request GPUs through the nvidia.com/gpu resource rather than generic CPU requests.

Storage performance: model load speed is bounded by storage throughput. Prefer local SSDs or a high-performance NFS/CSI driver for the model cache.

Networking: with tensor parallelism (multiple GPUs), GPU-to-GPU communication inside a Pod should go over NVLink; cross-node TP requires a high-speed network (see the sketch below).
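
For the tensor-parallelism point, all GPUs of a TP group live in one Pod: the container requests multiple GPUs and passes --tensor-parallel-size. A minimal excerpt (the 4-GPU count and 72B model are illustrative):

# Pod spec excerpt: one container driving 4 GPUs with tensor parallelism
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model"
      - "Qwen/Qwen2.5-72B-Instruct"
      - "--tensor-parallel-size"
      - "4"
    resources:
      limits:
        nvidia.com/gpu: "4"   # all four GPUs scheduled onto the same node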


12.11 Further Reading

