Scheduling AI Workloads on K8s: Best Practices for GPU Resource Management
Preface: in 2026, AI workloads have become a common scenario in K8s clusters. This article walks through GPU resource management, elastic scaling, and cost optimization practices on K8s.
1. Special Requirements of AI Workloads
1.1 Comparison with Traditional Applications

| Characteristic | Traditional applications | AI workloads |
|---|---|---|
| Resource type | CPU / memory | GPU / TPU |
| Resource granularity | cores / GB | GPU cards / VRAM |
| Scheduling policy | balanced spread | consolidated (bin-packing) |
| Scaling pattern | horizontal | vertical |
| Cost sensitivity | medium | high |
1.2 Core Challenges

- Challenge 1: GPUs are expensive. A100: $3-4/hour; H100: $10-15/hour. Idle time is wasted money.
- Challenge 2: Resource fragmentation. GPU 0 is 80% used, GPU 1 is 20% used, and a new task that needs a full GPU cannot be scheduled.
- Challenge 3: Multi-tenant contention. Team A runs training jobs (high priority), Team B runs inference (low priority). How do you allocate GPUs fairly?
2. GPU Resource Management
2.1 GPU Discovery and Registration
NVIDIA Device Plugin:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds   # must match the selector above
    spec:
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```
Verify GPU resources:
```bash
# List nodes that expose GPUs
kubectl get nodes -l nvidia.com/gpu=present

# Check allocated GPU resources on a specific node
kubectl describe node gpu-node-1 | grep -A 5 "Allocated resources"
# Example output (resource / capacity / requests / limits):
#   nvidia.com/gpu    4    2    2
```
2.2 GPU Resource Requests
Example Pod spec (note that for extended resources such as nvidia.com/gpu, the request must equal the limit, since GPUs cannot be overcommitted):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  containers:
  - name: training
    image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 32Gi
        cpu: 8
      requests:
        nvidia.com/gpu: 1
        memory: 16Gi
        cpu: 4
```
2.3 GPU Sharing (MIG)
NVIDIA A100/H100 GPUs support MIG (Multi-Instance GPU), which partitions one physical GPU into isolated instances:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-job
spec:
  containers:
  - name: training
    image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime  # image added for completeness
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1   # one 3g.20gb MIG instance
```
MIG profile comparison:

| Profile | Compute slices | Memory | Use case |
|---|---|---|---|
| 1g.5gb | 1 | 5GB | inference, small-scale training |
| 2g.10gb | 2 | 10GB | medium-scale training |
| 3g.20gb | 3 | 20GB | large-scale training |
| 7g.40gb | 7 | 40GB | very large-scale training |
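For GPUs without MIG support, or when strict isolation is not required, the device plugin also offers time-slicing. A minimal sketch of the plugin's sharing configuration, assuming the plugin is deployed with a config file; the replica count is illustrative:

```yaml
# Sketch: NVIDIA device plugin time-slicing config (illustrative values).
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources.
# Unlike MIG, there is no memory or fault isolation between the sharing pods.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```

Time-slicing trades isolation for density, so it suits bursty inference pods rather than training jobs.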
3. Elastic Scaling Strategies
3.1 Horizontal Scaling (HPA)
Based on CPU/memory:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
3.2 Vertical Scaling (VPA)
Automatically adjusts resource requests:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: training-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: training-job
  updatePolicy:
    updateMode: Auto
```
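VPA recommendations can be bounded so they never exceed what a node can actually offer. A sketch adding a resourcePolicy to the VPA above (the min/max values are assumptions for illustration); note that VPA manages CPU and memory only, not nvidia.com/gpu:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: training-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: training-job
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      controlledResources: ["cpu", "memory"]  # VPA does not touch GPU counts
      minAllowed:
        cpu: 2
        memory: 8Gi
      maxAllowed:
        cpu: 16
        memory: 64Gi
```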
3.3 Cluster Scaling (Cluster Autoscaler)
Automatically adds and removes nodes (note: the Cluster Autoscaler is normally configured via command-line flags; the ConfigMap below just illustrates typical settings):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-config
  namespace: kube-system
data:
  scale-down-enabled: "true"
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  max-node-provision-time: 15m
```
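When several node groups could satisfy a pending GPU pod, the autoscaler's priority expander can try cheaper (e.g. Spot) groups first. A sketch, assuming the autoscaler runs with --expander=priority; the node-group name patterns are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander   # this exact name is required by the expander
  namespace: kube-system
data:
  priorities: |-
    # higher number = tried first; entries are regexes on node-group names
    20:
      - .*gpu-spot.*
    10:
      - .*gpu-ondemand.*
```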
4. Multi-Tenant Isolation
4.1 Namespace Isolation
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a-ai
  labels:
    team: team-a
    gpu-quota: "4"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a-ai
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 128Gi
    limits.cpu: "64"
    limits.memory: 256Gi
    requests.nvidia.com/gpu: "4"   # extended resources are quota'd via the requests. prefix
```
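A ResourceQuota caps the namespace total but sets no per-pod defaults, so a single pod can still omit requests entirely. A LimitRange fills that gap; the values below are illustrative assumptions:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a-ai
spec:
  limits:
  - type: Container
    default:            # applied when a container omits limits
      cpu: "4"
      memory: 16Gi
    defaultRequest:     # applied when a container omits requests
      cpu: "2"
      memory: 8Gi
```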
4.2 Priority Scheduling
Define priority classes:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority workloads (training)"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low-priority workloads (inference)"
```
Use the priority class:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  priorityClassName: high-priority
  containers:
  - name: training
    image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime  # image added for completeness
```
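High-priority pods will evict lower-priority ones by default, which may be too aggressive for serving traffic. If training jobs should jump the scheduling queue without killing running inference pods, a non-preempting class can be declared with preemptionPolicy: Never. A sketch:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-no-preempt
value: 1000000
preemptionPolicy: Never   # schedules ahead of lower priorities but never evicts them
globalDefault: false
description: "High priority without preemption (sketch)"
```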
4.3 Node Affinity
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type            # custom node label
            operator: In
            values:
            - a100
            - h100
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: gpu-count           # custom node label
            operator: Gt
            values:
            - "4"
  containers:
  - name: training                   # minimal container added for completeness
    image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
```
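Affinity attracts GPU pods to GPU nodes but does nothing to keep ordinary pods off them. Pairing it with a taint on the GPU nodes does; the taint key and value below are assumptions:

```yaml
# Taint applied to GPU nodes out of band, e.g.:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# GPU pods then carry a matching toleration:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job-tolerated
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  containers:
  - name: training
    image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
```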
5. Cost Optimization Techniques
5.1 Off-Peak Scheduling
Use idle capacity at night and on weekends:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: night-training
spec:
  schedule: "0 2 * * *"   # 02:00 every day
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure   # required for Job pod templates
          containers:
          - name: training
            image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime  # image added for completeness
            resources:
              limits:
                nvidia.com/gpu: 4
```
5.2 Preemptible Instances
Use Spot instances to cut costs:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spot-training
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
  tolerations:
  - key: "spot-instance"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: training                   # minimal container added for completeness
    image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
```
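A toleration allows scheduling onto Spot nodes but does not require it; a nodeSelector pins the pod there. The label below is the one EKS managed node groups apply to Spot nodes; other providers use different labels, so treat this as a provider-specific assumption:

```yaml
# Fragment: add to the pod spec above to force placement on Spot capacity (EKS)
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
```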
Cost comparison:

| Instance type | Price (vs. on-demand) | Reliability | Use case |
|---|---|---|---|
| On-demand | 100% | high | production workloads |
| Spot | 30-50% | medium | training jobs |
| Reserved | 60-70% | high | long-running workloads |
5.3 Resource Utilization Optimization
Monitor GPU utilization:
```bash
# Install the DCGM exporter (assumes the NVIDIA helm repo has been added as "nvdp")
helm install dcgm-exporter nvdp/dcgm-exporter

# Expose the metrics endpoint locally
kubectl port-forward svc/dcgm-exporter 9400

# Example PromQL query for per-GPU utilization:
#   DCGM_FI_DEV_GPU_UTIL{gpu="0"}
```
Optimization guidelines:

| Utilization | Recommendation |
|---|---|
| <30% | reduce GPU count or use MIG |
| 30-70% | reasonable |
| >90% | consider adding GPUs |
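The under-30% case can be caught automatically rather than by eyeballing dashboards. A sketch of a PrometheusRule that fires when a GPU stays below 30% utilization for an hour, assuming the Prometheus Operator and the DCGM exporter metrics above; the threshold and duration are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-underutilized
spec:
  groups:
  - name: gpu-utilization
    rules:
    - alert: GPUUnderutilized
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 30
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} has averaged under 30% utilization for 1h"
```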
6. Case Studies
6.1 Case 1: Large-Model Training Cluster
Requirements:
- 8 A100 GPUs
- Distributed training
- Runtime: 3-5 days
Configuration:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llm-training
spec:
  serviceName: llm-training
  replicas: 8
  selector:
    matchLabels:
      app: llm-training        # selector/labels added; required by StatefulSet
  template:
    metadata:
      labels:
        app: llm-training
    spec:
      containers:
      - name: training
        image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 80Gi
        env:
        - name: NNODES
          value: "8"
        - name: NPROC_PER_NODE
          value: "1"
```
Results:
- Training time: 4 days
- GPU utilization: 85%
- Cost: $2,880 (8 A100 × $3.75/hour × 96 hours)
6.2 Case 2: Inference Service Cluster
Requirements:
- Low latency (<100ms)
- High concurrency (1000 QPS)
- Elastic scaling
Configuration:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference-service   # selector/labels added; required by Deployment
  template:
    metadata:
      labels:
        app: inference-service
    spec:
      containers:
      - name: inference
        image: nvcr.io/nvidia/tritonserver:23.05-py3
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second   # requires a custom-metrics adapter (e.g. prometheus-adapter)
      target:
        type: AverageValue
        averageValue: "100"
```
Results:
- Average latency: 50ms
- P99 latency: 90ms
- Autoscaling range: 2-20 replicas
7. Best-Practices Checklist
7.1 Resource Management
- Deploy the NVIDIA Device Plugin on every GPU node
- Set explicit GPU requests/limits on every pod (they must be equal)
- Right-size small jobs with MIG profiles instead of whole GPUs
7.2 Scheduling Optimization
- Separate training and inference with PriorityClasses
- Pin workloads to suitable GPU types with node affinity
- Move batch training into off-peak windows with CronJobs
7.3 Cost Control
- Run fault-tolerant training on Spot instances
- Monitor utilization with DCGM and act on the <30% / >90% thresholds
- Track per-team spend (e.g. with Kubecost)
8. Summary
8.1 Key Points
- GPU resource management: Device Plugin + MIG
- Elastic scaling: HPA + VPA + Cluster Autoscaler
- Multi-tenant isolation: Namespace + ResourceQuota + PriorityClass
- Cost optimization: Spot instances + off-peak scheduling + utilization monitoring
8.2 Recommended Tools

| Tool | Purpose |
|---|---|
| NVIDIA Device Plugin | GPU discovery |
| DCGM Exporter | GPU monitoring |
| Cluster Autoscaler | cluster scaling |
| Kubecost | cost monitoring |
Author: John
Created: 2026-03-10
Document version: v1.0