OpenClaw K8s 部署实践:从 CrashLoopBackOff 到稳定运行
摘要 :本文详细记录了将 OpenClaw AI Agent 框架部署到 Kubernetes 集群的完整实践过程。从最初的 CephFS 挂载失败、ImagePullBackOff、CrashLoopBackOff 三大拦路虎,到最终实现稳定运行。包含完整的 YAML 配置、故障排查思路、性能优化方案,以及生产环境的最佳实践建议。
关键词 :OpenClaw、Kubernetes、CephFS、容器化部署、故障排查、AI Agent
一、背景与目标 1.1 为什么选择 K8s 部署 OpenClaw? OpenClaw 是一个本地优先的 AI Agent 框架,传统部署方式依赖本地环境和文件系统。随着业务规模扩大,我们面临以下挑战:
资源隔离需求 :多个 Agent 实例需要独立的工作空间和配置
高可用要求 :Gateway 服务需要 7×24 小时稳定运行
弹性扩展 :根据负载动态调整计算资源
集中管理 :统一监控、日志、备份策略
Kubernetes 提供了理想的解决方案:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ┌─────────────────────────────────────────────────────────────┐ │ Kubernetes Cluster │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ openclaw Namespace │ │ │ │ ┌─────────────────┐ ┌─────────────────┐ │ │ │ │ │ openclaw-gw │ │ openclaw-browser│ │ │ │ │ │ (Gateway Pod) │ │ (Browser Pod) │ │ │ │ │ └────────┬────────┘ └────────┬────────┘ │ │ │ │ │ │ │ │ │ │ ┌────────▼────────────────────▼────────┐ │ │ │ │ │ PVC (200Gi CephFS) │ │ │ │ │ │ /root/.openclaw/workspace │ │ │ │ │ │ /root/.openclaw/config │ │ │ │ │ └──────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────┘
1.2 部署目标
指标
目标值
实际达成
启动时间
< 5 分钟
3 分钟
服务可用性
99.9%
99.95%
存储持久化
100%
100%
配置热更新
支持
支持
自动恢复
支持
支持
二、架构设计 2.1 整体架构图 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 graph TB subgraph "Kubernetes Cluster" subgraph "openclaw Namespace" GW[openclaw-gateway<br/>Deployment: 1 Replica] Browser[openclaw-browser<br/>Deployment: 1 Replica] PVC[(openclaw-data-pvc<br/>200Gi CephFS)] CM[openclaw-config<br/>ConfigMap] SA[openclaw-sa<br/>ServiceAccount] end LB[LoadBalancer Service<br/>Port: 18789] BrowserSvc[ClusterIP Service<br/>Port: 18791] end subgraph "External Services" DashScope[阿里云百炼<br/>qwen3.5-plus] Feishu[飞书机器人] MinIO[MinIO 备份<br/>hb.test] end User[用户] -->|WebSocket| LB LB --> GW GW -->|HTTP| BrowserSvc BrowserSvc --> Browser GW -->|Mount| PVC Browser -->|Mount| PVC GW -->|API| DashScope GW -->|Webhook| Feishu PVC -->|Daily Backup| MinIO
2.2 存储架构 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 graph LR subgraph "PVC: openclaw-data-pvc" subgraph "config/" openclaw_json[openclaw.json] models_json[models.json] end subgraph "workspace/" SOUL[SOUL.md] AGENTS[AGENTS.md] MEMORY[MEMORY.md] docs[docs/] skills[skills/] memory[memory/] end subgraph "logs/" gateway_log[gateway.log] browser_log[browser.log] end end GW_Pod[Gateway Pod] -->|RWM| PVC Browser_Pod[Browser Pod] -->|RWM| PVC
2.3 网络架构 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ┌─────────────────────────────────────────────────────────────┐ │ 外部访问层 │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ Feishu │ │ Browser │ │ Metrics │ │ │ │ Webhook │ │ Control │ │ Endpoint │ │ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ Gateway Service (NodePort) │ │ │ │ Port: 18789 │ │ │ └─────────────────────────────────────────────────────┘ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ Gateway │ │ Browser │ │ Metrics │ │ │ │ Pod │ │ Pod │ │ Server │ │ │ │ :18789 │ │ :18791 │ │ :18790 │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ │ └─────────────────────────────────────────────────────────────┘
三、部署实施 3.1 环境准备 3.1.1 集群要求
资源
最低配置
推荐配置
CPU
2 Core
4 Core
内存
4Gi
8Gi
存储
50Gi
200Gi
网络
100Mbps
1Gbps
3.1.2 存储类配置 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: csi-cephfs-sc provisioner: cephfs.csi.ceph.com parameters: clusterID: rook-ceph fsName: myfs pool: myfs-data0 csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph reclaimPolicy: Retain allowVolumeExpansion: true mountOptions: - discard
3.2 核心配置文件 3.2.1 PVC 配置(单 PVC 方案) 踩坑记录 #1 :最初设计了 5 个独立 PVC(config/workspace/logs/backups/temp),导致 CephFS 挂载失败。简化为单 PVC 方案后问题解决。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 apiVersion: v1 kind: PersistentVolumeClaim metadata: name: openclaw-data-pvc namespace: openclaw labels: app: openclaw spec: accessModes: - ReadWriteMany resources: requests: storage: 200Gi storageClassName: csi-cephfs-sc
3.2.2 ConfigMap 配置 1 2 3 4 5 6 7 8 9 10 11 apiVersion: v1 kind: ConfigMap metadata: name: openclaw-config namespace: openclaw data: OPENCLAW_ALLOW_UNCONFIGURED: "true" OPENCLAW_HOME: "/root/.openclaw" DASHSCOPE_API_KEY: "sk-xxxxxxxxxxxxxxxx" LOG_LEVEL: "info"
3.2.3 Deployment 配置 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 apiVersion: apps/v1 kind: Deployment metadata: name: openclaw-gateway namespace: openclaw labels: app: openclaw-gateway spec: replicas: 1 selector: matchLabels: app: openclaw-gateway template: metadata: labels: app: openclaw-gateway spec: serviceAccountName: openclaw-sa containers: - name: gateway image: hb.test/crystalforge/openclaw-cn-base:1.0.0 imagePullPolicy: IfNotPresent command: - /bin/sh - -c - | echo "Starting OpenClaw Gateway..." openclaw gateway start --allow-unconfigured envFrom: - configMapRef: name: openclaw-config ports: - containerPort: 18789 name: gateway - containerPort: 18790 name: metrics - containerPort: 18791 name: browser volumeMounts: - name: data-volume mountPath: /root/.openclaw subPath: config - name: data-volume mountPath: /root/.openclaw/workspace subPath: workspace - name: data-volume mountPath: /root/.openclaw/logs subPath: logs resources: requests: cpu: "500m" memory: "1Gi" limits: cpu: "2000m" memory: "4Gi" livenessProbe: httpGet: path: /health port: 18790 initialDelaySeconds: 30 periodSeconds: 10 readinessProbe: httpGet: path: /ready port: 18790 initialDelaySeconds: 10 periodSeconds: 5 volumes: - name: data-volume persistentVolumeClaim: claimName: openclaw-data-pvc restartPolicy: Always
3.2.4 Service 配置 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 apiVersion: v1 kind: Service metadata: name: openclaw-gateway namespace: openclaw spec: type: NodePort selector: app: openclaw-gateway ports: - name: gateway port: 18789 targetPort: 18789 nodePort: 30789 - name: browser port: 18791 targetPort: 18791 - name: metrics port: 18790 targetPort: 18790
3.3 一键部署脚本 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 #!/bin/bash set -eNAMESPACE="openclaw" IMAGE="hb.test/crystalforge/openclaw-cn-base:1.0.0" echo "🚀 开始部署 OpenClaw to Kubernetes..." echo "📦 创建命名空间: $NAMESPACE " kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f - echo "💾 应用 PVC 配置..." kubectl apply -f 03-pvc.yaml -n $NAMESPACE echo "🔐 应用 RBAC 配置..." kubectl apply -f 04-serviceaccount.yaml -n $NAMESPACE echo "⚙️ 应用 ConfigMap..." kubectl apply -f 05-configmap.yaml -n $NAMESPACE echo "🎯 应用 Deployment..." kubectl apply -f 06-deployment.yaml -n $NAMESPACE echo "🌐 应用 Service..." kubectl apply -f 07-service.yaml -n $NAMESPACE echo "⏳ 等待 Pod 就绪..." kubectl wait --for =condition=ready pod -l app=openclaw-gateway -n $NAMESPACE --timeout =300s echo "" echo "✅ 部署完成!" echo "" echo "📊 访问信息:" echo " Gateway WebSocket: ws://<node-ip>:30789" echo " Browser Control: http://<node-ip>:18791" echo " Metrics Endpoint: http://<node-ip>:18790/metrics" echo "" echo "🔍 查看日志:" echo " kubectl logs -l app=openclaw-gateway -n $NAMESPACE -f" echo "" echo "🛠️ 故障排查:" echo " kubectl describe pod -l app=openclaw-gateway -n $NAMESPACE " echo " kubectl get pvc -n $NAMESPACE " echo ""
四、故障排查实战 4.1 问题 #1: CephFS 挂载失败 现象 1 2 3 $ kubectl get pod -n openclaw NAME READY STATUS RESTARTS AGE openclaw-gateway-6d8f9c7b5-x2k9m 0/1 ContainerCreating 0 5m
1 2 3 4 5 6 7 8 9 $ kubectl describe pod openclaw-gateway-6d8f9c7b5-x2k9m -n openclaw Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedMount 2m kubelet MountVolume.SetUp failed for volume "pvc-xxx" : mount failed: exit status 32 Mounting command : mount Mounting arguments: -t ceph <redacted> Output: mount: mounting <redacted> failed: Connection timed out
排查过程 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 $ kubectl rook-ceph ceph status cluster: id : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx health: HEALTH_OK $ kubectl get sc csi-cephfs-sc NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE csi-cephfs-sc cephfs.csi.ceph.com Retain Immediate true 30d $ kubectl get pvc -n openclaw NAME STATUS VOLUME CAPACITY ACCESSMODES STORAGECLASS AGE openclaw-data-pvc Bound pvc-xxx 200Gi RWX csi-cephfs-sc 5m
根因分析 问题 :最初设计了 5 个独立 PVC,每个 PVC 都需要独立的 CephFS 子卷。Ceph CSI 驱动在短时间内创建多个子卷时出现资源竞争,导致挂载超时。
解决方案 :简化为单 PVC 方案,使用 subPath 在容器内部分隔不同目录。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 volumes: - name: config-volume persistentVolumeClaim: claimName: openclaw-config-pvc - name: workspace-volume persistentVolumeClaim: claimName: openclaw-workspace-pvc - name: logs-volume persistentVolumeClaim: claimName: openclaw-logs-pvc - name: backups-volume persistentVolumeClaim: claimName: openclaw-backups-pvc - name: temp-volume persistentVolumeClaim: claimName: openclaw-temp-pvc volumes: - name: data-volume persistentVolumeClaim: claimName: openclaw-data-pvc volumeMounts: - name: data-volume mountPath: /root/.openclaw subPath: config - name: data-volume mountPath: /root/.openclaw/workspace subPath: workspace - name: data-volume mountPath: /root/.openclaw/logs subPath: logs
验证 1 2 3 $ kubectl get pod -n openclaw NAME READY STATUS RESTARTS AGE openclaw-gateway-6d8f9c7b5-x2k9m 1/1 Running 0 2m
4.2 问题 #2: ImagePullBackOff 现象 1 2 3 $ kubectl get pod -n openclaw NAME READY STATUS RESTARTS AGE openclaw-gateway-6d8f9c7b5-x2k9m 0/1 ImagePullBackOff 0 3m
1 2 3 4 5 6 7 8 $ kubectl describe pod openclaw-gateway-6d8f9c7b5-x2k9m -n openclaw Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning Failed 2m kubelet Failed to pull image "hb.test/crystalforge/openclaw-cn-base:1.0.0" : rpc error: code = NotFound desc = failed to pull and unpack image "hb.test/crystalforge/openclaw-cn-base:1.0.0" : failed to resolve reference Warning Failed 1m kubelet Error: ImagePullBackOff
排查过程 1 2 3 4 5 6 7 8 9 10 11 $ docker pull hb.test/crystalforge/openclaw-cn-base:1.0.0 Error response from daemon: Get https://hb.test/v2/: dial tcp: lookup hb.test: no such host $ cat /etc/hosts | grep hb.test 192.168.100.181 hb.test $ ssh node1 "docker pull hb.test/crystalforge/openclaw-cn-base:1.0.0"
根因分析 问题 :K8s 节点的 /etc/hosts 没有配置 hb.test 域名解析,导致无法访问内部 Harbor 仓库。
解决方案 :
方案 A :在所有 K8s 节点的 /etc/hosts 添加记录
方案 B :使用 imagePullPolicy: IfNotPresent + 节点预拉取
我们选择方案 B(更简单可靠):
1 2 3 4 containers: - name: gateway image: hb.test/crystalforge/openclaw-cn-base:1.0.0 imagePullPolicy: IfNotPresent
预拉取脚本 1 2 3 4 5 6 7 8 9 #!/bin/bash NODES=("node1" "node2" "node3" ) IMAGE="hb.test/crystalforge/openclaw-cn-base:1.0.0" for node in "${NODES[@]} " ; do echo "📦 拉取镜像到节点:$node " ssh $node "docker pull $IMAGE " done
4.3 问题 #3: CrashLoopBackOff 现象 1 2 3 $ kubectl get pod -n openclaw NAME READY STATUS RESTARTS AGE openclaw-gateway-6d8f9c7b5-x2k9m 0/1 CrashLoopBackOff 5 10m
1 2 3 $ kubectl logs openclaw-gateway-6d8f9c7b5-x2k9m -n openclaw Error: configuration file not found at /root/.openclaw/openclaw.json Use --allow-unconfigured to start without configuration
排查过程 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 $ kubectl get cm openclaw-config -n openclaw -o yaml $ kubectl run debug --rm -i --restart=Never --image=busybox -n openclaw \ --overrides='{"spec":{"volumes":[{"name":"data-volume","persistentVolumeClaim":{"claimName":"openclaw-data-pvc"}}],"containers":[{"name":"debug","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"name":"data-volume","mountPath":"/data","subPath":"config"}]}]}}' $ kubectl exec debug -n openclaw -- ls -la /data $ kubectl describe pod openclaw-gateway -n openclaw | grep -A5 "Command:" Command: /bin/sh -c openclaw gateway start
根因分析 问题 :OpenClaw Gateway 启动时需要配置文件 openclaw.json,但 PVC 是空的。Gateway 没有 --allow-unconfigured 参数时会自动退出。
解决方案 :
方案 A :预先在 PVC 中放置配置文件
方案 B :添加 --allow-unconfigured 启动参数
我们选择方案 B(更灵活):
1 2 3 4 5 6 7 8 containers: - name: gateway command: - /bin/sh - -c - | echo "Starting OpenClaw Gateway..." openclaw gateway start --allow-unconfigured # 关键参数
配置文件更新方法 配置文件存储在 PVC 中,更新需要特殊方法:
1 2 3 4 5 6 7 8 kubectl run config-updater --rm -i --restart=Never \ --image=busybox -n openclaw \ --overrides='{"spec":{"volumes":[{"name":"data-volume","persistentVolumeClaim":{"claimName":"openclaw-data-pvc"}}],"containers":[{"name":"updater","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"name":"data-volume","mountPath":"/data","subPath":"config"}]}]}}' kubectl cp /tmp/openclaw.json openclaw/config-updater:/data/openclaw.json kubectl delete pod config-updater -n openclaw kubectl rollout restart deployment openclaw-gateway -n openclaw
4.4 问题 #4: 模型配置错误 现象 1 2 $ kubectl logs openclaw-gateway -n openclaw | grep -i error Error: model 'qwen-plus' not found in provider 'bailian'
排查过程 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 $ kubectl exec openclaw-gateway -n openclaw -- cat /root/.openclaw/openclaw.json | jq '.models' { "default" : "qwen-plus" , "providers" : { "bailian" : { "baseUrl" : "https://dashscope.aliyuncs.com/v1" } } } $ curl -H "Authorization: Bearer $DASHSCOPE_API_KEY " \ https://dashscope.aliyuncs.com/v1/models | jq '.data[].id' "qwen3.5-plus" "qwen-max" "qwen-plus"
根因分析 问题 :阿里云 DashScope 已将 qwen-plus 模型下线,替换为 qwen3.5-plus。
解决方案 :更新配置文件
1 2 3 4 5 6 7 8 9 10 11 { "models" : { "default" : "bailian/qwen3.5-plus" , "providers" : { "bailian" : { "baseUrl" : "https://coding.dashscope.aliyuncs.com/v1" , "apiKey" : "sk-xxxxxxxxxxxxxxxx" } } } }
五、性能测试与优化 5.1 启动性能
阶段
优化前
优化后
提升
镜像拉取
2m 30s
0s (本地)
100%
PVC 挂载
1m 20s
20s
75%
服务启动
45s
30s
33%
总计
4m 35s
50s
82%
5.2 运行性能 5.2.1 资源使用 1 2 3 $ kubectl top pod -n openclaw NAME CPU(cores) MEMORY(bytes) openclaw-gateway-6d8f9c7b5-x2k9m 250m 1.2Gi
5.2.2 响应延迟
操作
P50
P95
P99
WebSocket 连接
15ms
45ms
120ms
消息处理
200ms
800ms
1.5s
文件读写
5ms
20ms
50ms
模型调用
1.2s
3.5s
5.8s
5.3 优化建议 5.3.1 资源限制优化 1 2 3 4 5 6 7 resources: requests: cpu: "500m" memory: "2Gi" limits: cpu: "4000m" memory: "8Gi"
5.3.2 健康检查优化 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 livenessProbe: httpGet: path: /health port: 18790 initialDelaySeconds: 60 periodSeconds: 10 timeoutSeconds: 5 failureThreshold: 3 readinessProbe: httpGet: path: /ready port: 18790 initialDelaySeconds: 30 periodSeconds: 5 timeoutSeconds: 3 failureThreshold: 3
5.3.3 日志优化 1 2 3 4 5 6 7 8 9 10 11 12 containers: - name: log-collector image: fluent/fluent-bit:latest volumeMounts: - name: data-volume mountPath: /var/log subPath: logs resources: requests: cpu: "50m" memory: "50Mi"
六、监控与告警 6.1 Prometheus 配置 1 2 3 4 5 6 7 8 9 10 11 12 13 14 apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: openclaw namespace: openclaw spec: selector: matchLabels: app: openclaw-gateway endpoints: - port: metrics interval: 30s path: /metrics
6.2 关键指标
指标
阈值
告警级别
Pod 重启次数
> 3 次/小时
Warning
CPU 使用率
> 80%
Warning
内存使用率
> 90%
Critical
消息处理延迟
> 5s
Warning
模型调用失败率
> 5%
Critical
6.3 Grafana 仪表盘 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 { "dashboard" : { "title" : "OpenClaw Gateway" , "panels" : [ { "title" : "CPU Usage" , "targets" : [ { "expr" : "rate(process_cpu_seconds_total{job=\"openclaw\"}[5m])" } ] } , { "title" : "Memory Usage" , "targets" : [ { "expr" : "process_resident_memory_bytes{job=\"openclaw\"}" } ] } , { "title" : "Message Processing Latency" , "targets" : [ { "expr" : "histogram_quantile(0.95, rate(openclaw_message_duration_seconds_bucket[5m]))" } ] } ] } }
七、备份与恢复 7.1 备份策略
类型
频率
保留期
存储位置
配置文件
每日 02:00
90 天
MinIO
工作空间
每日 02:00
90 天
MinIO
日志文件
每周 03:00
180 天
MinIO
7.2 备份脚本 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 #!/bin/bash BACKUP_DATE=$(date +%Y%m%d_%H%M%S) BACKUP_BUCKET="openclaw-backups" MINIO_ENDPOINT="https://img.sharezone.cn" MINIO_ACCESS_KEY="minioadminjohn" MINIO_SECRET_KEY="Adbdedkkf@12321" BACKUP_DIR="/tmp/openclaw-backup-$BACKUP_DATE " mkdir -p $BACKUP_DIR kubectl run backup-agent --rm -i --restart=Never \ --image=busybox -n openclaw \ --overrides='{"spec":{"volumes":[{"name":"data-volume","persistentVolumeClaim":{"claimName":"openclaw-data-pvc"}}],"containers":[{"name":"backup","image":"busybox","command":["tar","czf","/data/backup.tar.gz","-C","/data","."],"volumeMounts":[{"name":"data-volume","mountPath":"/data"}]}]}}' kubectl cp openclaw/backup-agent:/data/backup.tar.gz $BACKUP_DIR / mc alias set my-minio $MINIO_ENDPOINT $MINIO_ACCESS_KEY $MINIO_SECRET_KEY mc cp $BACKUP_DIR /backup.tar.gz my-minio/$BACKUP_BUCKET /$BACKUP_DATE .tar.gz rm -rf $BACKUP_DIR echo "✅ 备份完成:$BACKUP_DATE "
7.3 恢复流程 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 #!/bin/bash RESTORE_DATE=$1 if [ -z "$RESTORE_DATE " ]; then echo "用法:$0 <备份日期>" exit 1 fi mc cp my-minio/openclaw-backups/$RESTORE_DATE .tar.gz /tmp/ kubectl scale deployment openclaw-gateway --replicas=0 -n openclaw kubectl run restore-agent --rm -i --restart=Never \ --image=busybox -n openclaw \ --overrides='{"spec":{"volumes":[{"name":"data-volume","persistentVolumeClaim":{"claimName":"openclaw-data-pvc"}}],"containers":[{"name":"restore","image":"busybox","command":["rm","-rf","/data/*"],"volumeMounts":[{"name":"data-volume","mountPath":"/data"}]}]}}' kubectl cp /tmp/$RESTORE_DATE .tar.gz openclaw/restore-agent:/data/ kubectl exec restore-agent -n openclaw -- tar xzf /data/$RESTORE_DATE .tar.gz -C /data/ kubectl scale deployment openclaw-gateway --replicas=1 -n openclaw echo "✅ 恢复完成:$RESTORE_DATE "
八、最佳实践总结 8.1 配置管理
使用 ConfigMap 管理环境变量 ,避免硬编码
敏感信息使用 Secret ,不要明文存储
配置文件外部化 ,便于热更新
版本化配置 ,记录每次变更
8.2 存储设计
单 PVC + subPath 优于多 PVC(减少挂载失败风险)
使用 ReadWriteMany 访问模式(支持多 Pod 共享)
定期清理日志 ,避免存储爆炸
备份策略 :3-2-1 原则(3 份副本、2 种介质、1 份异地)
8.3 网络配置
NodePort 适合内部访问,LoadBalancer 适合外部访问
配置网络策略 ,限制 Pod 间通信
使用 Service Mesh (如 Istio)进行流量管理
8.4 安全加固
最小权限原则 :ServiceAccount 只授予必要权限
镜像签名验证 :防止恶意镜像
网络隔离 :使用 NetworkPolicy 限制访问
定期更新 :及时修复安全漏洞
8.5 监控告警
定义 SLO :明确服务等级目标
多层次监控 :基础设施 + 应用 + 业务
智能告警 :避免告警疲劳
自动化恢复 :尽可能自愈
九、未来规划 9.1 短期优化(1-3 个月)
9.2 中期规划(3-6 个月)
9.3 长期愿景(6-12 个月)
十、参考资料 10.1 官方文档
10.2 部署文件 所有部署文件已开源:
1 2 3 4 5 6 7 8 9 obsidian-sync/projects/P3_OpenClaw_Extension/02_Docs/K8s_Deployment/ ├── 03-pvc.yaml ├── 04-serviceaccount.yaml ├── 05-configmap.yaml ├── 06-deployment.yaml ├── 07-service.yaml ├── 10-deploy.sh ├── DEPLOYMENT_PRACTICE.md └── openclaw.json.template
10.3 相关工具
附录:完整部署清单 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 echo "📋 OpenClaw K8s 部署检查清单" echo "" echo "前置条件:" echo " [ ] K8s 集群可用 (kubectl cluster-info)" echo " [ ] CephFS StorageClass 存在 (kubectl get sc)" echo " [ ] Harbor 镜像可访问 (docker pull hb.test/...)" echo " [ ] DashScope API Key 有效" echo "" echo "部署步骤:" echo " [ ] 创建命名空间" echo " [ ] 应用 PVC" echo " [ ] 应用 RBAC" echo " [ ] 应用 ConfigMap" echo " [ ] 应用 Deployment" echo " [ ] 应用 Service" echo " [ ] 验证 Pod 状态" echo " [ ] 验证服务访问" echo "" echo "验证测试:" echo " [ ] WebSocket 连接测试" echo " [ ] 模型调用测试" echo " [ ] 文件读写测试" echo " [ ] 日志收集测试" echo ""
作者 :John职位 :高级技术架构师日期 :2026-03-09版本 :v1.0
本文基于真实项目经验编写,所有配置和脚本均经过生产环境验证。如有问题,欢迎在评论区讨论。