KubeHeal/openshift-coordination-engine

GitHub: KubeHeal/openshift-coordination-engine

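A multi-layer remediation coordination engine for OpenShift/Kubernetes that automates the orchestration of cross-layer incident response and integrates ML anomaly detection.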

Stars: 0 | Forks: 2

# OpenShift Coordination Engine

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/53b2f8f6d7234034.svg)](https://github.com/KubeHeal/openshift-coordination-engine/actions) [![Container](https://quay.io/repository/takinosh/openshift-coordination-engine/status)](https://quay.io/repository/takinosh/openshift-coordination-engine) [![Go Report Card](https://goreportcard.com/badge/github.com/KubeHeal/openshift-coordination-engine)](https://goreportcard.com/report/github.com/KubeHeal/openshift-coordination-engine) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

A multi-layer remediation coordination engine for OpenShift/Kubernetes environments. It coordinates automated incident response across the infrastructure, platform, and application layers, and applies deployment-aware remediation strategies.

## Features

- **Multi-layer coordination**: Coordinates remediation across the infrastructure (nodes, MCO), platform (Operators, SDN), and application layers
- **Deployment awareness**: Detects the deployment method (ArgoCD, Helm, Operator, manual) and applies the appropriate remediation strategy
- **GitOps integration**: Respects ArgoCD workflows and treats Git as the source of truth
- **ML-enhanced**: Integrates with Python ML services for anomaly detection and predictive analytics
- **Production-ready**: Built-in health checks, metrics, RBAC, and graceful degradation

## Quick Start

### Prerequisites

- Go 1.21+
- Kubernetes 1.28+ or OpenShift 4.14+
- A configured kubectl/oc CLI

### Installation

#### Using the Container Image

```
# Pull the latest image
podman pull quay.io/takinosh/openshift-coordination-engine:latest

# Run with KServe integration (recommended - ADR-039)
podman run -d \
  -p 8080:8080 \
  -p 9090:9090 \
  -e ENABLE_KSERVE_INTEGRATION=true \
  -e KSERVE_NAMESPACE=self-healing-platform \
  -e KSERVE_ANOMALY_DETECTOR_SERVICE=anomaly-detector-predictor \
  quay.io/takinosh/openshift-coordination-engine:latest
```

#### Using Helm

```
# Add the Helm repository (if available)
helm repo add coordination-engine https://github.com/KubeHeal/openshift-coordination-engine

# Install with KServe integration (recommended)
helm install coordination-engine ./charts/coordination-engine \
  --namespace self-healing-platform \
  --create-namespace
```

#### Building from Source

```
# Clone the repository
git clone https://github.com/KubeHeal/openshift-coordination-engine.git
cd openshift-coordination-engine

# Build
make build

# Run
./bin/coordination-engine
```

## Multi-Version Support

The project supports multiple OpenShift versions through version-specific release branches and container images.

### Supported Versions

| OpenShift | Kubernetes | Image tag | Branch | Status |
|-----------|-----------|-----------|--------|--------|
| 4.18 | 1.31 | `ocp-4.18-latest` | `release-4.18` | ✅ Supported |
| 4.19 | 1.32 | `ocp-4.19-latest` | `release-4.19` | ✅ Supported |
| 4.20 | 1.33 | `ocp-4.20-latest` | `release-4.20` | ✅ Supported (current) |

**Support policy**: A rolling window of 3 versions is maintained. When OpenShift 4.21 is released, support for 4.18 will be dropped.

### Version Selection

#### Check the Cluster Version

```
oc version
# Server Version: 4.19.5
```

#### Pull a Version-Specific Image

```
# For OpenShift 4.18
podman pull quay.io/takinosh/openshift-coordination-engine:ocp-4.18-latest

# For OpenShift 4.19
podman pull quay.io/takinosh/openshift-coordination-engine:ocp-4.19-latest

# For OpenShift 4.20
podman pull quay.io/takinosh/openshift-coordination-engine:ocp-4.20-latest
```

#### Git SHA Tags

For reproducible deployments, use SHA-tagged images:

```
# Example: OpenShift 4.20 (specific commit)
podman pull quay.io/takinosh/openshift-coordination-engine:ocp-4.20-a1b2c3d
```

#### Deploying with Helm

```
# OpenShift 4.18
helm install coordination-engine ./charts/coordination-engine \
  --values ./charts/coordination-engine/values-ocp-4.18.yaml \
  --namespace self-healing-platform

# OpenShift 4.19
helm install coordination-engine ./charts/coordination-engine \
  --values ./charts/coordination-engine/values-ocp-4.19.yaml \
  --namespace self-healing-platform

# OpenShift 4.20 (or use the default values.yaml)
helm install coordination-engine ./charts/coordination-engine \
  --values ./charts/coordination-engine/values-ocp-4.20.yaml \
  --namespace self-healing-platform
```

Or override the image tag directly:

```
helm install coordination-engine ./charts/coordination-engine \
  --set image.tag=ocp-4.19-latest \
  --namespace self-healing-platform
```

**⚠️ Important**: Always make sure the container image version matches the OpenShift cluster version to avoid Kubernetes API compatibility problems.

### Development Branches

- **main**: Development branch, automatically synced to `release-4.20`
- **release-4.18**: Supports OpenShift 4.18 (client-go v0.31.x)
- **release-4.19**: Supports OpenShift 4.19 (client-go v0.32.x)
- **release-4.20**: Supports OpenShift 4.20 (client-go v0.33.x)

**Note**: Develop directly on `main`. Changes propagate automatically to `release-4.20` and are backported to older versions when needed. See [VERSION-STRATEGY.md](docs/VERSION-STRATEGY.md) for the detailed version strategy.

## Configuration

### Environment Variables

#### Core Configuration

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `PORT` | HTTP server port | 8080 | No |
| `METRICS_PORT` | Prometheus metrics port | 9090 | No |
| `LOG_LEVEL` | Log level | info | No |
| `NAMESPACE` | Kubernetes namespace | self-healing-platform | No |
| `ARGOCD_API_URL` | ArgoCD API endpoint | auto-detected | No |
| `KUBECONFIG` | Kubernetes config file | in-cluster config | No |

#### KServe Integration (ADR-039 - Recommended)

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `ENABLE_KSERVE_INTEGRATION` | Enable KServe integration | true | No |
| `KSERVE_NAMESPACE` | Namespace of the KServe inference services | self-healing-platform | Yes* |
| `KSERVE_ANOMALY_DETECTOR_SERVICE` | Anomaly detector service name | - | Yes* |
| `KSERVE_PREDICTIVE_ANALYTICS_SERVICE` | Predictive analytics service name | - | No |
| `KSERVE_TIMEOUT` | Timeout for KServe API calls | 10s | No |

\* Required when `ENABLE_KSERVE_INTEGRATION=true`

**Example KServe configuration:**

```
export ENABLE_KSERVE_INTEGRATION=true
export KSERVE_NAMESPACE=self-healing-platform
export KSERVE_ANOMALY_DETECTOR_SERVICE=anomaly-detector-predictor
export KSERVE_PREDICTIVE_ANALYTICS_SERVICE=predictive-analytics-predictor
```

#### Legacy ML Service (Deprecated)

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `ML_SERVICE_URL` | Python ML service endpoint (deprecated) | - | No* |

\* Required only when `ENABLE_KSERVE_INTEGRATION=false`

**⚠️ Note**: `ML_SERVICE_URL` is deprecated. Use the KServe integration (ADR-039) instead.

## Deployment Prerequisites

### KServe Model Dependencies

The coordination engine requires the KServe inference services to be deployed and healthy before it starts.

**1. Verify that the KServe inference services exist:**

```
kubectl get inferenceservice -n self-healing-platform

# Expected output:
# NAME                   URL   READY   PREV   LATEST   AGE
# anomaly-detector       ...   True           100      5m
# predictive-analytics   ...   True           100      5m
```

**2. Verify that the inference service Pods are running:**

```
kubectl get pods -n self-healing-platform -l serving.kserve.io/inferenceservice

# Expected: all Pods are Running with 2/2 containers ready
```

**3. Verify that the model files exist:**

- **PVC-based**: Make sure `model-storage-pvc` contains the model subdirectories, for example with `kubectl exec -n self-healing-platform deployment/model-uploader -- ls -la /models`
- **S3-based**: Verify that the S3 bucket contains the model artifacts

**4. Test the model endpoint manually:**

```
kubectl run -it --rm curl --image=curlimages/curl -n self-healing-platform -- \
  curl http://predictive-analytics-predictor.self-healing-platform.svc:8080/v1/models/model

# Expected: a JSON response containing the model metadata
```

**Troubleshooting 404 errors:**

If you see a "KServe model 'model' does not exist" error:

1. Check the inference service status: `kubectl describe inferenceservice -n self-healing-platform`
2. Check the Pod logs: `kubectl logs -n self-healing-platform -l serving.kserve.io/inferenceservice=`
3. Verify that the model files exist in the PVC or in S3
4. Make sure the coordination engine is deployed after the KServe models (ArgoCD sync wave 3)

### RBAC Setup

The coordination engine needs specific Kubernetes permissions. Apply the RBAC manifests:

```
kubectl apply -f charts/coordination-engine/templates/serviceaccount.yaml
kubectl apply -f charts/coordination-engine/templates/role.yaml
kubectl apply -f charts/coordination-engine/templates/rolebinding.yaml
```

See the [RBAC documentation](docs/RBAC.md) for a detailed description of the permissions.

## API Endpoints

### Health Checks

The coordination engine exposes two health endpoints, following Kubernetes best practices:

```
# Lightweight health check (Kubernetes standard)
curl http://localhost:8080/health

# Detailed health check (with dependency monitoring)
curl http://localhost:8080/api/v1/health
```

See the [API documentation](docs/API.md) for full endpoint details.

### Trigger a Remediation

```
curl -X POST http://localhost:8080/api/v1/remediation/trigger \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "production",
    "resource_type": "pod",
    "resource_name": "my-app-abc123",
    "issue_type": "CrashLoopBackOff",
    "severity": "high"
  }'
```

### List Incidents

```
# Quote the URL so the shell does not treat "&" as a background operator
curl "http://localhost:8080/api/v1/incidents?namespace=production&status=active"
```

### Get Workflow Status

```
curl http://localhost:8080/api/v1/workflows/wf-12345678
```

See the [API reference](docs/API.md) for the complete API.

## Architecture

The coordination engine consists of the following components:

- **Layer detector**: Identifies the affected layers (infrastructure, platform, application)
- **Deployment detector**: Determines how the application was deployed (ArgoCD, Helm, Operator, manual)
- **Multi-layer planner**: Creates an ordered remediation plan across layers
- **Strategy selector**: Routes to the appropriate remediator based on the deployment method
- **Remediators**: Perform deployment-specific remediation (ArgoCD sync, Helm rollback, and so on)
- **Health checker**: Verifies the system state of each layer after remediation

See the [architecture documentation](docs/adrs/README.md) for the detailed design decisions.

## Development

### Build

```
make build
```

### Test

```
# Unit tests
make test

# Integration tests
make test-integration

# E2E tests
make test-e2e

# Coverage
make coverage
```

### Linting

```
make lint
make fmt
```

## Deployment

### OpenShift

```
# Apply RBAC
oc apply -f charts/coordination-engine/templates/serviceaccount.yaml
oc apply -f charts/coordination-engine/templates/role.yaml
oc apply -f charts/coordination-engine/templates/rolebinding.yaml

# Deploy via Helm
helm install coordination-engine ./charts/coordination-engine \
  --set image.repository=quay.io/takinosh/openshift-coordination-engine \
  --set image.tag=latest \
  --set mlServiceUrl=http://aiops-ml-service:8080 \
  --namespace self-healing-platform
```

### Verification

```
# Check Pod status
kubectl get pods -n self-healing-platform

# Check the health endpoint (port-forward blocks, so run it in a separate terminal)
kubectl port-forward svc/coordination-engine 8080:8080 -n self-healing-platform
curl http://localhost:8080/health  # Lightweight check

# View logs
kubectl logs -f deployment/coordination-engine -n self-healing-platform

# Check metrics
curl http://localhost:9090/metrics
```

## Monitoring

The coordination engine exposes Prometheus metrics on port 9090:

- `coordination_engine_remediation_total` - Total remediation attempts
- `coordination_engine_remediation_duration_seconds` - Remediation duration
- `coordination_engine_argocd_sync_total` - ArgoCD sync operations
- `coordination_engine_ml_layer_detection_total` - ML-enhanced detections

See the [monitoring guide](docs/MONITORING.md) for the complete metrics reference.

## Troubleshooting

### Common Issues

**RBAC permission denied**

```
# Verify permissions
kubectl auth can-i get pods --as=system:serviceaccount:self-healing-platform:self-healing-operator
kubectl auth can-i patch deployments --as=system:serviceaccount:self-healing-platform:self-healing-operator
```

**ML service connection failures**

```
# Check ML service health
curl http://aiops-ml-service:8080/health

# Verify network connectivity
kubectl exec -it deployment/coordination-engine -- curl http://aiops-ml-service:8080/health
```

**ArgoCD integration not working**

```
# Check ArgoCD API access
oc get applications -n openshift-gitops

# Verify the ArgoCD URL configuration
kubectl get deployment coordination-engine -n self-healing-platform -o yaml | grep ARGOCD_API_URL
```

## Contributing

Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a pull request

## License

This project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.

## Documentation

- [Architecture Decision Records](docs/adrs/README.md)
- [API reference](docs/API.md)
- [RBAC configuration](docs/RBAC.md)
- [Development guide](CLAUDE.md)
- [Implementation status](docs/IMPLEMENTATION-PLAN.md)

## Support

- **Issues**: [GitHub Issues](https://github.com/KubeHeal/openshift-coordination-engine/issues)
- **Discussions**: [GitHub Discussions](https://github.com/KubeHeal/openshift-coordination-engine/discussions)

## Acknowledgements

Built with:

- [client-go](https://github.com/kubernetes/client-go) - Kubernetes Go client
- [Gorilla Mux](https://github.com/gorilla/mux) - HTTP routing
- [Logrus](https://github.com/sirupsen/logrus) - Structured logging
- [Prometheus](https://prometheus.io/) - Metrics and monitoring
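
## Appendix: Calling the Remediation API from Go

The remediation trigger request shown under the API endpoints can also be sent from a Go program. The following is a minimal sketch, not the project's own client: the endpoint path and JSON field names are taken from the curl example in this README, while the `RemediationRequest` type and the `buildPayload` and `triggerRemediation` functions are hypothetical names introduced here for illustration. The response schema is not described in this README, so the sketch only reports the HTTP status code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// RemediationRequest mirrors the JSON body of the curl example in this README.
// The type name is hypothetical; only the JSON field names come from the README.
type RemediationRequest struct {
	Namespace    string `json:"namespace"`
	ResourceType string `json:"resource_type"`
	ResourceName string `json:"resource_name"`
	IssueType    string `json:"issue_type"`
	Severity     string `json:"severity"`
}

// buildPayload encodes the request as JSON.
func buildPayload(req RemediationRequest) ([]byte, error) {
	return json.Marshal(req)
}

// triggerRemediation POSTs the request to <baseURL>/api/v1/remediation/trigger
// and returns the HTTP status code of the response.
func triggerRemediation(client *http.Client, baseURL string, req RemediationRequest) (int, error) {
	payload, err := buildPayload(req)
	if err != nil {
		return 0, fmt.Errorf("encode request: %w", err)
	}
	resp, err := client.Post(baseURL+"/api/v1/remediation/trigger", "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, fmt.Errorf("send request: %w", err)
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// Assumes the engine is reachable on localhost:8080, e.g. via kubectl port-forward.
	client := &http.Client{Timeout: 10 * time.Second}
	status, err := triggerRemediation(client, "http://localhost:8080", RemediationRequest{
		Namespace:    "production",
		ResourceType: "pod",
		ResourceName: "my-app-abc123",
		IssueType:    "CrashLoopBackOff",
		Severity:     "high",
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("HTTP status:", status)
}
```

Using a typed struct instead of a raw JSON string lets the compiler catch misspelled fields, and the explicit client timeout keeps a stalled engine from blocking the caller indefinitely.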
