absence77/ai-agents-production

GitHub: absence77/ai-agents-production

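An autonomous AI-agent incident response system for Kubernetes production environments. A four-stage pipeline provides fully automated, closed-loop handling from detection to remediation.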


# Autonomous AI Agents for Kubernetes: Production System

[![Medium](https://img.shields.io/badge/Medium-13--Part%20Series-black?logo=medium)](https://medium.com/@ahmadgayibov)
[![GitHub](https://img.shields.io/badge/GitHub-absence77-181717?logo=github)](https://github.com/absence77/ai-agents-production)
[![Anthropic](https://img.shields.io/badge/Powered%20by-Claude%20API-orange)](https://docs.anthropic.com)
[![Kubernetes](https://img.shields.io/badge/Kubernetes-v1.30.14-blue?logo=kubernetes)](https://kubernetes.io)
[![Python](https://img.shields.io/badge/Python-3.12-green?logo=python)](https://python.org)

## Pipeline Architecture

```
Kubernetes Event
      |
      v
+-------------+     +---------------+     +-----------+     +-------------+
| Agent-01    | --> | Agent-02      | --> | Agent-04  | --> | Agent-03    |
| Detector    |     | Researcher    |     | Judge     |     | Executor    |
|             |     |               |     | 🐙        |     |             |
| Classifies  |     | Root cause    |     | Evaluates |     | Applies fix |
| incident    |     | analysis +    |     | plan:     |     | Reports via |
| type &      |     | RAG memory    |     | Safety    |     | Telegram    |
| severity    |     | lookup        |     | Relevance |     |             |
|             |     |               |     | Risk      |     |             |
| $0.0004     |     | $0.002        |     | $0.001    |     | $0.0006     |
+-------------+     +---------------+     +-----------+     +-------------+
                                                |
                                    APPROVE  (>=70)  --> Executor
                                    ESCALATE (50-69) --> Human approval
                                    REJECT   (<50)   --> Pipeline blocked
```

**Total cost per incident: about $0.004 | Response time: 6 seconds**

## Business Value and ROI

| | Human SRE on-call | Commercial AIOps | This system |
|---|---|---|---|
| **Annual cost** | $144k–216k | $30k–150k | **$901** |
| **Response time** | 15–45 minutes | 5–15 minutes | **6 seconds** |
| **Cost per incident** | Included | Included | **$0.004** |
| **Quality at 3 a.m.** | Degraded | Stable | **Stable** |
| **Vendor lock-in** | None | High | **None** |
| **ROI relative to this pipeline** | −159x | −33x to −166x | **Baseline** |

## Quick Start

**Prerequisites:** Python 3.12, a configured kubectl, an Anthropic API key

**Step 1: Clone and install**

```
git clone https://github.com/absence77/ai-agents-production.git
cd ai-agents-production
pip install anthropic chromadb fastapi uvicorn
```

**Step 2: Set your API key**

```
export ANTHROPIC_API_KEY="your-key-here"
# or create agents/.env:
echo "ANTHROPIC_API_KEY=your-key-here" > agents/.env
```

**Step 3: Run the Judge Agent (safe demo)**

```
cd agents
python3 agent4_judge.py
# Expected output:
# TEST 1 (safe plan): APPROVE score: 88/100
# TEST 2 (delete namespace): REJECT score: 24/100 Safety: 0/100
```

**Step 4: Run the full pipeline**

```
python3 pipeline.py
# Deploys a crashing pod, waits for it to fail,
# runs all 4 agents, and reports the result via Telegram
```

## Agent Architecture

### Agent-01: Detector

Monitors Kubernetes events and classifies each incident by type and severity.

- Detects: `CrashLoopBackOff`, `OOMKilled`, `Pending`, `NotFound`, `ResourceExhaustion`
- Output: an `IncidentPackage` dataclass containing pod, namespace, severity, and context
- Model: `claude-haiku-4-5` · Cost: about $0.0004

### Agent-02: Researcher

Investigates the root cause using cluster state and RAG memory.

- Queries kubectl logs, events, and resource usage
- Retrieves the 3 most similar past incidents from ChromaDB
- Output: an `ActionPlan` containing fix_command and a confidence score
- Model: `claude-sonnet-4-6` · Cost: about $0.002

### Agent-04: Judge 🐙

Evaluates every action plan before it is executed: peer review at machine speed.

- Scoring dimensions: Safety · Relevance · Risk · Alternatives (0–100 each)
- APPROVE ≥70 · ESCALATE 50–69 · REJECT <50
- Verified case: blocked `kubectl delete namespace production` with Safety 0/100
- Model: `claude-haiku-4-5` · Cost: about $0.001

### Agent-03: Executor

Applies the approved fix and reports the result.

- Runs kubectl commands with timeouts and verification
- Circuit breaker: stops after 2 failed attempts
- Sends a resolution summary to Telegram, with an option to escalate to a human
- Model: `claude-haiku-4-5` · Cost: about $0.0006

## RAG Memory (ChromaDB)

Every resolved incident is stored as a semantic embedding and retrieved when a similar incident occurs later.

```
# Store an incident
incident_store.store_incident(package, plan, report)

# Retrieve similar incidents (done automatically in Agent-02)
similar = incident_store.search_similar(incident_description, n=3)
```

After 2 months in production: **22 incidents · 368 KB · 100% retrieval accuracy**

## Infrastructure

```
Internet
    |
    v
JumpServer (65.109.160.208) ──── kubectl ────> master-1 (37.27.41.55)
    |
AI Server (204.168.252.69)                     worker-1 (37.27.86.7)
    ├── webhook_v2.py (FastAPI :8080)          worker-2 (204.168.150.77)
    ├── ChromaDB (RAG memory)
    ├── multi_agent/ (Python agents)           namespace: production
    └── OpenClaw (3 Telegram bots)               ├── Prometheus
          ├── IELTS Tutor                        ├── Grafana 13.0.1
          ├── Web3 Mentor                        └── AlertManager
          └── LLM Engineer Mentor
```

**Hetzner Cloud, Helsinki eu-central · Kubernetes v1.30.14 · Calico CNI**

## Project Structure

```
ai-agents-production/
├── agents/
│   ├── agent1_detector.py      # Incident detection & classification
│   ├── agent2_researcher.py    # Root cause analysis + RAG lookup
│   ├── agent4_judge.py         # LLM-as-a-Judge safety evaluator
│   ├── agent3_executor.py      # Fix execution & Telegram reporting
│   ├── pipeline.py             # Orchestrator: Agent-1→2→4→3
│   ├── safety_guard.py         # CLI safety layer for kubectl
│   └── webhook_v2.py           # FastAPI webhook (Grafana → pipeline)
├── infra/
│   ├── manifests/              # Exported K8s manifests (10,573 lines)
│   ├── backup/                 # ChromaDB snapshots
│   ├── helm-monitoring-values.yaml
│   └── cluster-nodes.yaml
├── docs/
│   └── RECOVERY.md             # Disaster recovery playbook
├── logs/                       # Incident logs
├── .env.example                # Environment variables template
├── requirements.txt
├── LICENSE
└── README.md
```

## The Incident That Led to This System

During preparation for the CKA exam, a single command deleted the production namespace:

```
kubectl delete namespace production
# no confirmation, no backup, gone in 2 seconds
```

Recovery took 1 day. Data loss: zero. This system, and `safety_guard.py` and Agent-04 (Judge) in particular, exists to make sure this kind of incident does not happen again. Read the full story: [Part 11 on Medium](https://medium.com/@ahmadgayibov)

## The Complete Medium Series

| Part | Title | Key metric |
|---|---|---|
| 1–3 | Telegram AI Bot platform | 3 bots, separate workspaces |
| 4 | Cost optimization | $20/week → $7/week (3x reduction) |
| 5 | Messages API Agent | 4 agents built from scratch |
| 6 | Claude Managed Agent + K8s | 4 kubectl commands, $0.034 |
| 7 | Autonomous incident response | 6 seconds, $0.004 |
| 8 | RAG memory (ChromaDB) | 22 incidents, semantic search |
| 9 | Ollama vs. Claude benchmark | $0 vs. $0.004; quality wins |
| 10 | Multi-agent pipeline | Detector→Researcher→Executor |
| 11 | Production disaster recovery | 1 day, $0 data loss |
| 12 | LLM-as-a-Judge | Safety 0/100 blocked namespace deletion |
| 13 | Full ROI analysis | $200 total, 150x ROI |

**Read all 13 parts:** [medium.com/@ahmadgayibov](https://medium.com/@ahmadgayibov)

## Tech Stack

- **AI:** Anthropic Claude API (claude-sonnet-4-6, claude-haiku-4-5)
- **Memory:** ChromaDB v1.5.8 (RAG vector database)
- **Orchestration:** Kubernetes v1.30.14, Calico CNI
- **Monitoring:** Prometheus + Grafana 13.0.1 + AlertManager
- **Webhook:** FastAPI + systemd (149 requests, 100% uptime)
- **Bots:** OpenClaw gateway + Telegram Bot API
- **Infrastructure:** Hetzner Cloud (5 servers, Helsinki)
- **Language:** Python 3.12

## License

MIT License. See [LICENSE](LICENSE) for details.

*Built in Tashkent, Uzbekistan · Ahmad Gayibov · IT Architect & AI Systems Engineer*

*[medium.com/@ahmadgayibov](https://medium.com/@ahmadgayibov) · [github.com/absence77](https://github.com/absence77)*
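
The Judge thresholds listed for Agent-04 (APPROVE at 70 or above, ESCALATE for 50–69, REJECT below 50) can be sketched as a small routing function. This is a minimal illustration, not code from the repository: the `Verdict` enum and `verdict_for_score` are hypothetical names, and the sketch assumes the Judge has already produced a single overall score from 0 to 100. The README does not say how the four dimension scores are combined into that overall score, so that step is left out.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical verdict type; the three outcomes come from the README."""
    APPROVE = "APPROVE"    # plan is handed to the Executor
    ESCALATE = "ESCALATE"  # plan waits for human approval
    REJECT = "REJECT"      # pipeline is blocked


def verdict_for_score(score: int) -> Verdict:
    """Map an overall Judge score (0-100) to a verdict.

    Assumption: thresholds are APPROVE >= 70, ESCALATE 50-69, REJECT < 50,
    as stated in the README. The function name and signature are illustrative.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0..100, got {score}")
    if score >= 70:
        return Verdict.APPROVE
    if score >= 50:
        return Verdict.ESCALATE
    return Verdict.REJECT


if __name__ == "__main__":
    # 88 and 24 are the two demo scores quoted in the Quick Start section;
    # 69 and 50 exercise the boundaries of the ESCALATE band.
    for score in (88, 69, 50, 24):
        print(score, verdict_for_score(score).value)
```

Rejecting out-of-range scores up front is a deliberate choice in this sketch: a malformed model response then fails loudly instead of being routed as if it were a valid verdict.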
