nareshram5855/sre-ai-copilot

GitHub: nareshram5855/sre-ai-copilot

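An SRE incident response platform built on a local LLM and a RAG pipeline. It provides intelligent alert triage, on-call knowledge Q&A, automated RCA generation, and semi-automated runbook execution, and uses tiered LLM routing to balance cost and data security.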

Stars: 0 | Forks: 0

# SRE AI Copilot

## What it does

Five SRE capabilities, powered by a local LLM and a RAG pipeline over your own runbooks and incident history:

| Feature | Status | Description |
|---|---|---|
| Intelligent alert triage | ✅ Phase 1 | RAG retrieves similar past incidents → LLM classifies the alert as P1/P2/P3 with a confidence score |
| On-call knowledge assistant | ✅ Phase 2 | Natural-language Q&A over your runbooks, answering with the exact commands (ChatWidget) |
| Autonomous runbook executor | ✅ Phase 2 | LangGraph ReAct loop with kubectl tools and human approval gates |
| Automated RCA generator | ✅ Phase 2 | Collects the incident timeline → LLM generates a full RCA |
| Supervisor Agent | ✅ Phase 2 | Single `/api/v1/ask` entry point that classifies intent and routes to specialist agents |
| Incident analysis | ✅ Phase 2 | Live Prometheus/Loki telemetry analysis plus a synthetic demo stack |
| Voice input (optional) | ✅ Plugin | STT via Whisper; enable with `VOICE_VOICE_ENABLED=true` |
| Proactive anomaly predictor | 🔧 Phase 3 | Continuously monitors Prometheus metrics and opens tickets before an outage |

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       React Frontend                        │
│             Alert Panel  │  Chat Interface                  │
└──────────────────────────┬──────────────────────────────────┘
                           │ HTTP (localhost:5173 → :8080)
┌──────────────────────────▼──────────────────────────────────┐
│                       FastAPI Backend                       │
│  routers/          agents/             llm/                 │
│  ├─ health.py      ├─ base.py          ├─ router.py         │
│  ├─ triage.py      └─ triage_agent.py  └─ sanitizer.py      │
│  └─ knowledge.py                                            │
└──────┬───────────────────┬──────────────────────────────────┘
       │                   │
       ▼                   ▼
┌─────────────┐  ┌────────────────────────────────────────────┐
│ ChromaDB    │  │             LLM Routing Layer              │
│ (Minikube)  │  │                                            │
│             │  │  Complexity → Tier → Model                 │
│ incidents   │  │  ─────────────────────────────────         │
│ runbooks    │  │  LOW      LOCAL     Mistral 7B    ($0)     │
│ architecture│  │  MEDIUM   STANDARD  Gemini Flash  ($0.11)  │
└─────────────┘  │  HIGH     ADVANCED  Claude Haiku  ($2.40)  │
       ▲         │  CRITICAL PREMIUM   Claude Sonnet ($9)     │
       │         │                                            │
       │         │  • Data sanitized before any external call│
       │         │  • Every external call audit-logged        │
       │         │  • allow_external_llm=false by default     │
┌──────┴──────┐  └────────────────┬───────────────────────────┘
│ Embeddings  │                   │
│ nomic-      │           ┌───────▼──────────────────┐
│ embed-text  │           │  Ollama (native macOS)   │
│ (Ollama)    │           │  Mistral 7B Q4_K_M       │
└─────────────┘           │  Metal GPU acceleration  │
                          └──────────────────────────┘
```

## Tech stack

| Layer | Technology | Why |
|---|---|---|
| LLM runtime | Ollama (native macOS) | Metal GPU; 3× faster than Docker on Apple Silicon |
| Primary model | Mistral 7B Q4_K_M | Best quality-to-size ratio for structured JSON output |
| Embeddings | nomic-embed-text | 768 dimensions, reuses the Ollama process, about 200 ms per batch |
| LLM escalation | Gemini Flash / Claude Haiku / Claude Sonnet | Tiered cost; pay only for genuinely complex problems |
| RAG framework | LangChain LCEL | Composable chains with a clean separation of retrieval and generation |
| Vector store | ChromaDB | Runs in Minikube (same manifests → EKS in Phase 4) |
| Backend | FastAPI + Python 3.11 | Async, auto-generated OpenAPI docs, built-in Prometheus metrics |
| Frontend | React 18 + Vite + TailwindCSS | HMR, dark theme, minimal bundle |
| Infrastructure (local) | Minikube + kubectl | Manifests are the direct basis for the Phase 4 Helm charts |
| Infrastructure (production) | EKS + Terraform + ArgoCD + Helm | Phase 4 |

## Project structure

```
sre-ai-copilot/
├── backend/
│   ├── agents/
│   │   ├── base.py              ← BaseAgent: LLM routing, RAG, timing, error handling
│   │   └── triage_agent.py      ← Phase 1: alert severity classification
│   ├── llm/
│   │   ├── router.py            ← complexity scorer + LLM factory + audit logger
│   │   └── sanitizer.py         ← strips hostnames/IPs/credentials before external calls
│   ├── rag/
│   │   ├── embeddings.py        ← nomic-embed-text via Ollama
│   │   ├── ingestor.py          ← markdown → chunks → ChromaDB
│   │   └── retriever.py         ← similarity search with scores
│   ├── routers/
│   │   ├── _models.py           ← shared Pydantic schemas
│   │   ├── health.py
│   │   ├── triage.py
│   │   └── knowledge.py
│   ├── knowledge/
│   │   ├── incidents/           ← past incident markdown files (IAM/CISO domain)
│   │   ├── runbooks/            ← operational runbooks
│   │   └── architecture/        ← system architecture docs
│   ├── config.py                ← all settings + knowledge collection registry
│   ├── main.py                  ← FastAPI app + router registration
│   └── requirements.txt
├── frontend/
│   └── src/
│       ├── components/
│       │   ├── AlertPanel.jsx   ← alert submission form + example quick-loads
│       │   ├── TriageResult.jsx ← severity badge, confidence meter, suggested fix
│       │   ├── ChatWidget.jsx   ← Floating chat + optional voice input
│       │   ├── VoiceButton.jsx  ← Microphone (requires VOICE_VOICE_ENABLED=true)
│       │   └── Sidebar.jsx      ← navigation + backend health indicator
│       └── App.jsx
├── infrastructure/
│   ├── k8s/
│   │   └── chromadb.yaml        ← Deployment + Service + PVC (Helm-annotated)
│   ├── helm/                    ← Phase 4
│   ├── terraform/               ← Phase 4
│   └── argocd/                  ← Phase 4
├── monitoring/
│   ├── prometheus/              ← Phase 3
│   └── grafana/                 ← Phase 3
├── Dockerfile                   ← production image (non-root, healthcheck, 2 workers)
├── Makefile                     ← all dev commands
└── .env.example
```

## Quick start

### Prerequisites

- macOS on Apple Silicon (M1/M2/M3/M4)
- Python 3.11 (`brew install python@3.11`)
- Node.js 20+ (`brew install node`)
- Minikube (`brew install minikube`)
- Ollama (`brew install ollama`)

### One-time setup

```
git clone <repo-url> && cd sre-ai-copilot

# 1. Install Python deps (requires 3.11; PyO3 does not yet support 3.14)
make setup

# 2. Pull LLM models (about 4.4 GB total)
make ollama-setup

# 3. Start Ollama (Metal GPU, keep this terminal open)
ollama serve
```

### Daily development

```
# Terminal 1: infrastructure (ChromaDB in Minikube)
minikube start
make start-infra      # deploys ChromaDB + port-forwards :8000

# Terminal 2: backend (hot reload)
source venv/bin/activate
make start-backend    # FastAPI on :8080

# Terminal 3: ingest the knowledge base
make ingest           # loads incidents + runbooks into ChromaDB

# Terminal 4: frontend (HMR)
make start-frontend   # React on http://localhost:5173
```

### Verify everything works

```
make status
```

Expected output:

```
=== Backend (localhost:8080) ===
{"status": "healthy", "ollama_model": "mistral", "environment": "development"}

=== ChromaDB (localhost:8000) ===
healthy

=== ChromaDB collections ===
{"status": "connected", "collections": {"incidents": 24, "runbooks": 31, "architecture": 8}}

=== Ollama (localhost:11434) ===
models: ['mistral:latest', 'nomic-embed-text:latest']
```

### Test alert triage

```
curl -X POST http://localhost:8080/api/v1/triage \
  -H "Content-Type: application/json" \
  -d '{
    "name": "KubePodCrashLooping",
    "description": "ping-identity-auth pod restarting with OOMKilled exit code",
    "labels": {"namespace": "iam", "severity": "critical", "app": "ping-identity-auth"},
    "value": 8,
    "environment": "production"
  }' | python3 -m json.tool
```

## LLM routing

Problem complexity is scored automatically from the alert signals; no manual configuration is needed:

```
# Signals that raise the complexity score
rag_hit_count == 0           # Novel problem, no historical precedent (+4)
severity == "critical"       # AlertManager critical label (+3)
risk keywords present        # "breach", "data loss", "unknown", etc. (+2)
environment == "production"  # Higher stakes (+1)
```

| Score | Complexity | LLM tier | Model | Cost/month* |
|---|---|---|---|---|
| 0-1 | LOW | LOCAL | Mistral 7B | $0 |
| 2-3 | MEDIUM | STANDARD | Gemini 1.5 Flash | ~$0.11 |
| 4-6 | HIGH | ADVANCED | Claude Haiku 4.5 | ~$2.40 |
| 7+ | CRITICAL | PREMIUM | Claude Sonnet 4.6 | Pay as you go |

\* Based on 1,000 escalated alerts per month.

**Security:** External LLMs are disabled by default (`allow_external_llm=false`). When enabled, every payload is sanitized (hostnames, IPs, and credentials are redacted) and every call is recorded in the audit log.

## Adding a new agent (Phase 2+)

```
# 1. backend/agents/your_agent.py
class YourAgent(BaseAgent):
    def run(self, payload: dict) -> dict:
        # Assumption: use whichever payload field describes the problem as the query
        query = payload.get("description", "")
        docs = self._retrieve(query, collection="runbooks")
        llm, safe_payload, tier, complexity = self._route_llm(payload, len(docs))
        # ... domain logic
        return result

# 2. backend/routers/your_router.py
from fastapi import APIRouter

router = APIRouter(prefix="/api/v1", tags=["your-feature"])
_agent = YourAgent()

# YourPayload is a Pydantic schema you define for this endpoint
@router.post("/your-endpoint")
def your_endpoint(payload: YourPayload):
    result = _agent.execute(payload.model_dump())
    result.pop("_meta", None)
    return result

# 3. backend/main.py (one line)
app.include_router(your_router.router)
```

## Adding a new knowledge domain

```
# 1. Create a directory containing markdown files
mkdir backend/knowledge/slos/
echo "# SLO: Token Service\nTarget: p99 < 200ms..." > backend/knowledge/slos/token_service.md

# 2. Register it in config.py
knowledge_collections = {
    "incidents": "sre_incidents",
    "runbooks": "sre_runbooks",
    "slos": "sre_slos",   # ← add this line
}

# 3. Ingest
make ingest
```

## Roadmap

| Phase | Status | Key deliverables |
|---|---|---|
| **Phase 1: Local prototype** | ✅ Complete | Ollama + Mistral, RAG pipeline, ChromaDB, FastAPI, React, Minikube, LLM routing |
| **Phase 2: Intelligence** | ✅ Complete | ReAct agent, LangGraph executor, supervisor, conversation memory, runbook executor, RCA generator, knowledge chat, incident analysis, voice plugin |
| **Phase 3: Auto-learning** | 🔧 In progress | LearningAgent (`resolved_incidents` collection, resolve endpoint, auto-runbook promotion) |
| **Phase 4: Production** | ⬜ Planned | EKS + Terraform, Helm charts, ArgoCD GitOps, GitHub Actions CI/CD, Karpenter |

## Ports

| Service | Port | Runtime |
|---|---|---|
| Ollama | 11434 | Native macOS (Metal GPU) |
| ChromaDB | 8000 | Minikube (port-forwarded) |
| FastAPI backend | 8080 | Native Python 3.11 venv |
| React frontend | 5173 | Native Node (Vite) |
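
As a rough illustration of the scoring rules and score-to-tier table from the LLM routing section, the logic might look like the sketch below. This is not the code in `backend/llm/router.py`: the names `score_alert`, `route`, `Route`, `RISK_KEYWORDS`, and `TIERS` are hypothetical, the keyword list contains only the three examples the README names, and falling back to the LOCAL tier when `allow_external_llm` is false is an assumption about how the default behaves.

```python
from dataclasses import dataclass

# Hypothetical keyword list: the README names only these three examples plus "etc."
RISK_KEYWORDS = ("breach", "data loss", "unknown")

# (minimum score, complexity, tier) taken from the routing table, highest first.
TIERS = (
    (7, "CRITICAL", "PREMIUM"),
    (4, "HIGH", "ADVANCED"),
    (2, "MEDIUM", "STANDARD"),
    (0, "LOW", "LOCAL"),
)


@dataclass(frozen=True)
class Route:
    score: int
    complexity: str
    tier: str


def score_alert(alert: dict, rag_hit_count: int) -> int:
    """Sum the four signal weights listed in the LLM routing section."""
    score = 0
    if rag_hit_count == 0:  # novel problem, no historical precedent
        score += 4
    if alert.get("labels", {}).get("severity") == "critical":
        score += 3
    text = f"{alert.get('name', '')} {alert.get('description', '')}".lower()
    if any(keyword in text for keyword in RISK_KEYWORDS):
        score += 2
    if alert.get("environment") == "production":
        score += 1
    return score


def route(alert: dict, rag_hit_count: int, allow_external_llm: bool = False) -> Route:
    """Map the score to a complexity level and an LLM tier."""
    score = score_alert(alert, rag_hit_count)
    # The last entry has a minimum of 0, so the loop always finds a match.
    for minimum, complexity, tier in TIERS:
        if score >= minimum:
            break
    if not allow_external_llm:
        tier = "LOCAL"  # assumption: stay on the local model when external calls are off
    return Route(score, complexity, tier)


if __name__ == "__main__":
    alert = {
        "name": "KubePodCrashLooping",
        "description": "ping-identity-auth pod restarting with OOMKilled exit code",
        "labels": {"severity": "critical"},
        "environment": "production",
    }
    print(route(alert, rag_hit_count=3, allow_external_llm=True))
```

With the sample alert from the triage test and three RAG hits, only the critical label (+3) and the production environment (+1) fire, so the alert scores 4 and lands in the HIGH band.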

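Tags: AIOps, AI risk mitigation, DLL hijacking, RAG, SRE, bias filtering, large language models, automated operations, custom request headers, operations, reverse engineering tools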