nareshram5855/sre-ai-copilot
GitHub: nareshram5855/sre-ai-copilot
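An SRE incident-response platform built on a local LLM and a RAG pipeline. It provides intelligent alert triage, on-call knowledge Q&A, automatic RCA generation, and semi-automated runbook execution, and uses tiered LLM routing to balance cost against data security.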
# SRE AI Copilot
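## What it does
SRE capabilities powered by a local LLM and a RAG pipeline over your own runbooks and incident history:
| Feature | Status | Description |
|---|---|---|
| Intelligent alert triage | ✅ Phase 1 | RAG retrieves similar past incidents → the LLM classifies the alert as P1/P2/P3 with a confidence score |
| On-call knowledge assistant | ✅ Phase 2 | Natural-language Q&A over your runbooks that answers with the exact commands (ChatWidget) |
| Autonomous runbook executor | ✅ Phase 2 | LangGraph ReAct loop with kubectl tools and human-approval gates |
| Automatic RCA generator | ✅ Phase 2 | Collects the incident timeline → the LLM writes a full RCA |
| Supervisor Agent | ✅ Phase 2 | Single `/api/v1/ask` entry point that classifies intent and routes to the specialist agent |
| Incident analysis | ✅ Phase 2 | Live Prometheus/Loki telemetry analysis plus a synthetic demo stack |
| Voice input (optional) | ✅ Plugin | Speech-to-text (STT) via Whisper; enable with `VOICE_VOICE_ENABLED=true` |
| Proactive anomaly predictor | 🔧 Phase 3 | Continuously monitors Prometheus metrics and opens a ticket before an outage occurs |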
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                       React Frontend                        │
│               Alert Panel  │  Chat Interface                │
└──────────────────────────┬──────────────────────────────────┘
                           │ HTTP (localhost:5173 → :8080)
┌──────────────────────────▼──────────────────────────────────┐
│                       FastAPI Backend                       │
│  routers/          agents/             llm/                 │
│  ├─ health.py      ├─ base.py          ├─ router.py         │
│  ├─ triage.py      └─ triage_agent.py  └─ sanitizer.py      │
│  └─ knowledge.py                                            │
└──────┬───────────────────┬──────────────────────────────────┘
       │                   │
       ▼                   ▼
┌─────────────┐  ┌────────────────────────────────────────────┐
│  ChromaDB   │  │             LLM Routing Layer              │
│  (Minikube) │  │                                            │
│             │  │  Complexity → Tier → Model                 │
│  incidents  │  │  ─────────────────────────────────         │
│  runbooks   │  │  LOW       LOCAL     Mistral 7B     ($0)   │
│ architecture│  │  MEDIUM    STANDARD  Gemini Flash   ($0.11)│
└─────────────┘  │  HIGH      ADVANCED  Claude Haiku   ($2.40)│
       ▲         │  CRITICAL  PREMIUM   Claude Sonnet  ($9)   │
       │         │                                            │
       │         │  • Data sanitized before any external call │
       │         │  • Every external call audit-logged        │
       │         │  • allow_external_llm=false by default     │
┌──────┴──────┐  └────────────────┬───────────────────────────┘
│ Embeddings  │                   │
│ nomic-      │           ┌───────▼──────────────────┐
│ embed-text  │           │ Ollama (native macOS)    │
│ (Ollama)    │           │ Mistral 7B Q4_K_M        │
└─────────────┘           │ Metal GPU acceleration   │
                          └──────────────────────────┘
```
## Tech stack
| Layer | Technology | Why |
|---|---|---|
| LLM runtime | Ollama (native macOS) | Metal GPU: 3x faster than Docker on Apple Silicon |
| Primary model | Mistral 7B Q4_K_M | Best quality-to-size ratio for structured JSON output |
| Embeddings | nomic-embed-text | 768 dimensions, reuses the Ollama process, about 200 ms per batch |
| LLM escalation | Gemini Flash / Claude Haiku / Claude Sonnet | Tiered cost: pay only for genuinely complex problems |
| RAG framework | LangChain LCEL | Composable chains with a clean separation between retrieval and generation |
| Vector store | ChromaDB | Runs in Minikube (the same manifests move to EKS in Phase 4) |
| Backend | FastAPI + Python 3.11 | Async, auto-generated OpenAPI docs, built-in Prometheus metrics |
| Frontend | React 18 + Vite + TailwindCSS | HMR, dark theme, minimal bundle |
| Infrastructure (local) | Minikube + kubectl | The manifests are the direct basis for the Phase 4 Helm charts |
| Infrastructure (production) | EKS + Terraform + ArgoCD + Helm | Phase 4 |
## Project structure
```
sre-ai-copilot/
├── backend/
│   ├── agents/
│   │   ├── base.py              ← BaseAgent: LLM routing, RAG, timing, error handling
│   │   └── triage_agent.py      ← Phase 1: alert severity classification
│   ├── llm/
│   │   ├── router.py            ← complexity scorer + LLM factory + audit logger
│   │   └── sanitizer.py         ← strips hostnames/IPs/credentials before external calls
│   ├── rag/
│   │   ├── embeddings.py        ← nomic-embed-text via Ollama
│   │   ├── ingestor.py          ← markdown → chunks → ChromaDB
│   │   └── retriever.py         ← similarity search with scores
│   ├── routers/
│   │   ├── _models.py           ← shared Pydantic schemas
│   │   ├── health.py
│   │   ├── triage.py
│   │   └── knowledge.py
│   ├── knowledge/
│   │   ├── incidents/           ← past incident markdown files (IAM/CISO domain)
│   │   ├── runbooks/            ← operational runbooks
│   │   └── architecture/        ← system architecture docs
│   ├── config.py                ← all settings + knowledge collection registry
│   ├── main.py                  ← FastAPI app + router registration
│   └── requirements.txt
├── frontend/
│   └── src/
│       ├── components/
│       │   ├── AlertPanel.jsx   ← alert submission form + example quick-loads
│       │   ├── TriageResult.jsx ← severity badge, confidence meter, suggested fix
│       │   ├── ChatWidget.jsx   ← Floating chat + optional voice input
│       │   ├── VoiceButton.jsx  ← Microphone (requires VOICE_VOICE_ENABLED=true)
│       │   └── Sidebar.jsx      ← navigation + backend health indicator
│       └── App.jsx
├── infrastructure/
│   ├── k8s/
│   │   └── chromadb.yaml        ← Deployment + Service + PVC (Helm-annotated)
│   ├── helm/                    ← Phase 4
│   ├── terraform/               ← Phase 4
│   └── argocd/                  ← Phase 4
├── monitoring/
│   ├── prometheus/              ← Phase 3
│   └── grafana/                 ← Phase 3
├── Dockerfile                   ← production image (non-root, healthcheck, 2 workers)
├── Makefile                     ← all dev commands
└── .env.example
```
## Quick start
### Prerequisites
- macOS Apple Silicon (M1/M2/M3/M4)
- Python 3.11 (`brew install python@3.11`)
- Node.js 20+ (`brew install node`)
- Minikube (`brew install minikube`)
- Ollama (`brew install ollama`)
### One-time setup
```
git clone <repo-url> && cd sre-ai-copilot
# 1. Install Python deps (requires 3.11; PyO3 does not yet support 3.14)
make setup
# 2. Pull the LLM models (about 4.4 GB in total)
make ollama-setup
# 3. Start Ollama (Metal GPU; keep this terminal open)
ollama serve
```
### Daily development
```
# Terminal 1: infrastructure (ChromaDB in Minikube)
minikube start
make start-infra      # deploys ChromaDB + port-forwards :8000
# Terminal 2: backend (hot reload)
source venv/bin/activate
make start-backend    # FastAPI on :8080
# Terminal 3: ingest the knowledge base
make ingest           # loads incidents + runbooks into ChromaDB
# Terminal 4: frontend (HMR)
make start-frontend   # React on http://localhost:5173
```
### Verify everything works
```
make status
```
Expected output:
```
=== Backend (localhost:8080) ===
{"status": "healthy", "ollama_model": "mistral", "environment": "development"}
=== ChromaDB (localhost:8000) ===
healthy
=== ChromaDB collections ===
{"status": "connected", "collections": {"incidents": 24, "runbooks": 31, "architecture": 8}}
=== Ollama (localhost:11434) ===
models: ['mistral:latest', 'nomic-embed-text:latest']
```
### Test alert triage
```
curl -X POST http://localhost:8080/api/v1/triage \
  -H "Content-Type: application/json" \
  -d '{
    "name": "KubePodCrashLooping",
    "description": "ping-identity-auth pod restarting with OOMKilled exit code",
    "labels": {"namespace": "iam", "severity": "critical", "app": "ping-identity-auth"},
    "value": 8,
    "environment": "production"
  }' | python3 -m json.tool
```
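The same request can be sent from Python using only the standard library. This is a minimal sketch: the URL and payload are copied from the curl example, while the function names `build_triage_request` and `triage` are illustrative and not part of the project. The response schema is not documented here, so the sketch simply decodes whatever JSON comes back.

```python
import json
import urllib.request

# Taken from the curl example: the backend listens on localhost:8080.
TRIAGE_URL = "http://localhost:8080/api/v1/triage"

EXAMPLE_ALERT = {
    "name": "KubePodCrashLooping",
    "description": "ping-identity-auth pod restarting with OOMKilled exit code",
    "labels": {"namespace": "iam", "severity": "critical", "app": "ping-identity-auth"},
    "value": 8,
    "environment": "production",
}


def build_triage_request(alert: dict, url: str = TRIAGE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the alert."""
    body = json.dumps(alert).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def triage(alert: dict, timeout: float = 60.0) -> dict:
    """Send the alert to the backend and return the decoded JSON response."""
    request = build_triage_request(alert)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, `print(json.dumps(triage(EXAMPLE_ALERT), indent=2))` prints the triage result in the same pretty-printed form as the curl pipeline.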
## LLM routing
Problem complexity is scored automatically from signals in the alert, with no manual configuration:
```
# Signals that raise the complexity score
rag_hit_count == 0            # Novel problem, no historical precedent (+4)
severity == "critical"        # AlertManager critical label (+3)
risk keywords present         # "breach", "data loss", "unknown", etc. (+2)
environment == "production"   # Higher stakes (+1)
```
| Score | Complexity | LLM tier | Model | Cost/month* |
|---|---|---|---|---|
| 0-1 | LOW | LOCAL | Mistral 7B | $0 |
| 2-3 | MEDIUM | STANDARD | Gemini 1.5 Flash | ~$0.11 |
| 4-6 | HIGH | ADVANCED | Claude Haiku 4.5 | ~$2.40 |
| 7+ | CRITICAL | PREMIUM | Claude Sonnet 4.6 | Pay-as-you-go |
\* Based on 1,000 escalated alerts per month
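The scoring signals and the tier table can be sketched as two small pure functions. This is a hedged illustration, not the project's `router.py`: the `Alert` dataclass, the function names, and the exact keyword list are hypothetical. It also assumes that the signal weights are simply summed, and that every request stays on the LOCAL tier while `allow_external_llm` is false.

```python
from dataclasses import dataclass, field

# Hypothetical list: the README names only these three keywords, followed by "etc."
RISK_KEYWORDS = ("breach", "data loss", "unknown")

# (minimum score, complexity, tier), checked from the highest threshold down.
TIERS = (
    (7, "CRITICAL", "PREMIUM"),
    (4, "HIGH", "ADVANCED"),
    (2, "MEDIUM", "STANDARD"),
    (0, "LOW", "LOCAL"),
)


@dataclass
class Alert:
    description: str
    environment: str = "development"
    labels: dict = field(default_factory=dict)


def complexity_score(alert: Alert, rag_hit_count: int) -> int:
    """Sum the weights of every signal that fires (assumed to be additive)."""
    score = 0
    if rag_hit_count == 0:  # novel problem, no historical precedent
        score += 4
    if alert.labels.get("severity") == "critical":
        score += 3
    description = alert.description.lower()
    if any(keyword in description for keyword in RISK_KEYWORDS):
        score += 2
    if alert.environment == "production":
        score += 1
    return score


def select_tier(score: int, allow_external_llm: bool = False) -> tuple[str, str]:
    """Map a score to (complexity, tier) using the thresholds in the table."""
    for minimum, complexity, tier in TIERS:
        if score >= minimum:
            break
    if not allow_external_llm:
        tier = "LOCAL"  # assumption: external tiers are never used when disabled
    return complexity, tier
```

Under these assumptions, a critical production alert with three RAG hits and no risk keywords scores 3 + 1 = 4, which falls in the HIGH band; the same alert with no RAG hits scores 8 and falls in the CRITICAL band.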
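**Security:** External LLMs are disabled by default (`allow_external_llm=false`). When they are enabled, every payload is sanitized (hostnames, IPs, and credentials are redacted) and every call is recorded in the audit log.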
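To make the redaction step concrete, here is a minimal regex-based sketch of what such a sanitizer might do. It is not the project's `sanitizer.py`: the patterns, the placeholder tokens (`[HOST]`, `[IP]`, `[REDACTED]`), and the function name `sanitize` are all assumptions, and the patterns cover only IPv4 addresses, dotted hostnames, and simple `key=value` secrets.

```python
import re

# Applied in order: credentials first, then IPv4 addresses, then hostnames.
_PATTERNS = (
    # key=value or key: value pairs whose key looks like a secret name
    (
        re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)\b(\s*[=:]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
    # dotted-quad IPv4 addresses (octet ranges are not validated)
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    # dotted names ending in an alphabetic label, e.g. auth-01.iam.example.com
    (re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b"), "[HOST]"),
)


def sanitize(text: str) -> str:
    """Return text with credentials, IPv4 addresses, and hostnames replaced."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

The hostname pattern is deliberately broad, so it also redacts other dotted names such as `health.py`. For text that leaves the machine, over-redaction is usually the safer failure mode, but a production sanitizer would need a more careful pattern set.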
## Adding a new agent (Phase 2+)
```
# 1. backend/agents/your_agent.py
class YourAgent(BaseAgent):
    def run(self, payload: dict) -> dict:
        query = payload["description"]  # illustrative: use whichever field holds the search text
        docs = self._retrieve(query, collection="runbooks")
        llm, safe_payload, tier, complexity = self._route_llm(payload, len(docs))
        # ... domain logic that builds `result`
        return result

# 2. backend/routers/your_router.py
from fastapi import APIRouter

router = APIRouter(prefix="/api/v1", tags=["your-feature"])
_agent = YourAgent()

@router.post("/your-endpoint")
def your_endpoint(payload: YourPayload):
    result = _agent.execute(payload.model_dump())
    result.pop("_meta", None)  # strip the `_meta` key before returning
    return result

# 3. backend/main.py: one line
app.include_router(your_router.router)
```
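The template defines `run()` on the agent but calls `execute()` from the router, and then removes a `_meta` key from the result. One plausible reading, shown here purely as an assumption, is that the base class wraps `run()` with timing and error handling and attaches that bookkeeping under `_meta`. `SketchBaseAgent` and `EchoAgent` are hypothetical names and the `_meta` fields are invented for illustration; the real `BaseAgent` in `backend/agents/base.py` may differ.

```python
import time
from abc import ABC, abstractmethod


class SketchBaseAgent(ABC):
    """Hypothetical stand-in for the project's BaseAgent, not its actual code."""

    @abstractmethod
    def run(self, payload: dict) -> dict:
        """Domain logic, implemented by each concrete agent."""

    def execute(self, payload: dict) -> dict:
        """Call run(), time it, and turn exceptions into an error result."""
        started = time.perf_counter()
        try:
            result = dict(self.run(payload))
            status = "ok"
        except Exception as exc:  # broad on purpose: callers always get a dict
            result = {"error": str(exc)}
            status = "error"
        result["_meta"] = {
            "agent": type(self).__name__,
            "status": status,
            "duration_ms": round((time.perf_counter() - started) * 1000, 2),
        }
        return result


class EchoAgent(SketchBaseAgent):
    """Trivial agent used only to exercise execute()."""

    def run(self, payload: dict) -> dict:
        return {"echo": payload["message"]}
```

With this shape, the router's `result.pop("_meta", None)` keeps timing and status details out of the API response while still leaving them available to logging or metrics code that calls `execute()` directly.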
## Adding a new knowledge domain
```
# 1. Create a directory containing markdown files
mkdir -p backend/knowledge/slos/
printf '# SLO: Token Service\nTarget: p99 < 200ms...\n' > backend/knowledge/slos/token_service.md
# 2. Register it in config.py
knowledge_collections = {
    "incidents": "sre_incidents",
    "runbooks": "sre_runbooks",
    "slos": "sre_slos",  # ← add this line
}
# 3. Ingest
make ingest
```
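The ingest step turns each markdown file into chunks before they are stored in ChromaDB. A minimal sketch of one way to do that chunking is below. It assumes chunks are cut at markdown headings and capped at a fixed character length; the function name `chunk_markdown`, the `max_chars` parameter, and the chunk dictionary layout are illustrative, and the project's `ingestor.py` may chunk differently.

```python
import re


def chunk_markdown(text: str, max_chars: int = 800) -> list[dict]:
    """Split markdown at headings, then cap each chunk at max_chars characters."""
    chunks: list[dict] = []
    heading = ""            # heading that the buffered lines belong to
    buffer: list[str] = []  # body lines collected since the last heading

    def flush() -> None:
        body = "\n".join(buffer).strip()
        buffer.clear()
        # An empty body produces no chunk; a long body is sliced into pieces.
        for start in range(0, len(body), max_chars):
            chunks.append({"heading": heading, "text": body[start:start + max_chars]})

    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()  # close the section under the previous heading
            heading = match.group(1).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Keeping the heading alongside each chunk gives the retriever a little context about where the text came from, which tends to help when a runbook section is short.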
## Roadmap
| Phase | Status | Key deliverables |
|---|---|---|
| **Phase 1: Local prototype** | ✅ Complete | Ollama + Mistral, RAG pipeline, ChromaDB, FastAPI, React, Minikube, LLM routing |
| **Phase 2: Intelligence** | ✅ Complete | ReAct agent, LangGraph executor, supervisor, conversation memory, runbook executor, RCA generator, knowledge chat, incident analysis, voice plugin |
| **Phase 3: Auto-learning** | 🔧 In progress | LearningAgent (`resolved_incidents` collection, resolve endpoint, auto-runbook promotion) |
| **Phase 4: Production** | ⬜ Planned | EKS + Terraform, Helm charts, ArgoCD GitOps, GitHub Actions CI/CD, Karpenter |
## Ports
| Service | Port | Runtime |
|---|---|---|
| Ollama | 11434 | Native macOS (Metal GPU) |
| ChromaDB | 8000 | Minikube (port-forwarded) |
| FastAPI backend | 8080 | Native Python 3.11 venv |
| React frontend | 5173 | Native Node (Vite) |