RohanMulay1/siren

GitHub: RohanMulay1/siren

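SIREN is an autonomous AI incident response engine that shortens mean time to recovery (MTTR) by automating the investigation and remediation of production incidents.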


# SIREN: Self-Improving Incident Response Engine

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![Claude Opus 4.7](https://img.shields.io/badge/Claude-Opus%204.7-orange.svg)](https://anthropic.com)
[![LangGraph](https://img.shields.io/badge/LangGraph-0.2+-green.svg)](https://github.com/langchain-ai/langgraph)
[![Qdrant](https://img.shields.io/badge/Qdrant-Vector%20DB-red.svg)](https://qdrant.tech)

**[Live Demo →](https://siren-ruby.vercel.app)** · **[Dashboard →](https://siren-ruby.vercel.app/dashboard)** · **[API Docs →](https://siren-api.onrender.com/docs)**

## What SIREN Does

When an alert fires, SIREN runs the full loop from investigation to resolution without waking anyone up:

| Step | What happens |
|------|--------------|
| **Ingest** | Normalizes webhooks from any source: Prometheus, CloudWatch, PagerDuty, custom |
| **Triage** | Claude Sonnet classifies severity (P1–P4), service, and confidence in < 2 seconds |
| **Recall** | Qdrant cosine search finds the 5 most similar past incidents |
| **Investigate** | Claude Opus runs a multi-step tool-use loop over logs, metrics, containers, and Git history |
| **Plan** | Opus ranks a list of remediation actions, each classified as READ / REVERSIBLE / DESTRUCTIVE |
| **Gate** | READ and REVERSIBLE actions execute automatically (REVERSIBLE only when investigation confidence is ≥ 85%); DESTRUCTIVE actions pause and send a Slack approval button |
| **Execute** | Approved actions run through Claude Haiku; results feed back into state |
| **Verify** | Sonnet checks live metrics after the fix and confirms resolution |
| **Learn** | Writes a post-mortem and embeds it into Qdrant, so the next similar incident is handled faster |

### The Self-Improving Loop

```
Incident #1  (cold):      9.2 min MTTR - 6 tool calls to find root cause
Incident #5  (5 in mem):  5.1 min MTTR - Qdrant surfaces 3 relevant past fixes
Incident #10 (10 in mem): 2.9 min MTTR - 92% match injects the playbook directly
```

Measured across 13 incidents, MTTR dropped by 68%. The dashboard shows the trend.

## Architecture

```
Alert Webhook
   │
   ▼
INGEST ──► TRIAGE ──── low confidence ──────────────────────► ESCALATE
             │
             ▼
  MEMORY RECALL (Qdrant)
             │
             ▼
        INVESTIGATE ◄─── loop until confidence ≥ 0.80 ─────────┐
        (Claude Opus)                                          │
             │                                                 │
             ▼                                                 │
        PLAN REMEDIATION                                       │
             │                                                 │
             ▼                                                 │
        GUARDRAIL GATE ──► READ / REVERSIBLE ──► EXECUTE ──────┘
             │                                      │
             ▼                                      │
        DESTRUCTIVE ──► SLACK APPROVAL              │
             │          (graph paused               │
             │           in Redis)                  │
             └──────────────────────────────────────┘
                                │
                                ▼
                             VERIFY
                                │
                                ▼
              WRITE POST-MORTEM + UPSERT TO QDRANT
```

### Multi-Model Routing

| Node | Model | Why |
|------|-------|-----|
| Triage | Claude Sonnet 4.6 | Fast structured JSON output, high concurrency |
| Memory recall | No LLM (Qdrant) | Deterministic, no hallucination risk |
| Investigate | Claude Opus 4.7 | Maximum reasoning capability for root cause analysis |
| Plan remediation | Claude Opus 4.7 | High-stakes ranked plan grounded in historical context |
| Execute | Claude Haiku 4.5 | Simple tool dispatch where speed matters most |
| Verify | Claude Sonnet 4.6 | Metric interpretation and resolution judgment |
| Post-mortem | Claude Sonnet 4.6 | Structured long-form writing at a good cost-to-quality ratio |

### Tech Stack

| Layer | Technology |
|-------|------------|
| Core LLM | Anthropic Claude: Opus 4.7 · Sonnet 4.6 · Haiku 4.5 |
| Workflow | LangGraph stateful graph + Redis checkpointer |
| Vector memory | Qdrant: cosine similarity, payload indexes |
| Embeddings | fastembed `all-MiniLM-L6-v2` (local ONNX, no API cost) |
| API | FastAPI + Uvicorn |
| Persistence | Redis (checkpoints) + PostgreSQL (audit trail) |
| Human gate | Slack SDK + Block Kit interactive buttons |
| Integrations | AWS CloudWatch · Prometheus · Docker SDK · GitHub API |
| Guardrails | Deterministic classifier + prompt injection detector + rate limiter |
| Observability | LangSmith (trace viewer) + OpenTelemetry |
| Frontend | Next.js 16 App Router (landing page + live dashboard) |
| Infrastructure | Docker Compose + Render (API) + Vercel (frontend) |

## Guardrail System

Every proposed action passes through a deterministic safety layer before it is passed to any LLM:

```
READ         → fetch_cloudwatch_logs, query_prometheus, inspect_docker_container,
               query_postgres_readonly, git_blame_file
               Auto-executed immediately, no risk

REVERSIBLE   → restart_docker_container, scale_service, toggle_feature_flag
               Auto-executed when investigation confidence ≥ 85%

DESTRUCTIVE  → flush_redis_cache, drain_lb_node, execute_db_migration
               Always pauses graph → sends Slack approval button → resumes on click

UNKNOWN      → any unregistered tool name
               Default DESTRUCTIVE (safe fail)
```

Additional layers:

- **Prompt injection detector**: regex scan of every tool output before it enters the LLM context
- **Rate limiter**: at most 3 DESTRUCTIVE actions per hour (Redis sliding window)
- **Argument-level escalation**: `scale_service(replicas=0)` is escalated to DESTRUCTIVE even though `scale_service` itself is REVERSIBLE

## Quick Start

### Prerequisites

- Docker Desktop
- Python 3.11+
- Anthropic API key ([console.anthropic.com](https://console.anthropic.com))
- Qdrant Cloud account (free tier, [cloud.qdrant.io](https://cloud.qdrant.io))
- Optional: Slack app, AWS IAM credentials, GitHub token

### 1. Clone and configure

```
git clone https://github.com/RohanMulay1/siren
cd siren
cp .env.example .env
```

Minimum `.env` needed to start:

```
ANTHROPIC_API_KEY=sk-ant-...
QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=...
DATABASE_URL=postgresql://siren:siren@localhost:5432/siren
REDIS_URL=redis://localhost:6379
```

### 2. Start infrastructure

```
docker compose up -d
# Starts: Redis, PostgreSQL, Qdrant (local), Prometheus, Grafana
```

### 3. Seed past incidents

```
pip install -e .
python scripts/seed_qdrant.py
# Loads 20 synthetic incidents so recall works from demo run #1
```

### 4. Start SIREN

```
uvicorn siren.main:app --reload
# http://localhost:8000
# API docs: http://localhost:8000/docs
```

### 5. Trigger the demo incident

```
python scripts/trigger_demo.py
```

Watch SIREN triage a Redis OOM as P1, recall past fixes from memory, investigate with Opus in 3 tool calls, auto-execute a container restart, send a Slack approval button for the cache flush, verify resolution via Prometheus, and write the post-mortem, all in under 4 minutes.

## Demo Scenario: Redis OOM

The canonical demo exercises every node, including the DESTRUCTIVE approval gate.

**What SIREN does:**

1. **Alert** → `payments-api` error rate at 40%, Redis OOM errors in the logs
2. **Triage** → P1, `payments-api`, confidence 0.94
3. **Recall** → Qdrant returns INC-20260418 (92% match): same OOM, resolved with FLUSHDB
4. **Investigate** (Claude Opus, 3 tool calls):
   - `query_prometheus` → error_rate=45.2%, p99=8400ms
   - `fetch_cloudwatch_logs` → "OOM command not allowed when used memory > maxmemory" ×847
   - `inspect_docker_container` → mem=99.8%, restart_count=3
5. **Plan** → `[restart_docker_container (REVERSIBLE), flush_redis_cache (DESTRUCTIVE)]`
6. **Gate** → the restart auto-executes; the flush sends a Slack approval button
7. **Slack** → engineer clicks APPROVE → graph resumes from the Redis checkpoint
8. **Verify** → `query_prometheus` → error_rate=0.2%, p99=118ms ✓
9. **Post-mortem** → written by Sonnet, embedded into Qdrant
10. **MTTR** → 2.4 minutes (vs. 9.2 minutes cold)

## API Reference

```
POST /webhook/alert
  Ingest an alert from any monitoring source
  Body: { source, alert_name, severity, service, description, labels }
  Returns: { incident_id, status }

POST /webhook/slack/action
  Handle Slack interactive button clicks (APPROVE / REJECT)
  Resumes the paused LangGraph checkpoint

GET /api/incidents
  List recent incidents with status and MTTR

GET /api/incidents/{incident_id}
  Full LangGraph state for one incident

GET /health
  { status, qdrant_incidents, environment }
```

**Example alert payload:**

```
{
  "source": "prometheus",
  "alert_name": "RedisOOMKiller",
  "severity": "critical",
  "service": "payments-api",
  "description": "OOM killer triggered 3x in 5 minutes. Auth token keys filling maxmemory.",
  "labels": { "env": "production", "region": "us-east-1" }
}
```

## Adding a New Tool

```
import redis.asyncio as aioredis

from ..registry import register_tool


@register_tool("READ")  # tier: READ, REVERSIBLE, or DESTRUCTIVE
class CheckRedisMemory:
    NAME = "check_redis_memory"
    DESCRIPTION = "Get Redis memory usage. Use during OOM investigations."
    INPUT_SCHEMA = {
        "type": "object",
        "properties": {"redis_url": {"type": "string"}},
        "required": ["redis_url"],
    }

    @staticmethod
    async def handler(redis_url: str) -> str:
        r = aioredis.from_url(redis_url)
        try:
            # INFO memory returns a dict of memory statistics.
            info = await r.info("memory")
        finally:
            # Close the connection even if the INFO call fails.
            await r.aclose()
        return f"used={info['used_memory_human']} peak={info['used_memory_peak_human']}"
```

1. Create `siren/tools/{tier}/{tool_name}.py` following the pattern above
2. Import it in `siren/tools/__init__.py`
3. The classifier picks up the tier from the decorator automatically

## Slack Setup

1. [api.slack.com/apps](https://api.slack.com/apps) → **Create New App → From scratch**
2. **OAuth & Permissions** → Bot Token Scopes → `chat:write`, `chat:write.public`
3. **Install to Workspace** → copy the bot token → `SLACK_BOT_TOKEN`
4. **Basic Information** → Signing Secret → `SLACK_SIGNING_SECRET`
5. Right-click your `#incidents` channel → View channel details → copy the channel ID → `SLACK_CHANNEL_ID`
6. Expose your local server: `cloudflared tunnel --url http://localhost:8000`
7. **Interactivity & Shortcuts** → Request URL: `https://your-tunnel/webhook/slack/action`

## Project Structure

```
siren/
├── siren/
│   ├── agent/
│   │   ├── graph.py              # LangGraph state machine: all nodes + edges
│   │   ├── state.py              # IncidentState TypedDict
│   │   ├── routing.py            # Conditional edge functions
│   │   └── nodes/                # One file per workflow node
│   ├── tools/
│   │   ├── registry.py           # @register_tool decorator + TOOL_REGISTRY
│   │   ├── read/                 # CloudWatch, Prometheus, DB, Docker, GitHub
│   │   ├── reversible/           # Container restart, scale service
│   │   └── destructive/          # Redis flush, ALB drain, DB migration
│   ├── guardrails/
│   │   ├── classifier.py         # Deterministic READ/REVERSIBLE/DESTRUCTIVE lookup
│   │   ├── injection_detector.py # Regex scan on every tool output
│   │   └── rate_limiter.py       # Redis sliding window counter
│   ├── memory/
│   │   ├── qdrant_client.py      # Collection setup + payload indexes
│   │   ├── embedder.py           # fastembed local ONNX inference
│   │   ├── incident_store.py     # recall_similar() + upsert_postmortem()
│   │   └── schemas.py            # IncidentVectorPayload
│   ├── integrations/slack/       # Block Kit messages + approval webhook handler
│   ├── api/routers/              # /webhook/alert, /webhook/slack/action, /incidents
│   ├── db/                       # Postgres models, session, action audit writer
│   └── observability/            # LangSmith + OpenTelemetry
├── frontend/                     # Next.js 16: landing page + ops dashboard
├── scripts/
│   ├── seed_qdrant.py            # Pre-load 20 synthetic incidents
│   └── trigger_demo.py           # Fire the Redis OOM scenario
├── tests/unit/
└── tests/integration/
```

## License

MIT. See [LICENSE](LICENSE). Contributions welcome.
