gy12hui414bh521u/AIOps-Agent

GitHub: gy12hui414bh521u/AIOps-Agent

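An open-source AIOps incident response agent system built on LangGraph. Using a pluggable Runbook skill registry, it automates alert triage, root cause analysis, and remediation planning, substantially shortening incident response time while keeping human approval in place.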

Stars: 0 | Forks: 0

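# ⚡ IRAS: Incident Response Agent System

### Your on-call engineer gets paged at 3 AM.
### IRAS has already found the root cause, written the fix plan, and is waiting for your approval.
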
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-3776AB?style=flat-square&logo=python&logoColor=white)](https://python.org) [![FastAPI](https://img.shields.io/badge/FastAPI-0.115%2B-009688?style=flat-square&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com) [![LangGraph](https://img.shields.io/badge/LangGraph-0.2%2B-FF6B35?style=flat-square)](https://langchain-ai.github.io/langgraph/) [![Pydantic AI](https://img.shields.io/badge/Pydantic_AI-1.56%2B-E92063?style=flat-square)](https://ai.pydantic.dev) [![Claude](https://img.shields.io/badge/Claude-Sonnet_%7C_Haiku-blueviolet?style=flat-square&logo=anthropic&logoColor=white)](https://anthropic.com) [![License: MIT](https://img.shields.io/badge/license-MIT-blue?style=flat-square)](LICENSE) [![Docker](https://img.shields.io/badge/docker-ready-2496ED?style=flat-square&logo=docker&logoColor=white)](https://docker.com)

[**Quick Start**](#-quick-start) · [**How It Works**](#-how-it-works) · [**Runbook Skill Registry**](#-runbook-skill-registry-extension) · [**Architecture**](#-architecture) · [**Configuration**](#-configuration)

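## The problem

You have been here before. It's 3 AM. The PagerDuty alert fires. You stumble to your laptop, squint at graphs, dig through logs, cross-reference recent deployments, form a hypothesis, write a Slack message, wait for approval, apply the fix, and then spend an hour writing a post-mortem that nobody reads. Every single time.

**IRAS does all of this automatically, in under 2 minutes, and only wakes you when it needs you to click "Approve".**

## About this repository

This repository contains the **IRAS incident response agent**, an open-source AIOps system built on LangGraph, extended with a **Runbook Skill Registry** that dynamically selects incident-type-specific runbooks based on alert context.

| Layer | Description |
|---|---|
| **IRAS** (original) | A 9-node LangGraph state machine that uses Pydantic AI agents (Claude Haiku + Sonnet) for triage, context gathering, RCA, remediation planning, human approval, execution/escalation, and post-mortem generation. |
| **Runbook Skill Registry** (extension) | Pluggable `skills//SKILL.md` modules that guide evidence collection, RCA reasoning, remediation planning, and safety constraints per incident type. Selection is deterministic and keyword-based; no LLM is involved. |

**Attribution:** The original IRAS project provides the base LangGraph incident response workflow. The Runbook Skill Registry extension adds a skill layer on top of it. It does **not** rebuild or replace the core workflow, and it neither removes human approval nor authorizes autonomous remediation.

## Quick start

```bash
# Clone
git clone && cd

# Start Postgres
docker run -d --name iras-postgres \
  -e POSTGRES_USER=iras -e POSTGRES_PASSWORD=secret -e POSTGRES_DB=iras \
  -p 5432:5432 postgres:16

# Install
cd IRAS
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Configure (only two fields are required)
cp .env.example .env
# Set ANTHROPIC_API_KEY and POSTGRES_URL

# Run
python run.py
```

```
INFO IRAS graph compiled and ready
INFO Uvicorn running on http://0.0.0.0:8000
```

```bash
# Trigger an alert
curl -X POST http://localhost:8000/webhook/alert \
  -H "Content-Type: application/json" \
  -d '{
    "title": "High error rate on payment-service",
    "timestamp": "2026-05-03T10:30:00Z",
    "service": "payment-service",
    "error_rate": 0.45
  }'
# → {"incident_id": "550e8400-...", "status": "processing"}

# Watch it run in the terminal, then approve
curl -X POST http://localhost:8000/incidents/550e8400-.../approve
```

## How it works

IRAS runs a **9-node LangGraph state machine**. Every stage produces a typed Pydantic model: there are no raw strings, and you never have to parse prompt output yourself.

```
Alert → Triage → Context → RCA → Plan → [YOU] → Apply → Post-mortem
                    ↑        ↓
                    └── retry if confidence < 0.7
```

**Stage 1 - Ingestion**: Accepts any JSON webhook that contains `title` + `timestamp`, whether it comes from PagerDuty, Prometheus AlertManager, Datadog, Grafana, or plain `curl`. Extra fields are passed straight through to the AI.

**Stage 2 - Triage** *(Claude Haiku)*: Determines the P0–P3 severity, the affected services, the estimated blast radius, and a confidence score. Haiku is used here deliberately because it is fast and cheap.

**Stage 3 - Context gathering**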
*(Claude Haiku + tool calls)*: Three parallel tool calls:

- `fetch_logs` → error/warning lines from Elasticsearch or Loki
- `fetch_metrics` → current metrics from Prometheus, compared against the baseline from 7 days earlier
- `fetch_deployments` → recent GitHub Deployments for the affected service

**Stage 4 - Root cause analysis** *(Claude Sonnet)*: Produces a `RootCauseHypothesis` with `primary_cause`, `contributing_factors`, `evidence` (specific log lines), and a `confidence` score. Confidence gate: if the score is < 0.7, the graph loops back to context gathering with a wider evidence window. Once `RCA_MAX_ATTEMPTS` is reached, the incident is escalated automatically.

**Stage 5 - Remediation plan** *(Claude Sonnet)*: Ordered steps, each with a human-readable description, an exact rollback command, a risk level, and an estimated duration.

**Stage 6 - Human approval** *(you)*: LangGraph's `interrupt()` pauses the graph. State is checkpointed to PostgreSQL, so the server can restart and the incident state survives. You receive a Slack message with Approve / Reject buttons, or you can call the API directly.

**Stage 7 - Apply remediation**: Executes the steps in order. If a step fails, the completed steps are rolled back in reverse order using their stored `rollback_command`.

**Stage 8 - Post-mortem** *(Claude Sonnet)*: Contains the timeline, root cause, resolution, and action items. It is generated regardless of outcome (resolved or escalated), stored in PostgreSQL, and posted to Slack.

## We don't blindly trust the model

Most AI agent projects take the model's output at face value. IRAS does not.

**Safety invariants are enforced in code, not in prompts:**

```python
# The model cannot produce an unsafe plan that bypasses approval.
# These checks run regardless of what the model returns.
if any(step.risk_level == "high" for step in plan.steps):
    plan.requires_human_approval = True  # forced

if any(not step.rollback_command.strip() for step in plan.steps):
    plan.reversible = False  # forced
    plan.requires_human_approval = True  # forced
```

**A comprehensive test suite** includes adversarial scenarios designed specifically to exercise model misbehavior:

- The model lies about `risk_level` → caught by the safety override
- The model returns an empty `rollback_command` → the plan is blocked
- All context tools fail at once → graceful degradation instead of a crash
- 20 concurrent incidents → zero state contamination
- Unicode, XSS, and 10,000-character payloads → handled without errors
- Model confidence never reaches the threshold → automatic escalation to PagerDuty

```bash
pytest -q                                  # all tests
pytest tests/stress/ -v --no-cov           # adversarial scenarios
pytest --cov=src/iras --cov-report=html    # coverage report
```

## Runbook Skill Registry (extension)

Real incident response is never one-size-fits-all. An error-rate spike, a latency surge, database connection saturation, and a bad deployment each call for different evidence, tool priorities, and safety constraints. This extension adds a **pluggable Runbook Skill Registry** that dynamically selects an incident-type-specific runbook based on alert context and uses it to guide the whole incident response workflow.

### What is a skill

A skill is a structured runbook package for one incident type: a single `SKILL.md` file with YAML-style frontmatter:

```yaml
---
name: high-error-rate
description: Runbook skill for production services with elevated HTTP 5xx ...
triggers:
  - error rate
  - 5xx
  - http error
  - exception
required_evidence:
  - current error rate vs baseline
  - top error logs in affected time window
  - recent deployments for affected service
preferred_tools:
  - fetch_metrics
  - fetch_logs
  - fetch_deployments
approval_required_for:
  - production rollback
  - production restart
forbidden_actions:
  - modify database automatically
  - restart production service without approval
---
```

A skill only **guides**. It never replaces a node, skips a stage, or authorizes execution.

### Skill selection (no LLM)

The selector applies **deterministic keyword matching** to the alert's text fields (`title`, `description`, `service`, `metric`, `tags`, and so on):

| Alert contains | Selected skill |
|---|---|
| `5xx`, `error rate`, `http error`, `exception` | `high-error-rate` |
| `latency`, `p95`, `p99`, `timeout`, `slow` | `latency-spike` |
| `database`, `db`, `connection pool`, `too many connections` | `database-connection-saturation` |
| `deployment`, `release`, `rollback`, `new version` | `deployment-rollback` |

Unknown alerts fall back safely: no crash, no guessing, just the generic workflow with conservative remediation.

### Demo

```bash
python scripts/demo_skill_selection.py examples/skill_demo_alerts/high_error_rate.json
```

Output:

```
Selected Runbook Skill: high-error-rate
Description: Runbook skill for production services with elevated HTTP 5xx ...
Match score: 0.5
Matched triggers: error rate, 5xx, http error

Required evidence:
  - current error rate vs baseline
  - top error logs in affected time window
  ...

Requires human approval for:
  - production rollback
  - production restart
  - traffic shifting

Forbidden actions:
  - modify database automatically
  ...

These are recommendations only. The full incident workflow still runs every stage and the human approval gate is always enforced.
```

### Safety rules

A skill can **recommend** an action. A skill cannot **authorize** its direct execution.

| Rule | Enforced by |
|---|---|
| Production rollbacks require human approval | Skill field `approval_required_for` + LangGraph interrupt |
| Production restarts require human approval | Skill field `approval_required_for` + LangGraph interrupt |
| Database changes are forbidden by default | Skill field `forbidden_actions` |
| Unknown alerts fall back safely, with no aggressive remediation | Selector fallback + formatter safety block |
| Skill selection never bypasses approval | Graph topology is unchanged; the approval node always runs |
| Skill selection never bypasses safety checks | `RemediationPlan.requires_human_approval` is computed in code |

### Adding a new skill

Create a new directory under `skills/` and drop the file in:

```
skills/
├── high-error-rate/
│   └── SKILL.md
├── latency-spike/
│   └── SKILL.md
├── database-connection-saturation/
│   └── SKILL.md
├── deployment-rollback/
│   └── SKILL.md
└── your-new-incident-type/
    └── SKILL.md    # ← add this
```

The registry discovers it automatically. No code changes are needed.

## Architecture

### System overview

```mermaid
graph TB
    subgraph Sources["Alert Sources"]
        PD[PagerDuty]
        PROM[Prometheus AlertManager]
        DD[Datadog / Grafana]
        ANY[Any JSON Webhook]
    end

    subgraph API["FastAPI REST API"]
        WH["POST /webhook/alert"]
        APR["POST /incidents/{id}/approve"]
        REJ["POST /incidents/{id}/reject"]
        HLT["GET /health"]
    end

    subgraph Graph["LangGraph State Machine"]
        ING[Ingestion]
        TRI["Triage Agent\nClaude Haiku"]
        SKILL["Skill Selector\ndeterministic"]
        CTX["Context Gathering\n+ Skill Guidance"]
        RCA["RCA Agent\n+ Skill Guidance"]
        GEN["Generate Plan\n+ Skill Constraints"]
        APP["Approval\n⏸ interrupt"]
        REM[Apply Remediation]
        ESC[Escalation]
        PM["Post-mortem Agent\nClaude Sonnet"]
    end

    subgraph Integrations["External Integrations"]
        SL[Slack]
        PG2[PagerDuty]
        LOGS["Elasticsearch / Loki"]
        METRICS[Prometheus]
        DEPLOY[GitHub Deployments]
        DB[(PostgreSQL)]
    end

    Sources --> WH
    WH --> ING
    ING --> TRI --> SKILL --> CTX --> RCA
    RCA -->|"conf < 0.7, attempts < max"| CTX
    RCA -->|"conf >= 0.7"| GEN
    RCA -->|"attempts exhausted"| ESC
    GEN --> APP
    APP -->|approved| REM
    APP -->|rejected| ESC
    REM --> PM
    ESC --> PM
    PM --> DB
    PM --> SL
    APR --> APP
    REJ --> APP
    CTX --> LOGS & METRICS & DEPLOY
    ESC --> PG2 & SL
    GEN --> SL
    Graph --> DB
```

### Request lifecycle

```mermaid
sequenceDiagram
    actor Monitor as Monitoring System
    participant API as FastAPI
    participant Graph as LangGraph
    participant Claude as Claude (Anthropic)
    participant Tools as External Tools
    participant DB as PostgreSQL
    participant Slack as Slack
    actor Human as On-Call Engineer

    Monitor->>API: POST /webhook/alert
    API-->>Monitor: 202 {"incident_id": "abc123"}
    API->>Graph: ainvoke(state, thread_id="abc123")
    Graph->>Graph: ingestion - validate + init state
    Graph->>Claude: triage_agent
    Claude-->>Graph: TriageResult {severity: P1, confidence: 0.9}
    Graph->>Graph: skill_selector - deterministic keyword match
    Graph->>Graph: attach SkillContext to state
    Graph->>Claude: context_agent (tool-calling + skill guidance)
    Claude->>Tools: fetch_logs() + fetch_metrics() + fetch_deployments()
    Tools-->>Claude: raw evidence
    Claude-->>Graph: ContextBundle {logs, metrics, deployments}
    Graph->>Claude: rca_agent (with skill checklist)
    Claude-->>Graph: RootCauseHypothesis {confidence: 0.88} ✓
    Graph->>Claude: remediation_agent (with skill constraints)
    Claude-->>Graph: RemediationPlan {3 steps + rollbacks}
    Graph->>Slack: Post approval with [Approve] [Reject]
    Note over Graph,DB: interrupt() - graph pauses, state checkpointed to PostgreSQL
    Human->>API: POST /incidents/abc123/approve
    API->>Graph: Command(resume={"approved": True})
    Graph->>Graph: apply_remediation - execute all steps
    Graph->>Claude: postmortem_agent
    Claude-->>Graph: PostMortem {timeline, root_cause, action_items}
    Graph->>DB: INSERT INTO postmortems
    Graph->>Slack: Post post-mortem summary
```

### Project structure

```
IRAS/
├── src/iras/
│   ├── api/
│   │   ├── app.py                      # FastAPI lifespan: init checkpointer → build graph
│   │   ├── background.py               # Approval timeout monitor
│   │   └── routes/
│   │       ├── webhook.py              # POST /webhook/alert
│   │       └── approval.py             # POST /incidents/{id}/approve|reject
│   │
│   ├── graph/
│   │   ├── builder.py                  # Wire 9 nodes + conditional edges → compile
│   │   ├── checkpointer.py             # AsyncPostgresSaver (singleton + asyncio.Lock)
│   │   ├── state.py                    # IncidentState TypedDict
│   │   └── nodes/
│   │       ├── ingestion.py
│   │       ├── triage.py               # → Claude Haiku
│   │       ├── context_gathering.py    # → Claude Haiku + tool calls
│   │       ├── rca.py                  # → Claude Sonnet + retry routing
│   │       ├── generate_plan.py        # → Claude Sonnet + Slack notify
│   │       ├── approval.py             # interrupt(): durable human checkpoint
│   │       ├── apply_remediation.py    # Execute steps + rollback on failure
│   │       ├── escalation.py           # PagerDuty + Slack
│   │       └── postmortem.py           # → Claude Sonnet + persist to DB
│   │
│   ├── agents/                         # One Pydantic AI agent per stage
│   │   ├── triage.py                   # Claude Haiku: fast classification
│   │   ├── context_gathering.py        # Claude Haiku: tool-calling
│   │   ├── rca.py                      # Claude Sonnet: deep reasoning
│   │   ├── remediation.py              # Claude Sonnet: plan generation
│   │   └── postmortem.py               # Claude Sonnet: incident summary
│   │
│   ├── skills/                         # ★ Runbook Skill Registry (extension)
│   │   ├── models.py                   # RunbookSkill, SkillSelectionResult
│   │   ├── registry.py                 # Load + validate skills/*/SKILL.md
│   │   ├── selector.py                 # Deterministic alert-to-skill matching
│   │   └── formatter.py                # Render skill as context block for agents
│   │
│   ├── models/
│   │   └── incident.py                 # TriageResult · ContextBundle · RootCauseHypothesis
│   │                                   # RemediationPlan · RemediationStep · PostMortem
│   ├── tools/                          # Elasticsearch · Loki · Prometheus · GitHub · Slack · PagerDuty
│   └── config/settings.py              # Pydantic Settings: reads .env
│
├── skills/                             # ★ Pluggable runbook skill files
│   ├── high-error-rate/SKILL.md
│   ├── latency-spike/SKILL.md
│   ├── database-connection-saturation/SKILL.md
│   └── deployment-rollback/SKILL.md
│
├── examples/
│   └── skill_demo_alerts/              # Demo alert payloads for each skill
│       ├── high_error_rate.json
│       ├── latency_spike.json
│       ├── db_connection_saturation.json
│       └── deployment_rollback.json
│
├── scripts/
│   └── demo_skill_selection.py         # Standalone skill demo
│
├── tests/
│   ├── unit/                           # Fully mocked
│   │   ├── agents/                     # Agent output shape + error path tests
│   │   ├── test_skills_registry.py     # ★ Skill loading + validation
│   │   ├── test_skills_selector.py     # ★ Alert-to-skill matching
│   │   └── test_skills_formatter.py    # ★ Context block output
│   ├── integration/                    # Live service tests (opt-in)
│   │   └── nodes/                      # Per-node integration tests
│   ├── e2e/                            # Full graph runs with MemorySaver
│   └── stress/                         # Adversarial + real-world scenarios
│
├── docs/
│   ├── SPEC.md                         # Extension specification
│   ├── ROADMAP.md                      # Development roadmap
│   ├── IRAS_AUDIT.md                   # Phase 0 project audit
│   └── SKILL_REGISTRY_DESIGN.md        # Skill registry design rationale
│
├── docker-compose.yml                  # Postgres + IRAS API + dev profile
├── Dockerfile                          # Production image
├── Dockerfile.dev                      # Dev image (hot reload)
├── pyproject.toml                      # Hatchling build + deps + tool config
└── run.py                              # Development server launcher
```

## The interrupt pattern

The most technically interesting part of IRAS is how it handles the human-in-the-loop approval step.

LangGraph's `interrupt()` genuinely **pauses mid-execution**: it serializes the entire state to PostgreSQL and resumes from exactly the same point when the human responds, even across server restarts, deployments, or process crashes.

```python
# The graph pauses here. State is saved in Postgres.
# The server can restart. The incident is safe.
human_decision = interrupt({"message": "Approve remediation plan?"})

# Resumes here when POST /incidents/{id}/approve is called.
if human_decision["approved"]:
    return apply_remediation(state)
else:
    return escalate(state)
```

This is why IRAS is built on LangGraph rather than a simpler framework. Durable execution is essential for production incident response.

## Severity and escalation

| Severity | Meaning | Approval window |
|---|---|---|
| P0 | Complete outage | 15 minutes, then automatic escalation |
| P1 | Severe degradation | 2 hours |
| P2 | Partial degradation | 2 hours |
| P3 | Warning / informational | 2 hours |

Escalation is triggered when: RCA confidence is still below the threshold after the maximum number of attempts · a human rejects the plan · the approval times out.

On escalation, IRAS triggers an idempotent PagerDuty event and sends a structured Slack message with full context. The post-mortem runs whether or not the incident was resolved.

## Configuration

```bash
cp .env.example .env
```

| Variable | Required | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | ✅ | Claude API key (`sk-ant-...`) |
| `POSTGRES_URL` | ✅ | `postgresql://user:pass@host:5432/db` |
| `SLACK_BOT_TOKEN` | ⬜ | Falls back to a mock client if unset |
| `SLACK_ONCALL_CHANNEL_ID` | ⬜ | Slack channel for on-call alerts |
| `PAGERDUTY_INTEGRATION_KEY` | ⬜ | Falls back to a mock client if unset |
| `PROMETHEUS_BASE_URL` | ⬜ | Falls back to a mock client if unset |
| `ELASTICSEARCH_BASE_URL` | ⬜ | Log backend (choose one of the two) |
| `LOKI_BASE_URL` | ⬜ | Log backend (choose one of the two) |
| `LANGSMITH_API_KEY` | ⬜ | LangSmith graph tracing |
| `LOGFIRE_TOKEN` | ⬜ | Logfire agent tracing |
| `RCA_CONFIDENCE_THRESHOLD` | ⬜ | Default: `0.7` |
| `RCA_MAX_ATTEMPTS` | ⬜ | Default: `3` |
| `APPROVAL_TIMEOUT_P0_MINUTES` | ⬜ | Default: `15` |
| `APPROVAL_TIMEOUT_DEFAULT_MINUTES` | ⬜ | Default: `120` |

## API reference

### `POST /webhook/alert`

Accepts any JSON with `title` + `timestamp`. All extra fields are passed straight through to the AI agents.

```json
{ "title": "High error rate on payment-service", "timestamp": "2026-05-03T10:30:00Z" }
```

```json
{ "incident_id": "550e8400-...", "status": "processing" }
```

### `POST /incidents/{id}/approve`
### `POST /incidents/{id}/reject`

Resumes the paused graph. Approval routes to remediation. Rejection routes to PagerDuty escalation.

### `GET /health`

```json
{ "status": "ok", "env": "development" }
```

## Deployment

```yaml
# docker-compose.yml
services:
  api:
    build: .
    ports: ["8000:8000"]
    env_file: .env
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: iras
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: iras
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:
```

```bash
docker compose up -d
```

**Production checklist:**

- [ ] Add authentication to `/approve` and `/reject` (Slack request signing or OAuth)
- [ ] Set `APP_ENV=production`
- [ ] Configure real Slack + PagerDuty tokens
- [ ] Enable LangSmith + Logfire
- [ ] Add a reverse proxy with TLS (nginx / Caddy)
- [ ] Set up PgBouncer for Postgres connection pooling

## Observability

| Signal | Tool | What it covers |
|---|---|---|
| Graph tracing | LangSmith | Every node: inputs, outputs, timing, token usage |
| Agent tracing | Logfire | Every LLM call: prompt, response, tool calls, validation |
| Structured logs | Python `logging` | Every node emits `incident_id`, `node_name`, `timestamp` |
| Post-mortems | PostgreSQL | Full records queryable by severity, duration, and outcome |

## Extending IRAS

**Add a context tool**: implement a client with a `MockXClient` fallback in `src/iras/tools/` → add it to `ContextDeps` → register it with `@context_agent.tool`.

**Swap the model per agent**: each agent instantiates its own `pydantic_ai.Agent`.

```python
rca_agent = Agent(model="claude-opus-4-5", ...)      # higher accuracy
triage_agent = Agent(model="claude-haiku-3-5", ...)  # faster/cheaper
```

**Add a new incident type**: drop in a `skills//SKILL.md` with frontmatter → the registry discovers it automatically. No code changes are needed.

## Contributing

1. Fork and create a feature branch
2. Run `pytest`; all tests must pass
3. Keep test coverage high: `pytest --cov=src/iras --cov-fail-under=98`
4. Open a Pull Request

## License

MIT. See [LICENSE](LICENSE) for details.
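
## Appendix: skill selection sketch

The deterministic keyword matching described in the Runbook Skill Registry section can be illustrated with a short standalone sketch. It is not the code in `src/iras/skills/selector.py`. The trigger lists are copied from the selection table in this README, while the `SKILL_TRIGGERS` dict, the helper names, the whole-token matching, and the "most matched triggers wins" rule are assumptions made for illustration. The real selector also reports a match score, and its formula is not reproduced here.

```python
from __future__ import annotations

import re

# Trigger keywords copied from the skill selection table in this README.
# The dict layout is a hypothetical stand-in for parsed SKILL.md files.
SKILL_TRIGGERS: dict[str, tuple[str, ...]] = {
    "high-error-rate": ("5xx", "error rate", "http error", "exception"),
    "latency-spike": ("latency", "p95", "p99", "timeout", "slow"),
    "database-connection-saturation": (
        "database", "db", "connection pool", "too many connections",
    ),
    "deployment-rollback": ("deployment", "release", "rollback", "new version"),
}

# Alert fields scanned for keywords (assumed subset of the fields the README lists).
TEXT_FIELDS = ("title", "description", "service", "metric", "tags")


def alert_text(alert: dict) -> str:
    """Join the textual alert fields into one lowercase string."""
    parts: list[str] = []
    for key in TEXT_FIELDS:
        value = alert.get(key)
        if isinstance(value, (list, tuple)):
            parts.extend(str(item) for item in value)
        elif value is not None:
            parts.append(str(value))
    return " ".join(parts).lower()


def matches(trigger: str, text: str) -> bool:
    """Whole-token match, so "db" does not fire on "feedback"."""
    pattern = r"(?<![a-z0-9])" + re.escape(trigger) + r"(?![a-z0-9])"
    return re.search(pattern, text) is not None


def select_skill(alert: dict) -> tuple[str | None, list[str]]:
    """Return (skill name, matched triggers); (None, []) means generic fallback."""
    text = alert_text(alert)
    best_name: str | None = None
    best_hits: list[str] = []
    for name, triggers in SKILL_TRIGGERS.items():
        hits = [t for t in triggers if matches(t, text)]
        if len(hits) > len(best_hits):  # on a tie, the earlier skill is kept
            best_name, best_hits = name, hits
    return best_name, best_hits
```

Calling `select_skill({"title": "High error rate on payment-service", "service": "payment-service"})` returns `("high-error-rate", ["error rate"])`, and an alert with no known keyword returns `(None, [])`, which corresponds to the safe generic fallback. Because no model call is involved, the same alert always selects the same skill.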

Built with [LangGraph](https://langchain-ai.github.io/langgraph/) · [Pydantic AI](https://ai.pydantic.dev) · [FastAPI](https://fastapi.tiangolo.com) · [Claude](https://anthropic.com)

**If IRAS helped you handle a 3 AM incident, give it a ⭐**

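Tags: AIOps, LangGraph, LLM agents, incident response, root cause analysis, test cases, custom request headers, request interception, operations automation, reverse engineering tools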