MrEinsteinE/cloud-incident-response-openenv

GitHub: MrEinsteinE/cloud-incident-response-openenv

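A cloud SRE incident-response simulation environment built on the OpenEnv specification. It trains and evaluates the reasoning and decision-making of AI agents that diagnose and remediate distributed microservice failures, across 9 scenarios of varying difficulty.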


---
title: Cloud Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - sre
  - cloud
  - incident-response
  - devops
  - real-world
  - agentic
---

# ☁️ Cloud Incident Response: OpenEnv Environment

An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response**, the real on-call workflow that engineers at cloud companies carry out every day.

Unlike Kubernetes operations environments, this environment focuses on **cross-service cascading failures** in distributed microservice architectures: OOM kills caused by runaway analytics queries, BGP network partitions that isolate availability zones, and credential rotation bugs that push stale secrets to production services.

## OpenEnv Interface

This environment implements the **full OpenEnv specification** using typed Pydantic models:

| Method | Endpoint | Input | Returns |
|---|---|---|---|
| `POST` | `/reset` | `{"task_id": "...", "scenario_index": 0}` or `{}` | `Observation` |
| `POST` | `/step` | `Action` JSON body | `{observation, reward, done, info}` |
| `GET` | `/state` | - | `EpisodeState` |
| `GET` | `/health` | - | `{"status": "ok"}` |
| `GET` | `/tasks` | - | Task list + action schema |
| `GET` | `/grader` | - | Score 0.0–1.0 with detailed breakdown |
| `POST` | `/baseline` | - | Runs inference.py, returns scores |

### Typed Models

```
# Action: submitted by the agent
Action {
  action_type: str,          # e.g. "query_logs", "restart_service", "submit_severity"
  parameters: {
    service?: str,           # Target service name
    severity?: str,          # P1|P2|P3|P4 (for submit_severity)
    failure_mode?: str,      # Root cause description (for submit_root_cause)
    summary?: str,           # Resolution summary (for submit_resolution)
    flag?: str,              # Feature flag name (for disable_feature_flag)
    runbook_action?: str,    # Runbook step (for execute_runbook_step)
    target_version?: str,    # Deploy version (for rollback_deploy)
  }
}

# Observation: returned to the agent
Observation {
  episode_id: str,           # Unique episode UUID
  task_id: str,              # Active task
  scenario_id: str,          # Current scenario (e.g. "AC-001")
  step_count: int,           # Steps taken so far
  max_steps: int,            # Budget (3, 10, or 15)
  incident_summary: str,     # Plain-text incident description
  alert: dict,               # Alert payload: title, symptoms, error_rate, revenue_impact
  available_actions: [str],  # Valid action types for this task
  queried_data: dict,        # All evidence gathered so far
  known_services: [str],     # Valid service names for actions
  cumulative_reward: float,  # Running reward total
  done: bool,                # Episode complete flag
  feedback: str,             # Per-step reward explanation
  last_action_error: str?,   # Error from last action (null if OK)
}

# Reward: returned after every step
Reward {
  score: float,              # Step reward value
  value: float,              # Alias for score (backward compatibility)
  reason: str,               # Human-readable explanation
  cumulative: float,         # Running total
}
```

## Tasks (3 difficulty levels, 9 scenarios)

| Task ID | Difficulty | Max steps | Scenarios | What the agent does |
|---|---|---|---|---|
| `alert_classification` | 🟢 Easy | 3 | 3 | Classify alert severity P1–P4 from metrics and symptoms |
| `root_cause_analysis` | 🟡 Medium | 10 | 3 | Trace the failure chain across 8 services to find the root cause |
| `remediation_planning` | 🔴 Hard | 15 | 3 | Diagnose + execute a multi-step fix + document the resolution |

### Scenario Details

| ID | Incident | Root cause | Challenge |
|---|---|---|---|
| AC-001 | DB connection pool exhaustion | - | Clear-cut P1: 78% error rate, $12k lost per minute |
| AC-002 | CDN cache invalidation storm | - | Ambiguous P2: degraded performance but checkout works |
| AC-003 | Recommendation engine errors | - | Trap P3: 45% error rate but zero revenue impact |
| RCA-001 | Postgres OOM crash loop | analytics-service (unbounded query) | Root cause is not in the alert; 8 services to investigate |
| RCA-002 | Cross-AZ checkout failures | network-infra (BGP route withdrawal) | Network problem disguised as an application failure |
| RCA-003 | DB authentication failures | config-service (stale credential rotation) | Several misleading deploys on other services |
| RP-001 | Full OOM incident | analytics-service | 6-step remediation sequence; wrong actions are penalized |
| RP-002 | Full BGP incident | network-infra | 4-step runbook + config rollback, 8 services involved |
| RP-003 | Full credential incident | config-service | 7-step sequence: credential rotation + service restarts |

### Why This Is Genuinely Challenging

- **Medium**: The root-cause service is never in the alert's `affected_services`. The agent must query the victim service's logs, follow the clues that point to the culprit, and then investigate that service. There are 8 known services, some with misleading deploys.
- **Hard**: The same diagnostic challenge, plus 4-7 remediation actions that must be executed in a logical order. Wrong actions (for example, restarting a healthy service) incur a −0.15 penalty. The resolution summary must mention the specific services and actions.

### Baseline Scores

| Model | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| `llama-3.1-8b-instant` | 1.0 | 0.65 | 0.70 | 0.78 |
| `llama-3.3-70b-versatile` | 1.0 | 0.99 | 0.80 | 0.93 |

The 70B model consistently outperforms the 8B model on the medium and hard tasks, which suggests that the environment can discriminate between models of different quality.

## Action Space

### 🔍 Diagnostic Actions (gather evidence)

```
{"action_type": "query_logs", "parameters": {"service": "postgres-db"}}
{"action_type": "check_metrics", "parameters": {"service": "auth-service"}}
{"action_type": "check_dependencies", "parameters": {"service": "api-gateway"}}
{"action_type": "check_recent_deploys", "parameters": {"service": "analytics-service"}}
{"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
```

### 🔧 Remediation Actions (fix the incident)

```
{"action_type": "restart_service", "parameters": {"service": "postgres-db"}}
{"action_type": "rollback_deploy", "parameters": {"service": "config-service", "target_version": "previous"}}
{"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
{"action_type": "scale_service", "parameters": {"service": "image-service", "replicas": 10}}
{"action_type": "clear_cache", "parameters": {"service": "redis-session"}}
```

### 📝 Submit Actions (end the episode)

```
{"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "api-gateway"}}
{"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM killing postgres-db"}}
{"action_type": "submit_resolution", "parameters": {"summary": "3+ sentence description of what failed, what you did, and current status"}}
```

## Reward Function

Dense reward shaping provides signal over the **entire trajectory**, not just a binary outcome at the end of the episode:

| Signal | Reward | Description |
|---|---|---|
| Query a new service | +0.03 to +0.04 | First diagnostic action on a service |
| Query a new action type | +0.01 to +0.02 | A different diagnostic on an already-queried service |
| Repeat the same query | −0.03 to −0.04 | Same (action, service) pair executed again |
| Unknown service | −0.05 to −0.06 | Service is not in known_services |
| Correct remediation | +0.06 | Action matches the correct remediation sequence |
| Wrong remediation | −0.12 to −0.15 | Action is in the wrong_actions list (e.g. restarting a healthy service) |
| Correct submit type | +0.02 | The right submit action was used for the task |
| Wrong submit type | −0.08 to −0.12 | e.g. using submit_severity during remediation_planning |
| Past the midpoint (non-submit) | −0.015 to −0.04 | Per-step efficiency penalty |
| Timeout | −0.15 to −0.20 | No submission before max_steps |
| Exact duplicate action | −0.04 to −0.05 | Same action + parameters as an earlier step |
| **Grader score** | **0.0–1.0** | **Added at the terminal step** |

### Grading (deterministic, reproducible, 0.0–1.0)

| Task | Grading logic |
|---|---|
| `alert_classification` | 1.0 exact match · 0.5 adjacent level (P1↔P2) · 0.25 two levels apart · 0.0 wrong |
| `root_cause_analysis` | 0.6 base (correct service + failure mode) + up to 0.4 efficiency bonus |
| `remediation_planning` | 0.6 base + 0.3 efficiency (correct steps matched) − 0.15 penalty (wrong actions) + 0.1 summary quality |

## 🖥️ Interactive UI Walkthrough

The Gradio UI at `/` provides a visual interface for human evaluation. Here is how to use it:

### 🟢 Easy Task: Alert Classification

1. **Select the task**: choose `🟢 Easy — Alert Classification` from the task dropdown
2. **Select the scenario**: choose `Scenario 2` (the tricky P3 trap)
3. **Click** `🔄 Reset Environment`
4. **Read** the observation panel: recommendation-service has a 45% error rate
5. **Investigate**: set Action Type to `🔍 check_metrics`, Service to `recommendation-service`, and click `▶️ Execute Action`
6. **Read the evidence**: "User impact: NONE", "Revenue: unchanged", "Checkout: 100%"
7. **Submit**: set Action Type to `📝 submit_severity`, expand `📋 Parameters`, set Severity to `P3 Medium`, and click `▶️ Execute Action`
8. **Grade**: click `📊 Grade`; an exact P3 match should show **1.0**

### 🟡 Medium Task: Root Cause Analysis

1. **Select the task**: `🟡 Medium — Root Cause Analysis`, **scenario**: `Scenario 0`
2. **Click** `🔄 Reset Environment`
3. **Read** the observation: postgres-db is crash-looping and several services are down
4. **Query the victim**: Action Type `🔍 query_logs`, Service `postgres-db`, click `▶️ Execute Action`
5. **Read the evidence**: the logs show *"query from analytics-service consuming all memory"*
6. **Follow the trail**: Action Type `🔍 query_logs`, Service `analytics-service`, click `▶️ Execute Action`
7. **Read the evidence**: "full_history_export job", "847M row scan", "no LIMIT"
8. **Confirm**: Action Type `🔍 check_recent_deploys`, Service `analytics-service`, click `▶️ Execute Action`
9. **Read the evidence**: "Deploy 6h ago: cross-table JOIN without LIMIT clause"
10. **Submit**: Action Type `📝 submit_root_cause`, Service `analytics-service`, Failure Mode: `unbounded query OOM killing postgres-db`, click `▶️ Execute Action`
11. **Grade**: click `📊 Grade`; it should show **0.85–1.0**

### 🔴 Hard Task: Remediation Planning

1. **Select the task**: `🔴 Hard — Remediation Planning`, **scenario**: `Scenario 0`
2. **Click** `🔄 Reset Environment`
3. **Diagnose**: run `🔍 query_logs` on `postgres-db` → see the "analytics-service" clue
4. **Confirm**: run `🔍 query_logs` on `analytics-service` → see "full_history_export, no LIMIT"
5. **Remediation step 1**: `🔧 disable_feature_flag`, Flag: `full_history_export` → "job DISABLED"
6. **Remediation step 2**: run `🔧 restart_service` on `analytics-service` → "restarted — idle"
7. **Remediation step 3**: run `🔧 restart_service` on `postgres-db` → "accepting connections (12/500)"
8. **Remediation step 4**: run `🔧 restart_service` on `auth-service` → "reconnected OK"
9. **Remediation step 5**: run `🔧 restart_service` on `order-service` → "writes resuming"
10. **Verify**: `🔧 execute_runbook_step`, Runbook Action: `verify_db_health` → "healthy"
11. **Submit**: `📝 submit_resolution`, Summary: *"The analytics-service deployed a full_history_export job with an unbounded query that OOM-killed postgres-db. We disabled the full_history_export flag, restarted analytics-service, then restarted postgres-db, auth-service, and order-service. All services recovered and postgres-db is healthy."*
12. **Grade**: click `📊 Grade`; it should show **0.85–1.0**

### UI Controls Reference

| Button | Purpose |
|---|---|
| `🔄 Reset Environment` | Start a new episode |
| `▶️ Execute Action` | Run the selected action |
| `📋 Parameters` | Expand to fill in the severity / failure_mode / summary / flag / runbook fields |
| `📊 Grade` | View the final grader score (0.0–1.0) after the episode ends |
| `📋 State` | Refresh the state panel |

### Common Mistakes and Penalties

| Mistake | Penalty | Reason |
|---|---|---|
| Wrong submit type (e.g. `submit_severity` in the hard task) | −0.12 | Each task has exactly one correct submit action |
| Restarting a healthy service (e.g. `restart redis-session`) | −0.15 | Wrong remediation action |
| Querying a service not in `known_services` | −0.06 | Invalid target |
| Repeating the exact same action | −0.04 | Loop detection |
| Not submitting before reaching max steps | −0.20 | Timeout penalty |
| Using a remediation action in the easy task | −0.08 | Not available for alert classification |

## API Usage

### Quick Test

```
# Reset with defaults (alert_classification, scenario 0)
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" -d '{}'

# Reset with a specific task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "root_cause_analysis", "scenario_index": 1}'

# Take one step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "query_logs", "parameters": {"service": "postgres-db"}}'

# Check state
curl http://localhost:7860/state

# Grade the current episode
curl http://localhost:7860/grader
```

### Full Episode Example (Python)

```
import requests

BASE = "http://localhost:7860"

# Start an episode
obs = requests.post(f"{BASE}/reset", json={
    "task_id": "alert_classification",
    "scenario_index": 0
}).json()
print(f"Incident: {obs['incident_summary']}")
print(f"Services: {obs['known_services']}")

# Investigate
result = requests.post(f"{BASE}/step", json={
    "action_type": "check_metrics",
    "parameters": {"service": obs["known_services"][0]}
}).json()
print(f"Reward: {result['reward']['score']:+.3f}")
print(f"Done: {result['done']}")

# Submit
result = requests.post(f"{BASE}/step", json={
    "action_type": "submit_severity",
    "parameters": {"severity": "P1", "service": obs["known_services"][0]}
}).json()

# Grade
grade = requests.get(f"{BASE}/grader").json()
print(f"Score: {grade['total']}")
```

## Setup

### Local Development

```
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Docker

```
docker build -t cloud-incident-env .
docker run -p 7860:7860 cloud-incident-env
```

### Running the Baseline Agent

```
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export HF_TOKEN="gsk_your_groq_key"
python inference.py
```

## Project Structure

```
├── Dockerfile          # Docker build for HF Spaces
├── README.md           # This file
├── requirements.txt    # Python dependencies
├── openenv.yaml        # OpenEnv manifest (tasks, endpoints)
├── pyproject.toml      # Project metadata
├── tasks.py            # 9 scenarios across 3 difficulty levels
├── graders.py          # Deterministic graders (0.0–1.0)
├── inference.py        # Baseline LLM agent with fallback logic
└── server/
    ├── __init__.py
    ├── app.py          # FastAPI + Gradio endpoints
    ├── environment.py  # Core step/reset/state logic + reward shaping
    └── models.py       # Typed Pydantic models (Action, Observation, Reward)
```

## Design

### Why Cloud Incident Response?

Every cloud company employs SREs to respond to production incidents, often under time pressure and with incomplete information. It is a widespread, high-value skill that AI agents should learn. The environment models the exact decision loop: triage → investigate → diagnose → remediate → document.

### Why These Specific Incidents?

- **OOM kills** (RCA-001, RP-001): one of the most common database failure modes. A runaway query consumes all memory and crashes the DB, which then takes down every service that depends on it.
- **BGP partitions** (RCA-002, RP-002): a network-layer failure that looks like an application failure. Services appear "down" but are actually healthy and simply unreachable.
- **Credential rotation bugs** (RCA-003, RP-003): a configuration-management failure that causes cascading authentication failures. The DB is fine, but the clients hold the wrong password.

### Why Dense Rewards?

Sparse rewards, given only at the end of an episode, give an RL agent little signal to learn from. Our reward function provides feedback at **every step**: positive for useful investigation, negative for wasted actions, with the final grader score added on top. This supports LLM agent evaluation as well as RL training.
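To make the dense-reward idea concrete, here is a minimal sketch of how per-step signals and the terminal grader score could add up over one episode. It is not the code in `server/environment.py`: the event names, the `STEP_REWARDS` table, and the `shaped_return` helper are hypothetical, and each constant is simply the larger-magnitude end of the corresponding range in the reward table.

```python
# Hypothetical illustration of dense reward accumulation.
# Event names and values are assumptions taken from the reward table's ranges;
# the real logic lives in server/environment.py and may differ.
STEP_REWARDS = {
    "new_service": 0.04,          # first diagnostic action on a service
    "new_action_type": 0.02,      # different diagnostic on an already-queried service
    "repeat_query": -0.04,        # same (action, service) pair again
    "unknown_service": -0.06,     # service not in known_services
    "correct_remediation": 0.06,  # matches the correct remediation sequence
    "wrong_remediation": -0.15,   # e.g. restarting a healthy service
    "correct_submit": 0.02,       # right submit action for the task
    "wrong_submit": -0.12,        # e.g. submit_severity in remediation_planning
}


def shaped_return(events, grader_score):
    """Return (running totals after each step, final total including grader score)."""
    running = 0.0
    trajectory = []
    for event in events:
        running += STEP_REWARDS[event]
        trajectory.append(round(running, 4))
    # The grader score (0.0-1.0) is added once, at the terminal step.
    total = round(running + grader_score, 4)
    return trajectory, total


if __name__ == "__main__":
    steps = ["new_service", "new_service", "new_action_type", "correct_submit"]
    per_step, total = shaped_return(steps, grader_score=0.9)
    print(per_step, total)
```

Under these assumed values, an agent that wastes steps sees its running total fall before the episode ends, which is the signal a sparse, end-of-episode reward would not provide.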

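The `alert_classification` row of the grading table can be read as a distance rule over the ordered severity levels. The sketch below is one such reading, not the contents of `graders.py`: the `grade_severity` name is hypothetical, and treating a three-level miss or an invalid label as 0.0 is an assumption based on the "0.0 wrong" entry.

```python
# Hypothetical reading of the alert_classification grading rule.
# Not the repository's graders.py; names and edge-case handling are assumptions.
SEVERITIES = ["P1", "P2", "P3", "P4"]

# Score by how many levels the submitted severity is from the expected one.
SCORE_BY_DISTANCE = {0: 1.0, 1: 0.5, 2: 0.25}


def grade_severity(submitted, expected):
    """Score a severity submission in [0.0, 1.0] by distance from the expected level."""
    if submitted not in SEVERITIES or expected not in SEVERITIES:
        return 0.0  # assumption: an unrecognized label scores zero
    distance = abs(SEVERITIES.index(submitted) - SEVERITIES.index(expected))
    # Three levels apart (P1 vs P4) falls through to 0.0.
    return SCORE_BY_DISTANCE.get(distance, 0.0)


if __name__ == "__main__":
    # AC-003 is described as a P3 trap: a P1 guess is two levels off.
    for guess in SEVERITIES:
        print(guess, grade_severity(guess, "P3"))
```

A rule like this is deterministic and reproducible, which matches how the grading section describes the graders.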