GitHub: codergithub07/incident-response-env
---
title: IT Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: red
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - reinforcement-learning
  - agent-training
  - sre
  - incident-response
license: mit
---

# 🚨 IT Incident Response — OpenEnv Environment

An OpenEnv-compliant RL environment that simulates the real work of an on-call Site Reliability Engineer (SRE) handling production incidents. The agent must triage, investigate, diagnose, remediate, and communicate, just as a human engineer would, and is rewarded according to the quality and accuracy of each action.

## 🎯 Motivation

Every technology company loses engineering time and revenue to production incidents. Training AI agents to assist with incident response is one of the most commercially valuable applications of agentic RL. This environment provides a controlled, reproducible sandbox for such training, with realistic scenarios, partial progress rewards, and difficulty levels that challenge frontier models.

## 📋 Task Descriptions

| Task | Difficulty | Max Steps | Description |
|------|-----------|-----------|-------------|
| `classify` | 🟢 Easy | 1 | Given the incident alerts, assign a severity level (P0–P3), an incident type, and the affected service |
| `diagnose` | 🟡 Medium | 5 | Investigate logs and metrics, then identify the root cause and supporting evidence |
| `resolve` | 🔴 Hard | 10 | Full workflow: classify → investigate → diagnose → remediate → communicate |

### Task Details

**classify** — The agent receives a title, a description, and a set of monitoring alerts. In a single action it must emit `{"action_type": "classify", "severity": "P1", "incident_type": "availability", "affected_service": "payment-service"}`. Each field earns partial credit, and severity tolerates an off-by-one level.

**diagnose** — The agent sees the same initial context plus deeper signals. It can use `query_logs` and `query_metrics` to retrieve relevant log lines and metric snapshots, then must submit a `diagnose` action containing `root_cause`, `affected_components`, and `evidence`. Fewer investigation steps → higher efficiency bonus.

**resolve** — A multi-phase episode. The agent must, in order:

1. `classify` the incident (phase: triage)
2. (optional) investigate via `query_logs` / `query_metrics` (phase: investigation)
3. `diagnose` the root cause (phase: diagnosis)
4. `remediate` with a step-by-step fix plan (phase: remediation)
5. `communicate` a stakeholder update (phase: communication)

Each phase contributes a weighted share of the total episode score.

## 🔭 Observation Space

```
class IncidentObservation(Observation):
    task: str                       # "classify" | "diagnose" | "resolve"
    incident_id: str                # e.g. "INC-2024-001"
    scenario_index: int             # which scenario is active
    title: str                      # one-line incident title
    description: str                # detailed incident description
    alerts: List[Dict]              # list of alert dicts
    logs: List[str]                 # log lines revealed so far
    metrics: Dict[str, Any]         # metric snapshots revealed so far
    step: int                       # current step (1-indexed)
    max_steps: int                  # episode step budget
    phase: str                      # current phase
    available_actions: List[str]    # valid action_types at this step
    feedback: str                   # feedback from the last action
    score_so_far: float             # cumulative score [0.0, 1.0]
    done: bool                      # inherited — episode ended?
    reward: float                   # inherited — last step reward
```

## 🕹️ Action Space

```
class IncidentAction(Action):
    action_type: str                # REQUIRED — one of the types below

    # classify fields
    severity: Optional[str]         # "P0" | "P1" | "P2" | "P3"
    incident_type: Optional[str]    # "availability" | "performance" | "data_corruption"
                                    # | "security" | "configuration" | "resource_exhaustion"
    affected_service: Optional[str] # e.g. "payment-service"

    # investigation fields
    query: Optional[str]            # service name or keyword for log/metric queries
    time_range_minutes: Optional[int]

    # diagnose fields
    root_cause: Optional[str]       # short slug, e.g. "connection_pool_exhaustion"
    affected_components: Optional[List[str]]
    evidence: Optional[List[str]]
    confidence: Optional[float]     # 0.0–1.0

    # remediate fields
    remediation_steps: Optional[List[str]]
    estimated_resolution_minutes: Optional[int]

    # communicate fields
    audience: Optional[str]         # "engineering" | "management" | "customers" | "all"
    message: Optional[str]
```

## 🏆 Reward Function

All rewards per episode fall within `[0.0, 1.0]`.

### classify task (max 1.0)

| Component | Weight | Criteria |
|-----------|--------|----------|
| severity | 0.40 | Exact = 1.0, ±1 level = 0.6, ±2 levels = 0.2 |
| incident_type | 0.30 | Substring match |
| affected_service | 0.30 | Substring match |

### diagnose task (max 1.0)

| Component | Weight | Criteria |
|-----------|--------|----------|
| root_cause | 0.40 | Substring match against ground-truth slug |
| affected_components | 0.20 | Overlap with expected component list |
| evidence quality | 0.30 | Keyword coverage of expected evidence tokens |
| step efficiency | 0.10 | Full points for ≤1 investigation step; −0.05 per extra step |

### resolve task (max 1.0)

| Phase | Weight | Grader |
|-------|--------|--------|
| classify | 0.20 | Same as classify task |
| diagnose | 0.30 | Same as diagnose task |
| remediate | 0.30 | Keyword coverage of expected fix actions (min 2 steps) |
| communicate | 0.20 | Keyword coverage of expected message content + audience + length |

Investigation steps in resolve contribute no direct reward, but unlock log/metric data.

## 📦 Setup

### Prerequisites

- Python 3.11+
- Docker
- `pip install openenv-core`

### Local Server (no Docker)

```
cd openenv-ir
pip install -r server/requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

### Docker

```
# Build from the repo root
docker build -t incident-response-env:latest -f server/Dockerfile .

# Run
docker run -p 8000:8000 incident-response-env:latest

# Verify
openenv validate --url http://localhost:8000
```

## 🤖 Running the Inference Script

```
export HF_TOKEN="hf_..."
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"

# Point at the running server
export IR_ENV_URL="http://localhost:8000"

cd openenv-ir
python inference.py
```

Or using Docker:

```
export IMAGE_NAME="incident-response-env:latest"
# (leave IR_ENV_URL unset — inference.py starts Docker automatically)
python inference.py
```

## 📊 Benchmark Scores

Measured with `Qwen/Qwen2.5-72B-Instruct` via the HuggingFace router (expert-level scripted agent):

| Task | Difficulty | Score | Success |
|------|-----------|-------|---------|
| classify | 🟢 Easy | **1.000** | ✅ |
| diagnose | 🟡 Medium | **0.946** | ✅ |
| resolve | 🔴 Hard | **0.816** | ✅ |
| **mean** | — | **0.921** | — |

A bad/random agent scores approximately **0.10** on resolve, confirming that the environment is non-trivial.

*Re-run `inference.py` against your own model to reproduce.*

## ✅ Pre-Submission Checklist

```
# Validate the local structure
openenv validate .

# Build the Docker image
docker build -t incident-response-env:latest -f server/Dockerfile .

# Run and validate the live server
docker run -p 8000:8000 incident-response-env:latest &
openenv validate --url http://localhost:8000

# Run the pre-submission validation script
./validate-submission.sh https://your-space.hf.space .
```

## 🗂️ Project Structure

```
openenv-ir/
├── openenv.yaml          ← OpenEnv spec metadata
├── models.py             ← IncidentAction + IncidentObservation (shared)
├── client.py             ← IncidentResponseEnv WebSocket client
├── __init__.py
├── inference.py          ← Baseline agent (classify + diagnose + resolve)
├── README.md
└── server/
    ├── Dockerfile        ← Container definition
    ├── requirements.txt
    ├── app.py            ← FastAPI app (create_app wrapper)
    ├── ir_environment.py ← Core state machine
    ├── tasks.py          ← Scenario definitions + ground truth
    └── graders.py        ← Deterministic scoring functions
```

## 🔌 HuggingFace Space Deployment

1. Create a new HF Space with the **Docker** SDK and the `openenv` tag.
2. Push this repo to the Space.
3. HF builds the Dockerfile and exposes port 8000.
4. Point the validation script at `https://-incident-response-env.hf.space`.

```
./validate-submission.sh https://.hf.space .
```

## 📄 License

Released under the MIT License.
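As an illustration of the reward tables above, here is a minimal sketch of the `classify` grader. This is not the repository's implementation (that lives in `server/graders.py`): the function name, field names, and substring-match direction are assumptions; only the weights, the severity tolerance schedule, and the substring rule come from this README.

```python
# Illustrative sketch only — the real grader is server/graders.py.
# Weights (0.40/0.30/0.30) and the severity tolerance (exact=1.0,
# ±1 level=0.6, ±2 levels=0.2) come from the reward table above.

SEVERITIES = ["P0", "P1", "P2", "P3"]

def grade_classify(action: dict, truth: dict) -> float:
    """Score a classify action against ground truth, returning [0.0, 1.0]."""
    # Severity: credit decays with distance between predicted and true level.
    diff = abs(SEVERITIES.index(action["severity"]) - SEVERITIES.index(truth["severity"]))
    severity_score = {0: 1.0, 1: 0.6, 2: 0.2}.get(diff, 0.0)

    # Substring match for type and service (match direction is an assumption).
    type_score = 1.0 if truth["incident_type"] in action.get("incident_type", "") else 0.0
    service_score = 1.0 if truth["affected_service"] in action.get("affected_service", "") else 0.0

    return 0.40 * severity_score + 0.30 * type_score + 0.30 * service_score
```

Per the table, an off-by-one severity with correct type and service scores 0.40 × 0.6 + 0.30 + 0.30 = 0.84.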