GitHub: codergithub07/incident-response-env
An OpenEnv-compliant reinforcement learning environment for training and evaluating AI agents as on-call SRE engineers across the full IT incident workflow: triage, diagnosis, remediation, and communication.
---
title: IT Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: red
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - reinforcement-learning
  - agent-training
  - sre
  - incident-response
license: mit
---
# 🚨 IT Incident Response — OpenEnv Environment
An OpenEnv-compliant RL environment that simulates the real work of an on-call Site Reliability Engineer (SRE) handling production incidents.
The agent must triage, investigate, diagnose, remediate, and communicate, just as a human engineer would, and is rewarded according to the quality and accuracy of every action.
## 🎯 Motivation
Every tech company loses engineering time and revenue to production incidents. Training AI agents to assist with incident response is one of the most commercially valuable applications of agentic RL. This environment provides a controlled, reproducible sandbox for such training, with realistic scenarios, partial progress rewards, and difficulty levels that challenge frontier models.
## 📋 Tasks
| Task | Difficulty | Max Steps | Description |
|------|-----------|-----------|-------------|
| `classify` | 🟢 Easy | 1 | Given an incident alert, assign a severity level (P0–P3), an incident type, and the affected service |
| `diagnose` | 🟡 Medium | 5 | Investigate logs and metrics, then identify the root cause with supporting evidence |
| `resolve` | 🔴 Hard | 10 | Full workflow: triage → investigate → diagnose → remediate → communicate |
### Task Details
**classify**: The agent receives a title, a description, and a set of monitoring alerts. It must emit `{"action_type": "classify", "severity": "P1", "incident_type": "availability", "affected_service": "payment-service"}` in a single action. Each field earns partial credit, and severity is graded with a one-level tolerance.
**diagnose**: The agent sees the same initial context plus deeper signals. It can call `query_logs` and `query_metrics` to retrieve relevant log lines and metric snapshots, then must submit a `diagnose` action containing `root_cause`, `affected_components`, and `evidence`. Fewer investigation steps earn a higher efficiency bonus.
**resolve**: A multi-phase episode. The agent must act in order:
1. `classify` the incident (phase: triage)
2. (optional) investigate via `query_logs` / `query_metrics` (phase: investigation)
3. `diagnose` the root cause (phase: diagnosis)
4. `remediate` with a step-by-step fix plan (phase: remediation)
5. `communicate` a stakeholder update (phase: communication)
Each phase contributes a weighted share of the total episode score.
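The phase sequence above can be sketched as a scripted policy that maps the observed `phase` to the next action. The English phase names and the hard-coded field values below are illustrative assumptions; the authoritative phase strings live in `server/ir_environment.py`, and the action fields follow the Action Space section:

```python
# Minimal scripted policy for the resolve task. Phase names and field values
# are assumptions for illustration; consult the server code for the real ones.

def next_action(phase: str) -> dict:
    """Return the action dict to emit for the current resolve phase."""
    if phase == "triage":
        return {"action_type": "classify", "severity": "P1",
                "incident_type": "availability",
                "affected_service": "payment-service"}
    if phase == "investigation":
        return {"action_type": "query_logs", "query": "payment-service",
                "time_range_minutes": 30}
    if phase == "diagnosis":
        return {"action_type": "diagnose",
                "root_cause": "connection_pool_exhaustion",
                "affected_components": ["payment-service", "postgres"],
                "evidence": ["pool exhausted", "timeouts spiking"],
                "confidence": 0.8}
    if phase == "remediation":
        return {"action_type": "remediate",
                "remediation_steps": ["increase pool size", "restart workers"],
                "estimated_resolution_minutes": 20}
    if phase == "communication":
        return {"action_type": "communicate", "audience": "all",
                "message": "Mitigation in progress; ETA 20 minutes."}
    raise ValueError(f"unknown phase: {phase}")
```

A real agent would replace the hard-coded values with model output, but the phase-to-action dispatch stays the same.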
## 🔭 Observation Space
```python
class IncidentObservation(Observation):
task: str # "classify" | "diagnose" | "resolve"
incident_id: str # e.g. "INC-2024-001"
scenario_index: int # which scenario is active
title: str # one-line incident title
description: str # detailed incident description
alerts: List[Dict] # list of alert dicts
logs: List[str] # log lines revealed so far
metrics: Dict[str, Any] # metric snapshots revealed so far
step: int # current step (1-indexed)
max_steps: int # episode step budget
phase: str # current phase
available_actions: List[str] # valid action_types at this step
feedback: str # feedback from the last action
score_so_far: float # cumulative score [0.0, 1.0]
done: bool # inherited — episode ended?
reward: float # inherited — last step reward
```
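Since `available_actions` lists the only valid `action_type`s at the current step, an agent should filter its candidate actions against it before acting. A minimal helper sketch (hypothetical; not part of the environment API):

```python
def legal(candidates: list[dict], available_actions: list[str]) -> list[dict]:
    """Drop candidate actions whose action_type is not currently valid."""
    allowed = set(available_actions)
    return [a for a in candidates if a["action_type"] in allowed]
```

During the triage phase, for example, only a `classify` candidate survives the filter.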
## 🕹️ Action Space
```python
class IncidentAction(Action):
action_type: str # REQUIRED — one of the types below
# classify fields
severity: Optional[str] # "P0" | "P1" | "P2" | "P3"
incident_type: Optional[str] # "availability" | "performance" | "data_corruption"
# | "security" | "configuration" | "resource_exhaustion"
affected_service: Optional[str] # e.g. "payment-service"
# investigation fields
query: Optional[str] # service name or keyword for log/metric queries
time_range_minutes: Optional[int]
# diagnose fields
root_cause: Optional[str] # short slug, e.g. "connection_pool_exhaustion"
affected_components: Optional[List[str]]
evidence: Optional[List[str]]
confidence: Optional[float] # 0.0–1.0
# remediate fields
remediation_steps: Optional[List[str]]
estimated_resolution_minutes: Optional[int]
# communicate fields
audience: Optional[str] # "engineering" | "management" | "customers" | "all"
message: Optional[str]
```
## 🏆 Reward Function
All rewards are bounded to `[0.0, 1.0]` for every episode.
### classify task (max 1.0)
| Component | Weight | Criteria |
|-----------|--------|----------|
| severity | 0.40 | Exact=1.0, ±1 level=0.6, ±2 levels=0.2 |
| incident_type | 0.30 | Substring match |
| affected_service | 0.30 | Substring match |
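The table above translates directly into a pure scoring function. This is a sketch assuming case-insensitive substring matching against ground-truth slugs; the real grader lives in `server/graders.py` and may differ in detail:

```python
SEVERITIES = ["P0", "P1", "P2", "P3"]

def grade_classify(pred: dict, truth: dict) -> float:
    """Weighted classify score: severity 0.40, type 0.30, service 0.30."""
    gap = abs(SEVERITIES.index(pred["severity"]) - SEVERITIES.index(truth["severity"]))
    severity_score = {0: 1.0, 1: 0.6, 2: 0.2}.get(gap, 0.0)  # ±1 tolerance
    type_score = 1.0 if truth["incident_type"] in pred["incident_type"].lower() else 0.0
    service_score = 1.0 if truth["affected_service"] in pred["affected_service"].lower() else 0.0
    return 0.40 * severity_score + 0.30 * type_score + 0.30 * service_score
```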
### diagnose task (max 1.0)
| Component | Weight | Criteria |
|-----------|--------|----------|
| root_cause | 0.40 | Substring match against ground truth slug |
| affected_components | 0.20 | Overlap with expected component list |
| evidence quality | 0.30 | Keyword coverage of expected evidence tokens |
| step efficiency | 0.10 | Full points for ≤1 investigation step; −0.05 per extra step |
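The step-efficiency row can be expressed as a small clamp. This sketch assumes the bonus bottoms out at zero, which the table does not state explicitly:

```python
def efficiency_bonus(investigation_steps: int, weight: float = 0.10) -> float:
    """Full 0.10 weight for at most one investigation step, minus 0.05 per extra."""
    extra = max(0, investigation_steps - 1)
    return max(0.0, weight - 0.05 * extra)
```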
### resolve task (max 1.0)
| Phase | Weight | Grader |
|-------|--------|--------|
| classify | 0.20 | Same as classify task |
| diagnose | 0.30 | Same as diagnose task |
| remediate | 0.30 | Keyword coverage of expected fix actions (min 2 steps) |
| communicate | 0.20 | Keyword coverage of expected message content + audience + length |
Investigation steps during resolve earn no direct reward, but they unlock log/metric data.
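Putting the resolve table together, the episode score is a weighted sum of per-phase grader outputs. A sketch with weights copied from the table, assuming each phase score is normalized to [0, 1] and missed phases score 0:

```python
PHASE_WEIGHTS = {"classify": 0.20, "diagnose": 0.30,
                 "remediate": 0.30, "communicate": 0.20}

def resolve_score(phase_scores: dict) -> float:
    """Combine per-phase grader outputs into the episode score in [0.0, 1.0]."""
    return sum(w * phase_scores.get(phase, 0.0)
               for phase, w in PHASE_WEIGHTS.items())
```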
## 📦 Setup
### Prerequisites
- Python 3.11+
- Docker
- `pip install openenv-core`
### Local server (without Docker)
```
cd openenv-ir
pip install -r server/requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```
### Docker
```
# Build from the repo root
docker build -t incident-response-env:latest -f server/Dockerfile .
# Run
docker run -p 8000:8000 incident-response-env:latest
# Validate
openenv validate --url http://localhost:8000
```
## 🤖 Running the Inference Script
```
export HF_TOKEN="hf_..."
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
# Point at the running server
export IR_ENV_URL="http://localhost:8000"
cd openenv-ir
python inference.py
```
Or using Docker:
```
export IMAGE_NAME="incident-response-env:latest"
# (leave IR_ENV_URL unset; inference.py starts Docker automatically)
python inference.py
```
## 📊 Benchmark Scores
Measured with `Qwen/Qwen2.5-72B-Instruct` via the HuggingFace router (expert-level scripted agent):
| Task | Difficulty | Score | Success |
|------|-----------|-------|---------|
| classify | 🟢 Easy | **1.000** | ✅ |
| diagnose | 🟡 Medium | **0.946** | ✅ |
| resolve | 🔴 Hard | **0.816** | ✅ |
| **mean** | — | **0.921** | — |
A random or adversarially bad agent scores approximately **0.10** on resolve, confirming the environment is non-trivial.
*Re-run `inference.py` against your own model to reproduce.*
## ✅ Pre-submission Checklist
```
# Validate the local structure
openenv validate .
# Build the Docker image
docker build -t incident-response-env:latest -f server/Dockerfile .
# Run and validate the live server
docker run -p 8000:8000 incident-response-env:latest &
openenv validate --url http://localhost:8000
# Run the pre-submission validation script
./validate-submission.sh https://your-space.hf.space .
```
## 🗂️ Project Structure
```
openenv-ir/
├── openenv.yaml ← OpenEnv spec metadata
├── models.py ← IncidentAction + IncidentObservation (shared)
├── client.py ← IncidentResponseEnv WebSocket client
├── __init__.py
├── inference.py ← Baseline agent (classify + diagnose + resolve)
├── README.md
└── server/
├── Dockerfile ← Container definition
├── requirements.txt
├── app.py ← FastAPI app (create_app wrapper)
├── ir_environment.py ← Core state machine
├── tasks.py ← Scenario definitions + ground truth
└── graders.py ← Deterministic scoring functions
```
## 🔌 HuggingFace Space Deployment
1. Create a new HF Space with the **Docker** SDK and the `openenv` tag.
2. Push this repo to the Space.
3. HF builds the Dockerfile and exposes port 8000.
4. Point the validation script at `https://-incident-response-env.hf.space`.
```
./validate-submission.sh https://.hf.space .
```
## 📄 License
Released under the MIT License.