GitHub: RehanNaveid/incident-response-env

A reinforcement-learning simulation environment for SRE incident response, designed to train and evaluate AI agents' diagnostic reasoning and automated remediation capabilities on system telemetry.
---
title: IncidentIQ
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: "latest"
python_version: "3.12"
app_file: app.py
pinned: false
---
# IncidentIQ — SRE Incident Response Environment
[OpenEnv](https://openenv.dev) · [Hugging Face](https://huggingface.co) · [Docker](https://www.docker.com)
## Description & Motivation
Site Reliability Engineers respond to production incidents under time pressure: they diagnose root causes from incomplete information (logs, metrics, and service dependencies) and apply targeted fixes before the SLA window expires. This is a structured reasoning task that real engineers perform daily, and one that AI agents are increasingly expected to assist with.
**IncidentIQ** simulates this task as a reinforcement-learning environment. The agent receives live system telemetry, must reason from log evidence to the root cause, and must execute the correct sequence of actions to resolve the incident. The environment models realistic failure modes:
- Connection pool exhaustion
- Cascading config deploys
- Memory leaks
- Certificate expiry
- Upstream rate limiting
- Database degradation
Training agents in this environment builds structured diagnostic reasoning, evidence-based decision making, and sequential planning under constraints, skills that transfer directly to real SRE automation.
## Action Space
Actions are **free-text commands**; the environment parses intent from the text.
| Format | Description |
|---|---|
| `investigate <service>` | Inspect a specific service. The service name must exactly match one in **AFFECTED SERVICES**. |
| `assign to <team>` | Assign the incident to a team. The team name must exactly match one in **TEAM ROSTER**. |
| `mitigate: <keywords>` | Apply a targeted fix. The mitigation keywords must appear in an `ERROR` or `CRIT` log line. |
| `escalate` | Raise the incident severity. |
| `resolve` | Close the incident. Valid only after a confirmed mitigation. |
**Execution rules:**
- Mitigation requires having investigated at least one affected service first.
- Resolution requires a confirmed mitigation.
- Repeating the same action incurs a penalty.
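The command formats above suggest a straightforward intent parser. The sketch below is illustrative only (the environment's actual parser lives in `environment.py` and may behave differently):

```python
import re

def parse_action(text: str) -> dict:
    """Hypothetical sketch of mapping a free-text command to an intent."""
    text = text.strip()
    if m := re.match(r"investigate\s+(.+)", text, re.IGNORECASE):
        return {"intent": "investigate", "service": m.group(1).strip()}
    if m := re.match(r"assign to\s+(.+)", text, re.IGNORECASE):
        return {"intent": "assign", "team": m.group(1).strip()}
    if m := re.match(r"mitigate:\s*(.+)", text, re.IGNORECASE):
        return {"intent": "mitigate", "keywords": m.group(1).strip()}
    if text.lower() == "escalate":
        return {"intent": "escalate"}
    if text.lower() == "resolve":
        return {"intent": "resolve"}
    # Anything unrecognized is penalized as an unknown action
    return {"intent": "unknown"}
```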
## Observation Space
Each call to `step()` returns an `IncidentObservation` object with the following fields:
| Field | Type | Description |
|---|---|---|
| `task_id` | `string` | Task identifier |
| `incident_description` | `string` | Human-readable incident title |
| `affected_services` | `list[string]` | Services involved in the incident |
| `severity` | `P0 \| P1 \| P2` | Incident severity level |
| `logs` | `list[string]` | Last 8 log lines with ISO 8601 timestamps |
| `metrics` | `list[object]` | Per-service `error_rate_pct`, `latency_p99_ms`, `throughput_rps`, `status` |
| `feedback` | `string` | Natural-language outcome of the previous action |
| `reward` | `float` | Step reward for the previous action |
| `score` | `float [0,1]` | Cumulative normalized episode progress |
| `sla_remaining` | `int` | Minutes remaining before SLA breach |
| `team_roster` | `dict` | `team_name → available \| busy` |
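The field table above can be mirrored as a typed structure. This is a stdlib sketch for illustration; the actual environment defines Pydantic models in `models.py`, whose exact definitions may differ:

```python
from dataclasses import dataclass, field

@dataclass
class IncidentObservation:
    """Illustrative shape of one observation, per the field table."""
    task_id: str
    incident_description: str
    affected_services: list[str]
    severity: str                # "P0" | "P1" | "P2"
    logs: list[str]              # last 8 log lines, ISO 8601 timestamps
    metrics: list[dict]          # per-service error_rate_pct, latency_p99_ms, ...
    feedback: str                # outcome of the previous action
    reward: float                # step reward for the previous action
    score: float                 # cumulative, normalized to [0, 1]
    sla_remaining: int           # minutes until SLA breach
    team_roster: dict[str, str] = field(default_factory=dict)  # team -> "available" | "busy"
```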
## Tasks
### Task 1 — Single-Service Outage *(easy, max 10 steps)*
One service is down, and the root cause is clearly visible in the logs. The agent must investigate the affected service, assign the correct team, apply the mitigation keywords matching the error logs, then resolve. A capable agent should finish in 4–5 steps.
### Task 2 — Cascading Failure *(medium, max 18 steps)*
Three services fail in sequence after a config deploy. The agent must investigate all three, identify the root-cause service, apply the correct rollback, and resolve before a tight SLA expires. Systematic investigation is required before any mitigation.
### Task 3 — Ambiguous Payment Degradation *(hard, max 25 steps)*
The payment service degrades, with three plausible root causes and two deliberate red herrings in the logs. The agent must investigate multiple hypothesis domains (upstream rate limiting, database issues, resource exhaustion), identify the true cause, and apply the correct mitigation. An optional `reasoning` field on actions is scored for partial credit, making this the only task that rewards chain-of-thought rather than actions alone.
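To illustrate the optional `reasoning` field, here is a hypothetical `/step` request body; the exact schema should be checked against `models.py`:

```python
# Hypothetical example of a /step body carrying the optional reasoning
# field scored for partial credit in Task 3.
step_request = {
    "action": {
        "action": "mitigate: increase memory allocation for payment-service",
        "reasoning": (
            "Heap usage grows monotonically across restarts while DB latency "
            "and upstream 429 rates stay flat, pointing to a memory leak "
            "rather than rate limiting or database degradation."
        ),
    }
}
```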
## Reward Design
Rewards are **dense and continuous** throughout the episode:
| Action | Reward |
|---|---|
| Investigate correct service | `+0.15` 到 `+0.23` |
| Assign correct team | `+0.10` |
| Apply correct mitigation keywords | `+0.20` 到 `+0.52` |
| Mitigate without investigating first | `−0.25` |
| Resolve before mitigation confirmed | `−0.40` |
| Repeated action | `−0.20` |
| Unknown action | `−0.20` |
| SLA breach (past 60% of window) | Progressive decay |
The final episode score is computed by a **deterministic scorer** applied to the full action history. The scorer checks investigation coverage, team-assignment accuracy, mitigation-keyword matching, resolution, and efficiency. All scores fall within `[0.0, 1.0]`.
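The table only says the SLA penalty decays progressively past 60% of the window. One plausible shape, purely as an assumption (the real curve lives in the environment's reward logic and may differ), is a linear ramp:

```python
def sla_penalty(elapsed_frac: float, max_penalty: float = 0.3) -> float:
    """Hypothetical progressive SLA decay: zero penalty until 60% of the
    window is consumed, then a linear ramp to -max_penalty at breach."""
    if elapsed_frac <= 0.6:
        return 0.0
    # Ramp linearly over the remaining 40% of the window, capped at breach
    return -max_penalty * min((elapsed_frac - 0.6) / 0.4, 1.0)
```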
## Baseline Results
Run with `seed=42` and model `openai/gpt-4o-mini`, with no task-specific hints in the prompt.
| Task | Difficulty | Steps Used | Score |
|---|---|---|---|
| `single_service_outage` | Easy | 7 | **0.85** |
| `cascading_failure` | Medium | 7 | **0.96** |
| `ambiguous_payment_degradation` | Hard | 10 | **0.99** |
| **Average** | | **8.0** | **0.93** |
## Setup & Usage
### Requirements
```
pip install -r requirements.txt
```
### Environment Variables
```
export API_BASE_URL="https://your-llm-endpoint/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-api-key"
# For local testing
export ENV_URL="http://localhost:7860"
# For Hugging Face deployment (used by the evaluator)
export ENV_URL="https://.hf.space" # default; only change if server runs elsewhere
```
### Run
```
# Terminal 1 — start the environment server
python app.py
# Terminal 2 — run the inference script
python inference.py
```
### Run a Single Task
```
TASK_IDS_OVERRIDE=single_service_outage python inference.py
```
### Docker
```
docker build -t incidentiq .
docker run -p 7860:7860 \
-e API_BASE_URL=$API_BASE_URL \
-e MODEL_NAME=$MODEL_NAME \
-e HF_TOKEN=$HF_TOKEN \
incidentiq
```
### Validate the Environment
```
curl -s -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "single_service_outage", "seed": 42}' \
| python -m json.tool
```
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start a new episode. Body: `{"task_id": "...", "seed": 42}` |
| `/step` | POST | Send an action. Body: `{"action": {"action": "..."}}` |
| `/state` | GET | Get current episode state, including the ground truth used for scoring |
| `/incident-meta` | GET | Get incident metadata used by the scorer |
| `/runbook` | GET | Get diagnostic hints for the affected services |
| `/tasks` | GET | List all available tasks |
| `/health` | GET | Server health check |
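A minimal stdlib client for the `/reset` and `/step` endpoints might look like this. The request bodies follow the table above; verify the exact schemas against `models.py` before relying on this sketch:

```python
import json
import os
from urllib import request

# Defaults to the local server; override ENV_URL for a deployed Space.
ENV_URL = os.environ.get("ENV_URL", "http://localhost:7860")

def reset_body(task_id: str, seed: int = 42) -> dict:
    """Request body for POST /reset, per the endpoint table."""
    return {"task_id": task_id, "seed": seed}

def step_body(action_text: str) -> dict:
    """Request body for POST /step, per the endpoint table."""
    return {"action": {"action": action_text}}

def post(path: str, body: dict) -> dict:
    """Send a JSON POST to the environment server and decode the reply."""
    req = request.Request(
        f"{ENV_URL}{path}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running server):
#   obs = post("/reset", reset_body("single_service_outage"))
#   obs = post("/step", step_body("investigate payment-service"))
```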
## OpenEnv Compliance
- Full `step()` / `reset()` / `state()` implementation
- Pydantic typed models for all actions and observations
- Deterministic scorer, no LLM calls
- Reproducible with fixed seeds
- Dense rewards across the full trajectory
- Session isolation for parallel evaluation runs
## Project Structure
```
incident-response-env/
├── server/ # Core environment server (FastAPI + OpenEnv)
│ ├── __init__.py
│ ├── app.py # FastAPI app exposing /reset, /step, /state, /health
│ ├── environment.py # Main environment logic (step/reset/state, reward)
│ ├── incidents.py # Deterministic incident generator (seed-based)
│ ├── simulator.py # Dynamic simulation (logs + metrics evolution)
│ ├── tasks.py # Task configs (difficulty, max_steps, rewards)
│
├── models.py # Pydantic models (Action, Observation, State)
├── inference.py # Baseline agent (OpenAI-compatible client)
├── client.py # Optional client helper for interacting with env
│
├── openenv.yaml # OpenEnv metadata (tasks, spaces, entrypoint)
├── Dockerfile # Container setup for HF Spaces deployment
├── pyproject.toml # Project config (used by uv)
├── uv.lock # Dependency lock file (reproducible builds)
├── requirements.txt # Python dependencies (fallback install)
│
├── .env # Local environment variables (not committed)
├── .env.example # Template for required env variables
├── .gitignore # Ignore rules
│
├── validate-submission.sh # Pre-submission validation script
├── README.md # Project documentation
│
├── venv/ or .venv/ # Virtual environment (local only, ignored)
└── __pycache__/ # Python cache (auto-generated)
```
The inference script uses an OpenAI-compatible client interface configured via `API_BASE_URL` and `MODEL_NAME`.
## License
MIT
## Full Inference Log
```
[START] task=single_service_outage env=incident_response_env model=openai/gpt-4o-mini
[STEP] step=1 action=investigate payment-service reward=0.25 done=false error=null
[STEP] step=2 action=assign to payments-oncall reward=0.10 done=false error=null
[STEP] step=3 action=mitigate: increase memory limit for payment-service pods reward=0.32 done=false error=null
[STEP] step=4 action=mitigate: increase memory limit for payment-service pods reward=0.40 done=false error=null
[STEP] step=5 action=mitigate: increase memory limit for payment-service pods reward=-0.26 done=false error=null
[STEP] step=6 action=mitigate: increase memory limit for payment-service pods reward=-0.39 done=false error=null
[STEP] step=7 action=resolve reward=0.30 done=true error=null
[END] success=true steps=7 score=0.85 rewards=0.25,0.10,0.32,0.40,-0.26,-0.39,0.30

[START] task=cascading_failure env=incident_response_env model=openai/gpt-4o-mini
[STEP] step=1 action=investigate auth-service reward=0.25 done=false error=null
[STEP] step=2 action=investigate api-gateway reward=0.22 done=false error=null
[STEP] step=3 action=investigate user-service reward=0.19 done=false error=null
[STEP] step=4 action=mitigate: rollback config token_verifier_v2 in auth-service reward=0.42 done=false error=null
[STEP] step=5 action=mitigate: rollback config token_verifier_v2 in auth-service reward=0.42 done=false error=null
[STEP] step=6 action=mitigate: rollback config token_verifier_v2 in auth-service reward=-0.26 done=false error=null
[STEP] step=7 action=resolve reward=0.28 done=true error=null
[END] success=true steps=7 score=0.96 rewards=0.25,0.22,0.19,0.42,0.42,-0.26,0.28

[START] task=ambiguous_payment_degradation env=incident_response_env model=openai/gpt-4o-mini
[STEP] step=1 action=investigate payment-service reward=-0.15 done=false error=null
[STEP] step=2 action=investigate payment-service db connection reward=-0.68 done=false error=null
[STEP] step=3 action=investigate payment-service rate limit reward=-0.46 done=false error=null
[STEP] step=4 action=investigate payment-service memory heap reward=-0.34 done=false error=null
[STEP] step=5 action=assign to payments-oncall reward=0.10 done=false error=null
[STEP] step=6 action=mitigate: increase memory allocation for payment-service reward=0.52 done=false error=null
[STEP] step=7 action=mitigate: increase memory allocation for payment-service reward=0.61 done=false error=null
[STEP] step=8 action=mitigate: increase memory allocation for payment-service reward=-0.07 done=false error=null
[STEP] step=9 action=mitigate: increase memory allocation for payment-service reward=-0.40 done=false error=null
[STEP] step=10 action=resolve reward=0.30 done=true error=null
[END] success=true steps=10 score=0.99 rewards=-0.15,-0.68,-0.46,-0.34,0.10,0.52,0.61,-0.07,-0.40,0.30

[SUMMARY] tasks=3 avg_score=0.9308 total_steps=24 all_success=True
single_service_outage         [████████████████░░░░] 0.8460 (7 steps)
cascading_failure             [███████████████████░] 0.9587 (7 steps)
ambiguous_payment_degradation [███████████████████░] 0.9878 (10 steps)
```