GitHub: RehanNaveid/incident-response-env

A reinforcement learning simulation environment for SRE incident response, in which AI agents train on live telemetry and are evaluated on diagnostic reasoning and automated remediation.


---
title: IncidentIQ
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: "latest"
python_version: "3.12"
app_file: app.py
pinned: false
---

# IncidentIQ — SRE Incident Response Environment

[![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-brightgreen)](https://openenv.dev) [![HuggingFace](https://img.shields.io/badge/🤗-HuggingFace%20Space-yellow)](https://huggingface.co) [![Docker](https://img.shields.io/badge/Docker-ready-blue)](https://www.docker.com)

## Description & Motivation

Site Reliability Engineers respond to production incidents under time pressure, diagnosing root causes from incomplete information (logs, metrics, and service dependencies) and applying targeted fixes before the SLA window expires. This is a structured reasoning task that real engineers perform daily, and one that AI agents are increasingly expected to assist with.

**IncidentIQ** simulates this task as a reinforcement learning environment. The agent receives live system telemetry, must reason about the root cause from log evidence, and must execute the correct sequence of actions to resolve the incident. The environment models realistic failure modes:

- Connection pool exhaustion
- Cascading config deploys
- Memory leaks
- Certificate expiry
- Upstream rate limiting
- Database degradation

Training agents in this environment develops structured diagnostic reasoning, evidence-based decision making, and sequential planning under constraints: skills that transfer directly to real SRE automation.

## Action Space

Actions are **free-text commands**. The environment parses intent from the text.

| Format | Description |
|---|---|
| `investigate <service>` | Inspect a specific service. The service name must exactly match an entry in **AFFECTED SERVICES**. |
| `assign to <team>` | Assign the incident to a team. The team name must exactly match an entry in **TEAM ROSTER**. |
| `mitigate: <fix>` | Apply a targeted fix. The fix keywords must appear in an `ERROR` or `CRIT` log line. |
| `escalate` | Raise the incident severity. |
| `resolve` | Close the incident. Only valid after a confirmed mitigation. |

**Execution rules:**

- Mitigation requires having investigated at least one affected service first.
- Resolution requires a confirmed mitigation.
- Repeating the same action incurs a penalty.

## Observation Space

Each call to `step()` returns an `IncidentObservation` object with the following fields:

| Field | Type | Description |
|---|---|---|
| `task_id` | `string` | Task identifier |
| `incident_description` | `string` | Human-readable incident title |
| `affected_services` | `list[string]` | Services involved in the incident |
| `severity` | `P0 \| P1 \| P2` | Incident severity level |
| `logs` | `list[string]` | Last 8 log lines with ISO 8601 timestamps |
| `metrics` | `list[object]` | Per-service `error_rate_pct`, `latency_p99_ms`, `throughput_rps`, `status` |
| `feedback` | `string` | Natural-language outcome of the previous action |
| `reward` | `float` | Step reward for the previous action |
| `score` | `float [0,1]` | Cumulative normalized episode progress |
| `sla_remaining` | `int` | Minutes remaining before SLA breach |
| `team_roster` | `dict` | `team_name → available \| busy` |

## Tasks

### Task 1 — Single-Service Outage *(Easy, max 10 steps)*

One service is down, and the root cause is clearly visible in the logs. The agent must investigate the affected service, assign the correct team, apply a mitigation whose keywords match the error logs, then resolve. A competent agent should finish in 4–5 steps.

### Task 2 — Cascading Failure *(Medium, max 18 steps)*

Three services fail in sequence after a config deploy. The agent must investigate all three services, identify the root-cause service, apply the correct rollback, and resolve before a tight SLA expires. This requires systematic investigation before any mitigation.

### Task 3 — Ambiguous Payment Degradation *(Hard, max 25 steps)*

The payment service degrades, with three plausible root causes and two deliberate red herrings in the logs. The agent must investigate multiple hypothesis domains (upstream rate limiting, database issues, resource exhaustion), identify the true cause, and apply the correct mitigation. An optional `reasoning` field on actions is graded for partial credit, making this the only task that rewards chain-of-thought rather than actions alone.

## Reward Design

Rewards are **dense and continuous** across the episode:

| Action | Reward |
|---|---|
| Investigate correct service | `+0.15` to `+0.23` |
| Assign correct team | `+0.10` |
| Apply correct mitigation keywords | `+0.20` to `+0.52` |
| Mitigate without investigating first | `−0.25` |
| Resolve before mitigation confirmed | `−0.40` |
| Repeated action | `−0.20` |
| Unknown action | `−0.20` |
| SLA breach (past 60% of window) | Progressive decay |

The final episode score is computed by a **deterministic grader** applied to the full action history. The grader checks investigation completeness, team-assignment accuracy, mitigation-keyword match, resolution, and efficiency. All scores fall in `[0.0, 1.0]`.

## Baseline Results

Run with `seed=42`, model `openai/gpt-4o-mini`, and no task-specific hints in the prompt.

| Task | Difficulty | Steps Used | Score |
|---|---|---|---|
| `single_service_outage` | Easy | 7 | **0.85** |
| `cascading_failure` | Medium | 7 | **0.96** |
| `ambiguous_payment_degradation` | Hard | 10 | **0.99** |
| **Average** | | **8.0** | **0.93** |
Full inference log:

```
[START] task=single_service_outage env=incident_response_env model=openai/gpt-4o-mini
[STEP] step=1 action=investigate payment-service reward=0.25 done=false error=null
[STEP] step=2 action=assign to payments-oncall reward=0.10 done=false error=null
[STEP] step=3 action=mitigate: increase memory limit for payment-service pods reward=0.32 done=false error=null
[STEP] step=4 action=mitigate: increase memory limit for payment-service pods reward=0.40 done=false error=null
[STEP] step=5 action=mitigate: increase memory limit for payment-service pods reward=-0.26 done=false error=null
[STEP] step=6 action=mitigate: increase memory limit for payment-service pods reward=-0.39 done=false error=null
[STEP] step=7 action=resolve reward=0.30 done=true error=null
[END] success=true steps=7 score=0.85 rewards=0.25,0.10,0.32,0.40,-0.26,-0.39,0.30

[START] task=cascading_failure env=incident_response_env model=openai/gpt-4o-mini
[STEP] step=1 action=investigate auth-service reward=0.25 done=false error=null
[STEP] step=2 action=investigate api-gateway reward=0.22 done=false error=null
[STEP] step=3 action=investigate user-service reward=0.19 done=false error=null
[STEP] step=4 action=mitigate: rollback config token_verifier_v2 in auth-service reward=0.42 done=false error=null
[STEP] step=5 action=mitigate: rollback config token_verifier_v2 in auth-service reward=0.42 done=false error=null
[STEP] step=6 action=mitigate: rollback config token_verifier_v2 in auth-service reward=-0.26 done=false error=null
[STEP] step=7 action=resolve reward=0.28 done=true error=null
[END] success=true steps=7 score=0.96 rewards=0.25,0.22,0.19,0.42,0.42,-0.26,0.28

[START] task=ambiguous_payment_degradation env=incident_response_env model=openai/gpt-4o-mini
[STEP] step=1 action=investigate payment-service reward=-0.15 done=false error=null
[STEP] step=2 action=investigate payment-service db connection reward=-0.68 done=false error=null
[STEP] step=3 action=investigate payment-service rate limit reward=-0.46 done=false error=null
[STEP] step=4 action=investigate payment-service memory heap reward=-0.34 done=false error=null
[STEP] step=5 action=assign to payments-oncall reward=0.10 done=false error=null
[STEP] step=6 action=mitigate: increase memory allocation for payment-service reward=0.52 done=false error=null
[STEP] step=7 action=mitigate: increase memory allocation for payment-service reward=0.61 done=false error=null
[STEP] step=8 action=mitigate: increase memory allocation for payment-service reward=-0.07 done=false error=null
[STEP] step=9 action=mitigate: increase memory allocation for payment-service reward=-0.40 done=false error=null
[STEP] step=10 action=resolve reward=0.30 done=true error=null
[END] success=true steps=10 score=0.99 rewards=-0.15,-0.68,-0.46,-0.34,0.10,0.52,0.61,-0.07,-0.40,0.30

[SUMMARY] tasks=3 avg_score=0.9308 total_steps=24 all_success=True
single_service_outage          [████████████████░░░░] 0.8460 (7 steps)
cascading_failure              [███████████████████░] 0.9587 (7 steps)
ambiguous_payment_degradation  [███████████████████░] 0.9878 (10 steps)
```
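The trace above was produced by `inference.py`, but any HTTP client can drive the same episode loop. The sketch below builds the documented `/reset` and `/step` request bodies with the standard library only; `run_episode` is an illustrative helper (not part of the project) and assumes a server already running on the default local port:

```python
import json
import urllib.request


def reset_payload(task_id: str, seed: int = 42) -> dict:
    # Body for POST /reset
    return {"task_id": task_id, "seed": seed}


def step_payload(action_text: str) -> dict:
    # Body for POST /step: note the nested "action" object
    return {"action": {"action": action_text}}


def post(url: str, body: dict) -> dict:
    # Tiny JSON POST helper
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode(base_url: str = "http://localhost:7860") -> dict:
    # Drive the first steps of an episode against a running server
    post(f"{base_url}/reset", reset_payload("single_service_outage"))
    obs = post(f"{base_url}/step", step_payload("investigate payment-service"))
    obs = post(f"{base_url}/step", step_payload("assign to payments-oncall"))
    return obs  # observation fields as described under Observation Space
```

The nested `{"action": {"action": ...}}` shape matches the `/step` body documented under **API Endpoints** below.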
## Setup & Usage

### Requirements

```
pip install -r requirements.txt
```

### Environment Variables

```
export API_BASE_URL="https://your-llm-endpoint/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-api-key"

# For local testing
export ENV_URL="http://localhost:7860"

# For Hugging Face deployment (used by the evaluator)
export ENV_URL="https://.hf.space"  # default; only change if the server runs elsewhere
```

### Running

```
# Terminal 1 — start the environment server
python app.py

# Terminal 2 — run the inference script
python inference.py
```

### Running a Single Task

```
TASK_IDS_OVERRIDE=single_service_outage python inference.py
```

### Docker

```
docker build -t incidentiq .
docker run -p 7860:7860 \
  -e API_BASE_URL=$API_BASE_URL \
  -e MODEL_NAME=$MODEL_NAME \
  -e HF_TOKEN=$HF_TOKEN \
  incidentiq
```

### Validating the Environment

```
curl -s -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "single_service_outage", "seed": 42}' \
  | python -m json.tool
```

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start a new episode. Body: `{"task_id": "...", "seed": 42}` |
| `/step` | POST | Send an action. Body: `{"action": {"action": "..."}}` |
| `/state` | GET | Get the current episode state, including the ground truth used for grading |
| `/incident-meta` | GET | Get the incident metadata used by the grader |
| `/runbook` | GET | Get diagnostic hints for the affected services |
| `/tasks` | GET | List all available tasks |
| `/health` | GET | Server health check |

## OpenEnv Compliance

- Full `step()` / `reset()` / `state()` implementation
- Pydantic type models for all actions and observations
- Deterministic grader, no LLM calls
- Reproducible with fixed seeds
- Dense rewards across the whole trajectory
- Session isolation for parallel evaluation runs

## Project Structure

```
incident-response-env/
├── server/                  # Core environment server (FastAPI + OpenEnv)
│   ├── __init__.py
│   ├── app.py               # FastAPI app exposing /reset, /step, /state, /health
│   ├── environment.py       # Main environment logic (step/reset/state, reward)
│   ├── incidents.py         # Deterministic incident generator (seed-based)
│   ├── simulator.py         # Dynamic simulation (logs + metrics evolution)
│   ├── tasks.py             # Task configs (difficulty, max_steps, rewards)
│   └── models.py            # Pydantic models (Action, Observation, State)
├── inference.py             # Baseline agent (OpenAI-compatible client)
├── client.py                # Optional client helper for interacting with the env
├── openenv.yaml             # OpenEnv metadata (tasks, spaces, entrypoint)
├── Dockerfile               # Container setup for HF Spaces deployment
├── pyproject.toml           # Project config (used by uv)
├── uv.lock                  # Dependency lock file (reproducible builds)
├── requirements.txt         # Python dependencies (fallback install)
├── .env                     # Local environment variables (not committed)
├── .env.example             # Template for required env variables
├── .gitignore               # Ignore rules
├── validate-submission.sh   # Pre-submission validation script
├── README.md                # Project documentation
├── venv/ or .venv/          # Virtual environment (local only, ignored)
└── __pycache__/             # Python cache (auto-generated)
```

The inference script uses an OpenAI-compatible client interface configured via `API_BASE_URL` and `MODEL_NAME`.

## License

MIT
Tags: Agent, AIOps, Docker, Hugging Face, OpenEnv, Python, Site Reliability Engineering, SRE, simulation environment, reinforcement learning, fault diagnosis, root cause analysis, automated remediation, memory leaks, database degradation, certificate expiry, connection pool exhaustion, rate limiting, modular design, production environment