GitHub: RehanNaveid/incident-response-env

A reinforcement learning simulation environment for SRE incident response, in which AI agents train on live telemetry and are evaluated on diagnostic reasoning and automated remediation.


---
title: IncidentIQ
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: "latest"
python_version: "3.12"
app_file: app.py
pinned: false
---

# IncidentIQ — SRE Incident Response Environment

[![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-brightgreen)](https://openenv.dev) [![HuggingFace](https://img.shields.io/badge/🤗-HuggingFace%20Space-yellow)](https://huggingface.co) [![Docker](https://img.shields.io/badge/Docker-ready-blue)](https://www.docker.com)

## Description & Motivation

Site Reliability Engineers respond to production incidents under time pressure, diagnosing root causes from incomplete information (logs, metrics, and service dependencies) and applying targeted fixes before the SLA window expires. This is a structured reasoning task that real engineers perform daily, and one that AI agents are increasingly expected to assist with.

**IncidentIQ** simulates this task as a reinforcement learning environment. The agent receives live system telemetry, must reason about the root cause from log evidence, and must execute the correct sequence of actions to resolve the incident. The environment models realistic failure modes:

- Connection pool exhaustion
- Cascading config deploys
- Memory leaks
- Certificate expiry
- Upstream rate limiting
- Database degradation

Training agents in this environment develops structured diagnostic reasoning, evidence-based decision making, and sequential planning under constraints: skills that transfer directly to real SRE automation.

## Action Space

Actions are **free-text commands**. The environment parses intent from the text.

| Format | Description |
|---|---|
| `investigate <service>` | Inspect a specific service. The service name must exactly match an entry in **AFFECTED SERVICES**. |
| `assign to <team>` | Assign the incident to a team. The team name must exactly match an entry in **TEAM ROSTER**. |
| `mitigate: <fix>` | Apply a targeted fix. The fix keywords must appear in an `ERROR` or `CRIT` log line. |
| `escalate` | Raise the incident severity. |
| `resolve` | Close the incident. Only valid after a confirmed mitigation. |

**Execution rules:**

- Mitigation requires having investigated at least one affected service first.
- Resolution requires a confirmed mitigation.
- Repeating the same action incurs a penalty.

## Observation Space

Each call to `step()` returns an `IncidentObservation` object with the following fields:

| Field | Type | Description |
|---|---|---|
| `task_id` | `string` | Task identifier |
| `incident_description` | `string` | Human-readable incident title |
| `affected_services` | `list[string]` | Services involved in the incident |
| `severity` | `P0 \| P1 \| P2` | Incident severity level |
| `logs` | `list[string]` | Last 8 log lines with ISO 8601 timestamps |
| `metrics` | `list[object]` | Per-service `error_rate_pct`, `latency_p99_ms`, `throughput_rps`, `status` |
| `feedback` | `string` | Natural-language outcome of the previous action |
| `reward` | `float` | Step reward for the previous action |
| `score` | `float [0,1]` | Cumulative normalized episode progress |
| `sla_remaining` | `int` | Minutes remaining before SLA breach |
| `team_roster` | `dict` | `team_name → available \| busy` |

## Tasks

### Task 1 — Single-Service Outage *(Easy, max 10 steps)*

One service is down, and the root cause is clearly visible in the logs. The agent must investigate the affected service, assign the correct team, apply a mitigation whose keywords match the error logs, then resolve. A competent agent should finish in 4–5 steps.

### Task 2 — Cascading Failure *(Medium, max 18 steps)*

Three services fail in sequence after a config deploy. The agent must investigate all three services, identify the root-cause service, apply the correct rollback, and resolve before a tight SLA expires. This requires systematic investigation before any mitigation.

### Task 3 — Ambiguous Payment Degradation *(Hard, max 25 steps)*

The payment service degrades, with three plausible root causes and two deliberate red herrings in the logs. The agent must investigate multiple hypothesis domains (upstream rate limiting, database issues, resource exhaustion), identify the true cause, and apply the correct mitigation. An optional `reasoning` field on actions is graded for partial credit, making this the only task that rewards chain-of-thought rather than actions alone.

## Reward Design

Rewards are **dense and continuous** across the episode:

| Action | Reward |
|---|---|
| Investigate correct service | `+0.15` to `+0.23` |
| Assign correct team | `+0.10` |
| Apply correct mitigation keywords | `+0.20` to `+0.52` |
| Mitigate without investigating first | `−0.25` |
| Resolve before mitigation confirmed | `−0.40` |
| Repeated action | `−0.20` |
| Unknown action | `−0.20` |
| SLA breach (past 60% of window) | Progressive decay |

The final episode score is computed by a **deterministic grader** applied to the full action history. The grader checks investigation completeness, team-assignment accuracy, mitigation-keyword match, resolution, and efficiency. All scores fall in `[0.0, 1.0]`.

## Baseline Results

Run with `seed=42`, model `openai/gpt-4o-mini`, and no task-specific hints in the prompt.

| Task | Difficulty | Steps Used | Score |
|---|---|---|---|
| `single_service_outage` | Easy | 7 | **0.85** |
| `cascading_failure` | Medium | 7 | **0.96** |
| `ambiguous_payment_degradation` | Hard | 10 | **0.99** |
| **Average** | | **8.0** | **0.93** |
Full inference log:

```
[START] task=single_service_outage env=incident_response_env model=openai/gpt-4o-mini
[STEP] step=1 action=investigate payment-service reward=0.25 done=false error=null
[STEP] step=2 action=assign to payments-oncall reward=0.10 done=false error=null
[STEP] step=3 action=mitigate: increase memory limit for payment-service pods reward=0.32 done=false error=null
[STEP] step=4 action=mitigate: increase memory limit for payment-service pods reward=0.40 done=false error=null
[STEP] step=5 action=mitigate: increase memory limit for payment-service pods reward=-0.26 done=false error=null
[STEP] step=6 action=mitigate: increase memory limit for payment-service pods reward=-0.39 done=false error=null
[STEP] step=7 action=resolve reward=0.30 done=true error=null
[END] success=true steps=7 score=0.85 rewards=0.25,0.10,0.32,0.40,-0.26,-0.39,0.30

[START] task=cascading_failure env=incident_response_env model=openai/gpt-4o-mini
[STEP] step=1 action=investigate auth-service reward=0.25 done=false error=null
[STEP] step=2 action=investigate api-gateway reward=0.22 done=false error=null
[STEP] step=3 action=investigate user-service reward=0.19 done=false error=null
[STEP] step=4 action=mitigate: rollback config token_verifier_v2 in auth-service reward=0.42 done=false error=null
[STEP] step=5 action=mitigate: rollback config token_verifier_v2 in auth-service reward=0.42 done=false error=null
[STEP] step=6 action=mitigate: rollback config token_verifier_v2 in auth-service reward=-0.26 done=false error=null
[STEP] step=7 action=resolve reward=0.28 done=true error=null
[END] success=true steps=7 score=0.96 rewards=0.25,0.22,0.19,0.42,0.42,-0.26,0.28

[START] task=ambiguous_payment_degradation env=incident_response_env model=openai/gpt-4o-mini
[STEP] step=1 action=investigate payment-service reward=-0.15 done=false error=null
[STEP] step=2 action=investigate payment-service db connection reward=-0.68 done=false error=null
[STEP] step=3 action=investigate payment-service rate limit reward=-0.46 done=false error=null
[STEP] step=4 action=investigate payment-service memory heap reward=-0.34 done=false error=null
[STEP] step=5 action=assign to payments-oncall reward=0.10 done=false error=null
[STEP] step=6 action=mitigate: increase memory allocation for payment-service reward=0.52 done=false error=null
[STEP] step=7 action=mitigate: increase memory allocation for payment-service reward=0.61 done=false error=null
[STEP] step=8 action=mitigate: increase memory allocation for payment-service reward=-0.07 done=false error=null
[STEP] step=9 action=mitigate: increase memory allocation for payment-service reward=-0.40 done=false error=null
[STEP] step=10 action=resolve reward=0.30 done=true error=null
[END] success=true steps=10 score=0.99 rewards=-0.15,-0.68,-0.46,-0.34,0.10,0.52,0.61,-0.07,-0.40,0.30

[SUMMARY] tasks=3 avg_score=0.9308 total_steps=24 all_success=True
single_service_outage          [████████████████░░░░] 0.8460 (7 steps)
cascading_failure              [███████████████████░] 0.9587 (7 steps)
ambiguous_payment_degradation  [███████████████████░] 0.9878 (10 steps)
```
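The trace above was produced by `inference.py`, but any HTTP client can drive the same episode loop. The sketch below builds the documented `/reset` and `/step` request bodies with the standard library only; `run_episode` is an illustrative helper (not part of the project) and assumes a server already running on the default local port:

```python
import json
import urllib.request


def reset_payload(task_id: str, seed: int = 42) -> dict:
    # Body for POST /reset
    return {"task_id": task_id, "seed": seed}


def step_payload(action_text: str) -> dict:
    # Body for POST /step: note the nested "action" object
    return {"action": {"action": action_text}}


def post(url: str, body: dict) -> dict:
    # Tiny JSON POST helper
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode(base_url: str = "http://localhost:7860") -> dict:
    # Drive the first steps of an episode against a running server
    post(f"{base_url}/reset", reset_payload("single_service_outage"))
    obs = post(f"{base_url}/step", step_payload("investigate payment-service"))
    obs = post(f"{base_url}/step", step_payload("assign to payments-oncall"))
    return obs  # observation fields as described under Observation Space
```

The nested `{"action": {"action": ...}}` shape matches the `/step` body documented under **API Endpoints** below.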
## Setup & Usage

### Requirements

```
pip install -r requirements.txt
```

### Environment Variables

```
export API_BASE_URL="https://your-llm-endpoint/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-api-key"

# For local testing
export ENV_URL="http://localhost:7860"

# For Hugging Face deployment (used by the evaluator)
export ENV_URL="https://.hf.space"  # default; only change if the server runs elsewhere
```

### Running

```
# Terminal 1 — start the environment server
python app.py

# Terminal 2 — run the inference script
python inference.py
```

### Running a Single Task

```
TASK_IDS_OVERRIDE=single_service_outage python inference.py
```

### Docker

```
docker build -t incidentiq .
docker run -p 7860:7860 \
  -e API_BASE_URL=$API_BASE_URL \
  -e MODEL_NAME=$MODEL_NAME \
  -e HF_TOKEN=$HF_TOKEN \
  incidentiq
```

### Validating the Environment

```
curl -s -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "single_service_outage", "seed": 42}' \
  | python -m json.tool
```

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start a new episode. Body: `{"task_id": "...", "seed": 42}` |
| `/step` | POST | Send an action. Body: `{"action": {"action": "..."}}` |
| `/state` | GET | Get the current episode state, including the ground truth used for grading |
| `/incident-meta` | GET | Get the incident metadata used by the grader |
| `/runbook` | GET | Get diagnostic hints for the affected services |
| `/tasks` | GET | List all available tasks |
| `/health` | GET | Server health check |

## OpenEnv Compliance

- Full `step()` / `reset()` / `state()` implementation
- Pydantic type models for all actions and observations
- Deterministic grader, no LLM calls
- Reproducible with fixed seeds
- Dense rewards across the whole trajectory
- Session isolation for parallel evaluation runs

## Project Structure

```
incident-response-env/
├── server/                  # Core environment server (FastAPI + OpenEnv)
│   ├── __init__.py
│   ├── app.py               # FastAPI app exposing /reset, /step, /state, /health
│   ├── environment.py       # Main environment logic (step/reset/state, reward)
│   ├── incidents.py         # Deterministic incident generator (seed-based)
│   ├── simulator.py         # Dynamic simulation (logs + metrics evolution)
│   ├── tasks.py             # Task configs (difficulty, max_steps, rewards)
│   └── models.py            # Pydantic models (Action, Observation, State)
├── inference.py             # Baseline agent (OpenAI-compatible client)
├── client.py                # Optional client helper for interacting with the env
├── openenv.yaml             # OpenEnv metadata (tasks, spaces, entrypoint)
├── Dockerfile               # Container setup for HF Spaces deployment
├── pyproject.toml           # Project config (used by uv)
├── uv.lock                  # Dependency lock file (reproducible builds)
├── requirements.txt         # Python dependencies (fallback install)
├── .env                     # Local environment variables (not committed)
├── .env.example             # Template for required env variables
├── .gitignore               # Ignore rules
├── validate-submission.sh   # Pre-submission validation script
├── README.md                # Project documentation
├── venv/ or .venv/          # Virtual environment (local only, ignored)
└── __pycache__/             # Python cache (auto-generated)
```

The inference script uses an OpenAI-compatible client interface configured via `API_BASE_URL` and `MODEL_NAME`.

## License

MIT
Tags: Agent, AIOps, Docker, Hugging Face, OpenEnv, Python, Site Reliability Engineering, SRE, simulation environment, reinforcement learning, fault diagnosis, root cause analysis, automated remediation, memory leaks, database degradation, certificate expiry, connection pool exhaustion, rate limiting, modular design, production environment