aryanosh/devops-incident-response

GitHub: aryanosh/devops-incident-response

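A simulation environment built on Docker and the OpenEnv standard that evaluates an AI agent's root-cause analysis and remediation skills in DevOps incident response, using microservice failure scenarios at four difficulty levels.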

Stars: 1 | Forks: 0

---
title: DevOps Incident Response OpenEnv
emoji: "🚨"
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - devops
  - rl-environment
  - sre
  - pytorch
---

# DevOps Incident Response OpenEnv

This environment stress-tests AI models on three skills: multi-service debugging, dependency-chain tracing, and safe remediation. These are the skills that separate an excellent SRE from a reactive responder. The environment is built on the [OpenEnv](https://huggingface.co/openenv) standard and provides deterministic scoring, strict Pydantic schemas, and zero-config Docker deployment.

[![HuggingFace Space](https://img.shields.io/badge/🤗%20Live%20Demo-HuggingFace%20Space-yellow)](https://huggingface.co/spaces/aryanosh/devops-incident-response) [![GitHub](https://img.shields.io/badge/GitHub-Source-black)](https://github.com/aryanosh/devops-incident-response)

## Task Overview

Four failure scenarios of increasing difficulty run on a realistic microservice topology of 6 services. Each task requires the agent to resist patching the surface symptom and instead trace the failure to its true root cause.

| Level | Task ID | Scenario | Root Cause | Required Fix |
|---|---|---|---|---|
| 🟢 Easy | `easy_task` | Single-service crash | `api_gateway`: `service_crash` | `restart_service` |
| 🟡 Medium | `medium_task` | Memory leak in the order service | `order_service`: `memory_leak` | `memory_fix` |
| 🔴 Hard | `hard_task` | Cascading disk saturation | `database`: `disk_full` | `clear_disk` |
| 🟣 Expert | `expert_task` | Dual root cause: DB + payment failure | `database` + `payment_service` | `clear_disk` + `drain_connections` |

### Service Dependency Graph

```
api_gateway
├── auth_service
│   └── user_service
│       └── database
└── order_service
    ├── payment_service
    │   └── database
    └── database
```

The **Hard** and **Expert** tasks deliberately surface symptoms on upstream services (`api_gateway`, `order_service`) while the true root cause sits in `database`. The agent must trace the dependency chain rather than patch the visible alerts.

## Quick Start

### 1. Run with Docker

```
docker build -t devops_incident_env .
docker run -p 8000:8000 devops_incident_env
```

The server starts at `http://localhost:8000`. Visit `/docs` for the interactive Swagger UI.

### 2. Run the evaluation agent

```
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your_huggingface_token"
export ENV_URL="http://127.0.0.1:8000"  # or leave unset to use LocalEnvClient
python inference.py
```

`HF_TOKEN` is required. If it is missing, the script fails immediately.

### 3. Run the tests

```
pytest tests/ -v
```

## API Reference

All routes conform to the OpenEnv HTTP specification.

| Method | Route | Description |
|---|---|---|
| `GET` | `/` | Environment manifest |
| `GET` | `/health` | Liveness check; returns `{"status": "healthy"}` |
| `GET` | `/tasks` | Lists all 4 task definitions with their metadata |
| `GET` | `/manifest` | Full environment schema |
| `POST` | `/reset` | Starts a new episode: `{"task_id": "hard_task", "seed": 42}` |
| `POST` | `/step` | Submits an action and receives an observation and reward |
| `GET` | `/state` | Snapshot of the current episode state |
| `GET` | `/grader` | Final grader score for the current episode |
| `GET` | `/baseline` | Next recommended action from the rule-based baseline |
| `GET` | `/sample_action` | Alias for `/baseline` |

## Action and Observation Schema

### Action (`POST /step`)

```
{
  "action": {
    "action_type": "diagnose",
    "service": "database",
    "diagnosis": "disk_full",
    "reasoning": "WAL logs indicate no space left on device. confidence=0.92"
  }
}
```

**Valid `action_type` values:** `read_logs` · `query_metrics` · `diagnose` · `apply_fix` · `verify_health` · `list_services` · `inspect_dependencies`

**Valid `diagnosis` values:** `service_crash` · `memory_leak` · `high_latency` · `connection_pool_exhaustion` · `disk_full` · `certificate_expired` · `config_drift`

**Valid `fix` values:** `restart_service` · `memory_fix` · `clear_disk` · `scale_up` · `rollback_config` · `renew_certificate` · `drain_connections` · `clear_cache`

### Observation (response from `/step` and `/reset`)

```
{
  "observation": {
    "action_result": "Retrieved logs for database.",
    "message": "Recent logs show a pattern consistent with disk full.",
    "logs": [...],
    "metrics": {...},
    "service_summaries": [...],
    "active_alerts": [...],
    "dependency_graph": {...},
    "step_number": 4,
    "max_steps": 12,
    "steps_remaining": 8,
    "available_services": [...],
    "available_actions": [...]
  },
  "reward": 0.04,
  "done": false,
  "info": {
    "task_id": "hard_task",
    "last_action_error": null,
    "trajectory_reward": 0.16
  }
}
```

## Reward System

The environment emits a dense per-step reward plus a separate, deterministic final grader score. **All rewards and scores lie strictly inside `(0, 1)`**: they are never exactly `0` or `1`.

### Per-Step Rewards

| Action | Reward | Condition |
|---|---|---|
| `read_logs` / `query_metrics` on a root-cause service | `+0.04` | First time only |
| `read_logs` / `query_metrics` on an affected service | `+0.03` | First time only |
| `diagnose` (correct, root cause) | `+0.08` | Failure mode identified correctly |
| `diagnose` (correct, affected service) | `+0.03` | Symptom identified correctly |
| `apply_fix` (correct fix, correct service) | `+0.12` | Correct fix applied to the correct service |
| `verify_health` (after the fix) | `+0.04` | Confirms recovery |
| `inspect_dependencies` (first time) | `+0.02` | Traversal of a new service |
| `list_services` (first call) | `+0.015` | One-time discovery bonus |
| Wrong / invalid / destructive action | `+0.05` (floor) | Penalized through the grader's safety score |

Per-step rewards are clamped to `[0.05, 0.95]` before they are emitted, so in the example trace below every step whose table value is under `0.05` is logged as `0.050`. The terminal step always emits the floor reward `0.05`; the authoritative episode score comes from the grader.

### Final Grader Score

The grader combines four weighted components into a final score in `[0.05, 0.95]`:

| Component | Weight | What It Measures |
|---|---|---|
| Root-cause identification | 35% | Did the agent find the correct service **and** the correct failure mode? |
| Resolution | 30% | Was the correct fix applied, and did it succeed? |
| Efficiency | 20% | How does the actual step count compare with the optimal path? |
| Safety | 15% | Did the agent avoid destructive or invalid actions? |

```
final_score = 0.35 × root_id + 0.30 × resolution + 0.20 × efficiency + 0.15 × safety
```

Every component score is clamped to `[0.05, 0.95]`. The total is then clamped to the same range.

## Example Trace

A successful Hard task run (Qwen/Qwen2.5-72B-Instruct):

```
[START] task=hard_task env=devops_incident_env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=list_services() reward=0.050 done=false error=null
[STEP] step=2 action=read_logs(api_gateway) reward=0.050 done=false error=null
[STEP] step=3 action=inspect_dependencies(api_gateway) reward=0.050 done=false error=null
[STEP] step=4 action=read_logs(database) reward=0.050 done=false error=null
[STEP] step=5 action=query_metrics(database) reward=0.050 done=false error=null
[STEP] step=6 action=diagnose(database) reward=0.080 done=false error=null
[STEP] step=7 action=apply_fix(database) reward=0.120 done=false error=null
[STEP] step=8 action=verify_health(database) reward=0.050 done=true error=null
[END] success=true steps=8 rewards=0.050,0.050,0.050,0.050,0.050,0.080,0.120,0.050
```

The agent correctly resisted patching `api_gateway` (the visible alert) and traced the dependency chain to `database` as the true root cause.

## Agent Performance Comparison

To show that the environment discriminates between agents and resists exploitation, the table below compares a reactive, naive baseline with a reasoning LLM agent (`Qwen2.5-72B-Instruct` with an improved system prompt):

| Task | Difficulty | Reactive Baseline | LLM Agent |
|---|---|---|---|
| `easy_task` | 🟢 Easy | `0.800` | `0.880` |
| `medium_task` | 🟡 Medium | `0.860` | `0.880` |
| `hard_task` | 🔴 Hard | `0.117` | `0.880` |
| `expert_task` | 🟣 Expert | `0.117` | `0.880` |

*Note: the reactive baseline scores reasonably on the Easy and Medium tasks because the surface symptom coincides with the root cause. It fails badly on the Hard and Expert tasks, where the environment requires tracing dependencies from the alert to the true root cause and heavily penalizes blind fix attempts.*

Committed run artifacts are available under `outputs/` for evaluators to verify:

- `outputs/inference_baseline_run.txt`: full `[START]`/`[STEP]`/`[END]` trace for all 4 tasks
- `outputs/task_score_summary.json`: final grader score for each task

## Project Structure

```
devops_incident_env/
├── server/
│   ├── app.py                       # FastAPI app: all HTTP routes + score clamping middleware
│   └── environment.py               # Core RL environment: step logic, reward emission, state tracking
├── tasks.py                         # Scenario configs, service graph, log/metric templates
├── grader.py                        # Deterministic final-score formula (4-component weighted sum)
├── models.py                        # Pydantic schemas: Action, Observation, State, Task
├── constants.py                     # All reward values, grader weights, score bounds (SCORE_FLOOR=0.05, SCORE_CEILING=0.95)
├── baseline.py                      # Rule-based baseline agent (used as inference fallback)
├── inference.py                     # Evaluation runner: LLM agent + structured [START]/[STEP]/[END] stdout
├── client.py                        # Thin HTTP client for remote environment interaction
├── openenv.yaml                     # OpenEnv spec manifest
├── requirements.txt                 # Python dependencies
├── Dockerfile                       # Container definition (python:3.11-slim, port 8000)
├── outputs/
│   ├── inference_baseline_run.txt   # Committed baseline trace
│   └── task_score_summary.json      # Committed score snapshot
└── tests/
    ├── test_environment.py          # Environment lifecycle and state tests
    └── test_fixes.py                # Per-scenario fix and grader correctness tests
```

## Design Principles

**Deterministic scoring.** The final score is computed by rule-based logic in `grader.py`, not by an LLM judge. Scores are reproducible and explainable.

**Dependency-aware reward shaping.** Investigating the correct service in the dependency chain earns more reward than investigating a surface symptom. This is meant to teach causal debugging rather than symptom patching.

**Strict score bounds.** Every reward emitted by `/step` and every score returned by `/grader` is clamped to `[0.05, 0.95]`. `0.0` and `1.0` are never returned.

**Anti-abuse mechanism.** Applying the wrong fix, fixing a service that is already healthy, and repeating a fix all count as destructive actions. Each destructive action reduces the Safety component by 50%, which gives a strong signal against brute-force fixing.

**Red-herring separation.** Affected services (those that depend on the root-cause service) exhibit plausible failure modes of their own, such as `high_latency` and `connection_pool_exhaustion`, to stress-test the agent's ability to tell symptoms from root causes.

**Multi-root-cause expert challenge.** The Expert task requires resolving two independent root causes (`database` and `payment_service`) in a coordinated order. Partial fixes are explicitly penalized in the Resolution component.

## Space Availability

A GitHub Actions workflow at `.github/workflows/space-keepalive.yml` pings `/health` every 10 minutes to keep the HuggingFace Space from sleeping during the evaluation window.

Live environment: `https://aryanosh-devops-incident-response.hf.space`
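
The final grader formula described in the reward section (four weighted components, each clamped, with the total clamped again) can be illustrated with a short sketch. This is not the code in `grader.py`: the function names `clamp` and `final_score` and the `WEIGHTS` dictionary are hypothetical, and only the weights and the `SCORE_FLOOR` / `SCORE_CEILING` values are taken from this README.

```python
# Illustrative sketch only; the real logic lives in grader.py and may differ.
# SCORE_FLOOR / SCORE_CEILING values are the ones quoted for constants.py.
SCORE_FLOOR = 0.05
SCORE_CEILING = 0.95

# Hypothetical mapping of component name to the weight stated in the README.
WEIGHTS = {
    "root_id": 0.35,
    "resolution": 0.30,
    "efficiency": 0.20,
    "safety": 0.15,
}


def clamp(value: float) -> float:
    """Clamp a score into [SCORE_FLOOR, SCORE_CEILING]."""
    return max(SCORE_FLOOR, min(SCORE_CEILING, value))


def final_score(components: dict[str, float]) -> float:
    """Weighted sum of the clamped components, clamped once more."""
    total = sum(WEIGHTS[name] * clamp(components[name]) for name in WEIGHTS)
    return clamp(total)


if __name__ == "__main__":
    # Example component values chosen for illustration, not measured results.
    example = {"root_id": 0.95, "resolution": 0.95, "efficiency": 0.5, "safety": 0.5}
    print(f"{final_score(example):.4f}")
```

Because every component is clamped before weighting and the weights sum to 1, the result always stays inside the documented bounds, even when a component is reported as `0.0` or `1.0`.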

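Tags: AI training environment, Docker, OpenEnv, PyTorch, SRE, incident handling, dependency tracing, bias filtering, memory leak, security defense evaluation, containerized deployment, reinforcement learning, microservice debugging, incident response, service restart, root-cause analysis, site reliability engineering, system operations, automated operations, request interception, reverse-engineering tools, zero-config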