MohitSalvi16/devops-incident-response-openenv

GitHub: MohitSalvi16/devops-incident-response-openenv

An OpenEnv-based environment for training and evaluating AI agents on realistic incident-response problems in DevOps and SRE scenarios.


---
title: DevOps Incident Response
emoji: 🚨
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv SRE incident-response environment for AI agents
tags:
  - openenv
  - rl
  - agents
  - devops
  - sre
  - incident-response
---

# DevOps Incident Response — OpenEnv Environment

[![openenv](https://img.shields.io/badge/openenv-validated-brightgreen)](https://github.com/meta-pytorch/OpenEnv) [![tests](https://img.shields.io/badge/tests-24%20passing-brightgreen)]() [![tasks](https://img.shields.io/badge/tasks-4-blue)]()

## Why This Environment Exists

Production incidents cost companies an average of **$5,600 per minute** of downtime (Gartner). On-call engineers must rapidly context-switch, read dense logs, correlate failures across services, and apply precise fixes — often at 3 AM. This environment provides a realistic, deterministic testbed for evaluating whether AI agents can handle that workflow:

- **Log analysis** across multiple services
- **Root-cause diagnosis** under ambiguity
- **Code & configuration repair** with grader-verified fixes
- **Multi-service correlation** for cascading failures
- **Recovery verification** via simulated shell commands

This is a domain we have not seen modelled in any existing OpenEnv environment, and one with immediate value for the RL / agent community: training agents on incident response directly reduces mean-time-to-resolution (MTTR) and improves system reliability.

## Quick Start

### Run the Server (Local)

```
pip install -r requirements.txt
python -m server.app  # listens on 0.0.0.0:${PORT:-7860}
```

### Run the Server (Docker)

```
docker build -t devops-incident-env .
docker run -p 7860:7860 -e PORT=7860 devops-incident-env

curl http://localhost:7860/health  # → {"status":"healthy"}
curl -X POST -H 'Content-Type: application/json' \
  -d '{}' http://localhost:7860/reset  # → initial observation
```

The Dockerfile honours `$PORT` (Hugging Face Spaces injects it), so the same image runs unchanged on HF Spaces and locally.
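The same smoke test can be scripted from Python with only the standard library. This is a minimal sketch, assuming the server above is already listening on port 7860; the helper names (`build_request`, `call`) are illustrative, not part of the repo:

```python
import json
import urllib.request

BASE = "http://localhost:7860"

def build_request(path, body=None):
    """Build a GET (body is None) or JSON POST request for the env server."""
    data = None if body is None else json.dumps(body).encode()
    return urllib.request.Request(
        f"{BASE}{path}",
        data=data,
        headers={"Content-Type": "application/json"},
    )

def call(path, body=None):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(path, body)) as resp:
        return json.load(resp)

# Usage once the server is up:
#   call("/health")          # → {"status": "healthy"}
#   obs = call("/reset", body={})  # the initial observation
```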
### Run the Baseline Agent

```
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
export HF_TOKEN=sk-...
python inference.py
```

If `HF_TOKEN` / `OPENAI_API_KEY` are unset, `inference.py` falls back to a deterministic heuristic baseline, so the script always produces a reproducible score.

### Run the Local Pre-submission Validator

```
bash scripts/pre_validate.sh                # full
SKIP_DOCKER=1 bash scripts/pre_validate.sh  # skip docker build
```

## OpenEnv Contract

This environment uses `openenv.core.env_server.http_server.create_app`, so all standard endpoints are wired automatically:

| Method | Path            | Purpose                                           |
|--------|-----------------|---------------------------------------------------|
| GET    | `/health`       | `{"status":"healthy"}` for liveness probes        |
| GET    | `/metadata`     | Environment name + description + version          |
| GET    | `/schema`       | JSON schemas for `action`, `observation`, `state` |
| GET    | `/openapi.json` | OpenAPI 3.x spec (used by `openenv validate`)     |
| POST   | `/reset`        | Reset and return initial observation              |
| POST   | `/step`         | Execute an action, return observation             |
| GET    | `/state`        | Current `episode_id` + `step_count`               |
| POST   | `/mcp`          | JSON-RPC MCP endpoint                             |
| WS     | `/ws`           | Persistent WebSocket session                      |

`openenv validate` passes:

```
[OK] Meta_Hackathon: Ready for multi-mode deployment
```

## Action Space

The agent issues a `DevOpsAction` (subclass of `openenv.core.env_server.types.Action`) per step:

| `action_type` | `target`     | `content`                               |
|---------------|--------------|-----------------------------------------|
| `read_log`    | log filename | _empty_                                 |
| `read_file`   | file path    | _empty_                                 |
| `diagnose`    | _empty_      | root-cause hypothesis                   |
| `edit_file`   | file path    | **complete** replacement file contents  |
| `run_command` | _empty_      | shell command (e.g. `nginx -s reload`)  |
| `submit_fix`  | _empty_      | summary of what was fixed               |

## Observation Space

`DevOpsObservation` (subclass of `openenv.core.env_server.types.Observation`):

| Field                | Type            | Description                               |
|----------------------|-----------------|-------------------------------------------|
| `step`               | `int`           | Current step number                       |
| `max_steps`          | `int`           | Episode step limit                        |
| `task_id`            | `str`           | Active task identifier                    |
| `task_description`   | `str`           | Human-readable objective                  |
| `alert_message`      | `str`           | PagerDuty-style alert (first step only)   |
| `logs`               | `dict[str,str]` | Log file contents keyed by filename       |
| `files`              | `dict[str,str]` | Source/config files keyed by path         |
| `command_output`     | `str`           | Output of last `run_command`              |
| `system_status`      | `str`           | `up` / `degraded` / `down`                |
| `diagnosis_feedback` | `str`           | Feedback on the last `diagnose` action    |
| `error`              | `str`           | Error if the last action was invalid      |
| `cumulative_reward`  | `float`         | Running total reward                      |
| `final_score`        | `float`         | Normalised [0,1] score (set when `done`)  |
| `done`               | `bool`          | Episode terminated                        |
| `reward`             | `float`         | Reward for this step                      |

## Reward Function

Dense rewards are emitted on every step (no sparse end-of-episode reward).
| Component             | Reward            | Trigger                                     |
|-----------------------|-------------------|---------------------------------------------|
| Information gathering | +0.02             | First read of each log/file/command         |
| Partial diagnosis     | +0.05 – 0.15      | Identifies some root causes                 |
| Full diagnosis        | +0.20 – 0.30      | Identifies all root causes                  |
| Partial fix           | +0.05 – 0.25      | Fixes some files                            |
| Full fix              | +0.40 – 0.50      | All files patched correctly                 |
| Successful resolution | +0.20             | `submit_fix` with `system_status == "up"`   |
| Efficiency bonus      | +0.00 – 0.10      | Fewer steps → higher bonus                  |
| Repeated action       | −0.05 × count     | Escalating penalty for loops                |
| Wrong submit          | −0.10 × count     | `submit_fix` while system still down        |
| **Naked submit**      | **−0.20 × count** | `submit_fix` without any prior `edit_file` (anti-exploit) |
| Episode timeout       | −0.10             | Max steps hit with system still down        |

The grader normalises the cumulative reward into a `final_score ∈ [0, 1]`. **No grader returns a constant score** — every task produces meaningfully varying scores depending on agent quality (verified by the parametrised test `tests/test_tasks.py::test_grader_score_bounds`).

## Tasks

| ID                              | Difficulty | Max steps | Root causes |
|---------------------------------|------------|-----------|-------------|
| `easy_port_misconfiguration`    | easy       | 15        | 1           |
| `medium_database_connection`    | medium     | 20        | 2           |
| `medium_kubernetes_crashloop`   | medium     | 22        | 2           |
| `hard_microservice_cascade`     | hard       | 25        | 5           |

### Easy — Port Misconfiguration

Nginx is configured to `listen 8080;` but the load balancer expects port 80, and port 8080 is occupied by a monitoring agent. **One** root cause; mostly tests log-reading and basic config editing.

### Medium — Database Connection Pool Exhaustion

The application returns 503s under load. **Two** interacting failures: the connection pool is sized at 2 (it must be ≥ 10), AND the application code never releases connections back to the pool. The agent must edit BOTH `database.yml` AND `user_service.py`.
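The second failure in this task is a classic leak pattern. The toy sketch below shows the shape of the fix — releasing connections in a `finally` block — using hypothetical stand-in classes, not the repo's actual `user_service.py`:

```python
import contextlib

class Pool:
    """Toy stand-in for a DB connection pool (illustrative only)."""
    def __init__(self, size):
        self.free = size

    def acquire(self):
        if self.free == 0:
            raise RuntimeError("pool exhausted")  # the 503s in this task
        self.free -= 1
        return object()

    def release(self, conn):
        self.free += 1

@contextlib.contextmanager
def connection(pool):
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)  # the missing release is the leak

pool = Pool(size=2)
for _ in range(10):  # without the finally-release, the 3rd call would fail
    with connection(pool):
        pass

print(pool.free)  # → 2: every connection went back to the pool
```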
### Medium — Kubernetes CrashLoopBackOff

All 3 replicas of `payment-service` are in CrashLoopBackOff. **Two** interacting failures: (1) the Secret `payment-secrets` is missing the `database_url` key, AND (2) the liveness probe path is `/health` but the service only exposes `/healthz`. The agent must inspect the kubectl events, the Deployment YAML, and the Secret YAML, and patch BOTH.

### Hard — Microservice Cascading Failure

A SEV-1 cascading outage across `api-gateway`, `order-service`, `inventory-service`, and Redis. **Five** interacting root causes: a disabled circuit breaker, a retry storm, a missing inter-service timeout, an unhandled `WatchError` on Redis, and Redis OOM with a `noeviction` policy. It genuinely challenges frontier models — partial credit is awarded per fix.

## Benchmark Scores

Reproducible from `python inference.py`:

| Task                              | Heuristic agent | `mistral-small-latest`¹  |
|-----------------------------------|-----------------|--------------------------|
| `easy_port_misconfiguration`      | **0.99**        | **0.99** (10 steps)      |
| `medium_database_connection`      | 0.01            | 0.01                     |
| `medium_kubernetes_crashloop`     | **0.99**        | **0.99** (11 steps)      |
| `hard_microservice_cascade`       | 0.01            | 0.01                     |
| **Average**                       | **0.50**        | **0.50**                 |
| **Tasks passed (≥0.5)**           | **2/4**         | **2/4**                  |

¹ Representative run against `mistral-small-latest` via the Mistral La Plateforme API, `temperature=0`, on 2026-04-12. Hosted LLM APIs exhibit residual non-determinism even at zero temperature; across 3 independent runs the average score for this model ranged **0.23 – 0.50**, which is exactly the kind of variance Phase 2 evaluation looks for (constant scores across runs are a disqualification criterion). The heuristic baseline is fully deterministic.

Both medium-DB and hard-cascade remain unsolved by a small open model — they require coordinated multi-file edits and cascading root-cause analysis that genuinely challenge frontier models.
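To give a flavour of what the unsolved hard task demands, here is one of its five fixes in miniature: replacing an unbounded retry loop (the retry storm) with bounded retries and exponential backoff. This is a hypothetical illustration, not the repo's `order-service` code:

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay_s=0.01):
    """Bounded retries with exponential backoff. An unbounded retry loop
    (the task's broken version) amplifies load on an already-failing
    downstream service — the classic retry storm."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up instead of hammering the dependency
            time.sleep(base_delay_s * (2 ** attempt))

# A fake downstream call that times out twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("downstream slow")
    return "ok"

print(call_with_retries(flaky))  # → ok, on the third attempt
```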
**That gap is the agent-evaluation signal**: stronger models (GPT-4.1, Claude Opus, Llama-3.1-405B, Nemotron 3 Super) are expected to score significantly higher on the medium/hard tasks, producing meaningful score variance across the rubric.

The inference script uses a flat (system + one user turn) prompting strategy so the context stays under ~1.5k tokens per call, letting small models (≤8B parameters) complete the full trajectory without hitting context-length errors.

## Project Layout

```
.
├── openenv.yaml                  # OpenEnv manifest (spec_version: 1)
├── pyproject.toml                # project + [project.scripts] server entry
├── uv.lock                       # locked deps (required by openenv validate)
├── Dockerfile                    # python:3.10-slim + uvicorn
├── .dockerignore
├── README.md                     # this file
├── requirements.txt              # pip-style deps mirroring pyproject.toml
├── inference.py                  # baseline agent (LLM + heuristic fallback)
├── models.py                     # OpenEnv-typed Action / Observation / State
├── server/
│   ├── __init__.py
│   ├── app.py                    # create_app(...) + main()
│   └── devops_environment.py     # Environment subclass wrapping the inner env
├── env/
│   ├── env.py                    # core stateful env (reset/step/state)
│   ├── grader.py                 # dense reward computation
│   ├── models.py                 # internal action/observation
│   └── tasks/
│       ├── base_task.py
│       ├── task_registry.py
│       ├── easy_port_misconfiguration.py
│       ├── medium_database_connection.py
│       ├── medium_kubernetes_crashloop.py  # ← 4th task
│       └── hard_microservice_cascade.py
├── tests/                        # 24 unit tests
│   ├── test_tasks.py
│   └── test_env.py
└── scripts/
    └── pre_validate.sh           # local pre-submission validator
```

## License

MIT