GitHub: MohitSalvi16/devops-incident-response-openenv
An OpenEnv-based training and evaluation environment for AI agents, targeting realistic incident-response problems in DevOps and SRE scenarios.
---
title: DevOps Incident Response
emoji: 🚨
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv SRE incident-response environment for AI agents
tags:
- openenv
- rl
- agents
- devops
- sre
- incident-response
---
# DevOps Incident Response — OpenEnv Environment
[OpenEnv](https://github.com/meta-pytorch/OpenEnv)
## Why This Environment Exists
Production incidents cost companies an average of **$5,600 per minute** of downtime (Gartner). On-call engineers must rapidly context-switch, read dense logs, correlate failures across services, and apply precise fixes — often at 3 AM. This environment provides a realistic, deterministic testbed for evaluating whether AI agents can handle that workflow:
- **Log analysis** across multiple services
- **Root-cause diagnosis** under ambiguity
- **Code & configuration repair** with grader-verified fixes
- **Multi-service correlation** for cascading failures
- **Recovery verification** via simulated shell commands
This is a domain we have not seen modelled in any existing OpenEnv environment, and one with immediate value for the RL / agent community: training agents on incident response directly reduces mean-time-to-resolution (MTTR) and improves system reliability.
## Quick Start
### Run the Server (Local)
```
pip install -r requirements.txt
python -m server.app # listens on 0.0.0.0:${PORT:-7860}
```
### Run the Server (Docker)
```
docker build -t devops-incident-env .
docker run -p 7860:7860 -e PORT=7860 devops-incident-env
curl http://localhost:7860/health # → {"status":"healthy"}
curl -X POST -H 'Content-Type: application/json' \
-d '{}' http://localhost:7860/reset # → initial observation
```
The Dockerfile honours `$PORT` (Hugging Face Spaces injects it), so the same image runs unchanged on HF Spaces and locally.
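That port-resolution pattern can be sketched in a few lines (a sketch only; the actual entrypoint lives in `server/app.py` and may name things differently):

```python
import os

def resolve_port(default: int = 7860) -> int:
    """Prefer $PORT (injected by Hugging Face Spaces), else fall back to 7860."""
    return int(os.environ.get("PORT", default))
```

Because the lookup happens at startup rather than bake time, the same image binds correctly wherever it runs.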
### Run the Baseline Agent
```
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
export HF_TOKEN=sk-...
python inference.py
```
If `HF_TOKEN` / `OPENAI_API_KEY` are unset, `inference.py` falls back to a deterministic heuristic baseline so the script always produces a reproducible score.
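The fallback decision can be sketched as follows (hypothetical helper; the real selection logic is inside `inference.py`):

```python
import os

def select_agent() -> str:
    """Return "llm" when an API key is available, otherwise "heuristic"."""
    if os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY"):
        return "llm"
    return "heuristic"
```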
### Run the Local Pre-submission Validator
```
bash scripts/pre_validate.sh # full
SKIP_DOCKER=1 bash scripts/pre_validate.sh # skip docker build
```
## OpenEnv Contract
This environment uses `openenv.core.env_server.http_server.create_app` so all standard endpoints are wired automatically:
| Method | Path | Purpose |
|--------|----------------|-----------------------------------------------|
| GET | `/health` | `{"status":"healthy"}` for liveness probes |
| GET | `/metadata` | Environment name + description + version |
| GET | `/schema` | JSON schemas for `action`, `observation`, `state` |
| GET | `/openapi.json`| OpenAPI 3.x spec (used by `openenv validate`) |
| POST | `/reset` | Reset and return initial observation |
| POST | `/step` | Execute an action, return observation |
| GET | `/state` | Current `episode_id` + `step_count` |
| POST | `/mcp` | JSON-RPC MCP endpoint |
| WS | `/ws` | Persistent WebSocket session |
`openenv validate` passes:
```
[OK] Meta_Hackathon: Ready for multi-mode deployment
```
## Action Space
The agent issues a `DevOpsAction` (subclass of `openenv.core.env_server.types.Action`) per step:
| `action_type` | `target` | `content` |
|----------------|--------------------------|---------------------------------------------|
| `read_log` | log filename | _empty_ |
| `read_file` | file path | _empty_ |
| `diagnose` | _empty_ | root-cause hypothesis |
| `edit_file` | file path | **complete** replacement file contents |
| `run_command` | _empty_ | shell command (e.g. `nginx -s reload`) |
| `submit_fix` | _empty_ | summary of what was fixed |
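As a sketch, a `read_log` action serialised for `POST /step` might look like this (the canonical `DevOpsAction` lives in `models.py`; the dataclass and the log filename here are illustrative assumptions):

```python
from dataclasses import dataclass, asdict

@dataclass
class DevOpsAction:
    """Mirror of the fields in the table above (sketch, not the real class)."""
    action_type: str
    target: str = ""
    content: str = ""

# Serialise a read_log action into a JSON-ready /step request body.
payload = {"action": asdict(DevOpsAction(action_type="read_log", target="nginx_error.log"))}
```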
## Observation Space
`DevOpsObservation` (subclass of `openenv.core.env_server.types.Observation`):
| Field | Type | Description |
|----------------------|------------------|----------------------------------------------|
| `step` | `int` | Current step number |
| `max_steps` | `int` | Episode step limit |
| `task_id` | `str` | Active task identifier |
| `task_description` | `str` | Human-readable objective |
| `alert_message` | `str` | PagerDuty-style alert (first step only) |
| `logs` | `dict[str,str]` | Log file contents keyed by filename |
| `files` | `dict[str,str]` | Source/config files keyed by path |
| `command_output` | `str` | Output of last `run_command` |
| `system_status` | `str` | `up` / `degraded` / `down` |
| `diagnosis_feedback` | `str` | Feedback on the last `diagnose` action |
| `error` | `str` | Error if the last action was invalid |
| `cumulative_reward` | `float` | Running total reward |
| `final_score` | `float` | Normalised [0,1] score (set when `done`) |
| `done` | `bool` | Episode terminated |
| `reward` | `float` | Reward for this step |
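A client can build its episode loop around `done` and `final_score`; a minimal sketch, assuming `/step` returns the table's fields as a JSON dict:

```python
def describe(obs: dict) -> str:
    """Render a one-line status from a DevOpsObservation-shaped dict."""
    if obs.get("done"):
        return f"episode over, final_score={obs['final_score']:.2f}"
    return f"step {obs['step']}/{obs['max_steps']}, system {obs['system_status']}"
```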
## Reward Function
Dense rewards are emitted on every step (no sparse end-of-episode reward).
| Component | Reward | Trigger |
|-----------------------|-----------------|---------------------------------------------|
| Information gathering | +0.02 | First read of each log/file/command |
| Partial diagnosis | +0.05 – 0.15 | Identifies some root causes |
| Full diagnosis | +0.20 – 0.30 | Identifies all root causes |
| Partial fix | +0.05 – 0.25 | Fixes some files |
| Full fix | +0.40 – 0.50 | All files patched correctly |
| Successful resolution | +0.20 | `submit_fix` with `system_status == "up"` |
| Efficiency bonus | +0.00 – 0.10 | Fewer steps → higher bonus |
| Repeated action | −0.05 × count | Escalating penalty for loops |
| Wrong submit | −0.10 × count | `submit_fix` while system still down |
| **Naked submit** | **−0.20 × count** | `submit_fix` without any prior `edit_file` (anti-exploit) |
| Episode timeout | −0.10 | Max steps hit with system still down |
The grader normalises the cumulative reward into a `final_score ∈ [0, 1]`. **No grader returns a constant score** — every task produces meaningfully varying scores depending on agent quality (verified by the `tests/test_tasks.py::test_grader_score_bounds` parametrised test).
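The escalating-penalty and normalisation shapes described above can be sketched like so (the real formulas live in `env/grader.py`; the `max_reward` cap is an assumed illustration value):

```python
def repeat_penalty(count: int) -> float:
    """Escalating penalty for re-issuing the same action: -0.05 x count."""
    return -0.05 * count

def to_final_score(cumulative: float, max_reward: float = 2.0) -> float:
    """Clamp the cumulative reward into the [0, 1] final_score range."""
    return max(0.0, min(1.0, cumulative / max_reward))
```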
## Tasks
| ID | Difficulty | Max steps | Root causes |
|---------------------------------|------------|-----------|-------------|
| `easy_port_misconfiguration` | easy | 15 | 1 |
| `medium_database_connection` | medium | 20 | 2 |
| `medium_kubernetes_crashloop` | medium | 22 | 2 |
| `hard_microservice_cascade` | hard | 25 | 5 |
### Easy — Port Misconfiguration
Nginx is configured to `listen 8080;` but the load balancer expects port 80. Port 8080 is occupied by a monitoring agent. **One** root cause; mostly tests log-reading and basic config editing.
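A minimal sketch of the repair an `edit_file` action needs to make (the config content here is illustrative, not the actual task fixture):

```python
# Illustrative broken config: nginx listens on 8080 while the LB expects 80.
broken_conf = """server {
    listen 8080;
    server_name payments.internal;
}
"""

# edit_file requires the complete replacement file, so patch and send the whole thing.
fixed_conf = broken_conf.replace("listen 8080;", "listen 80;")
```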
### Medium — Database Connection Pool Exhaustion
The application returns 503s under load. **Two** interacting failures: the connection pool is sized at 2 (must be ≥ 10), AND the application code never releases connections back to the pool. The agent must edit BOTH `database.yml` AND `user_service.py`.
### Medium — Kubernetes CrashLoopBackOff
All 3 replicas of `payment-service` are in CrashLoopBackOff. **Two** interacting failures: (1) the Secret `payment-secrets` is missing the `database_url` key, AND (2) the liveness probe path is `/health` but the service only exposes `/healthz`. The agent must inspect kubectl events, the Deployment YAML, and the Secret YAML, and patch BOTH.
### Hard — Microservice Cascade Failure
A SEV-1 cascading outage across `api-gateway`, `order-service`, `inventory-service`, and Redis. **Five** interacting root causes: disabled circuit breaker, retry storm, missing inter-service timeout, unhandled `WatchError` on Redis, and Redis OOM with `noeviction` policy. Genuinely challenges frontier models — partial credit is awarded per fix.
## Benchmark Scores
Reproducible from `python inference.py`:
| Task | Heuristic agent | `mistral-small-latest`¹ |
|-----------------------------------|-----------------|--------------------------|
| `easy_port_misconfiguration` | **0.99** | **0.99** (10 steps) |
| `medium_database_connection` | 0.01 | 0.01 |
| `medium_kubernetes_crashloop` | **0.99** | **0.99** (11 steps) |
| `hard_microservice_cascade` | 0.01 | 0.01 |
| **Average** | **0.50** | **0.50** |
| **Tasks passed (≥0.5)** | **2/4** | **2/4** |
¹ Representative run against `mistral-small-latest` via the Mistral La Plateforme API, `temperature=0`, on 2026-04-12. Hosted LLM APIs exhibit residual non-determinism even at zero temperature; across 3 independent runs the average score for this model ranged **0.23 – 0.50**, which is the exact kind of variance Phase 2 evaluation looks for (constant scores across runs are a disqualification criterion). The heuristic baseline is fully deterministic. Both "medium-DB" and "hard-cascade" remain unsolved by a small open model — they require multi-file coordinated edits and cascading root-cause analysis that genuinely challenges frontier models. **That gap is the agent-evaluation signal**: stronger models (GPT-4.1, Claude Opus, Llama-3.1-405B, Nemotron 3 Super) are expected to score significantly higher on the medium/hard tasks, producing meaningful score variance across the rubric.
The inference script uses a flat (system + one user turn) prompting strategy so context stays under ~1.5k tokens per call, enabling small models (≤8B parameters) to complete the full trajectory without hitting context-length errors.
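One way to hold such a budget is to cap each excerpt before it enters the prompt; a sketch under the common ~4 chars/token heuristic, not necessarily how `inference.py` does it:

```python
def cap_excerpt(text: str, max_chars: int = 6000) -> str:
    """Truncate a log/file excerpt to roughly 1.5k tokens (~4 chars per token)."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n...[truncated]"
```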
## Project Layout
```
.
├── openenv.yaml # OpenEnv manifest (spec_version: 1)
├── pyproject.toml # project + [project.scripts] server entry
├── uv.lock # locked deps (required by openenv validate)
├── Dockerfile # python:3.10-slim + uvicorn
├── .dockerignore
├── README.md # this file
├── requirements.txt # pip-style deps mirroring pyproject.toml
├── inference.py # baseline agent (LLM + heuristic fallback)
├── models.py # OpenEnv-typed Action / Observation / State
├── server/
│ ├── __init__.py
│ ├── app.py # create_app(...) + main()
│ └── devops_environment.py # Environment subclass wrapping the inner env
├── env/
│ ├── env.py # core stateful env (reset/step/state)
│ ├── grader.py # dense reward computation
│ ├── models.py # internal action/observation
│ └── tasks/
│ ├── base_task.py
│ ├── task_registry.py
│ ├── easy_port_misconfiguration.py
│ ├── medium_database_connection.py
│ ├── medium_kubernetes_crashloop.py # ← 4th task
│ └── hard_microservice_cascade.py
├── tests/ # 24 unit tests
│ ├── test_tasks.py
│ └── test_env.py
└── scripts/
└── pre_validate.sh # local pre-submission validator
```
## License
MIT