Siddhanthguptaa/Sre_Gym
GitHub: Siddhanthguptaa/Sre_Gym
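A deterministic reinforcement learning environment for automated production incident response. It simulates cascading failures across microservices to train and evaluate agents on root-cause diagnosis and postmortem writing.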
title: SRE Gym
emoji: "🚨"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
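- `query_logs`: fetch the logs of a given service (+0.02 if it is the faulty service)
- `query_metrics`: fetch the live metrics of a service (+0.02 if it is the faulty service)
- `submit_diagnosis`: name the root service and its failure mode (must be the exact canonical name)
- `apply_remediation`: run a remediation playbook (only effective on the root service)
- `escalate`: escalate to the on-call team
- `submit_postmortem`: provide the timeline, root cause, affected services, and prevention steps
- `close_incident`: end the episode early (forfeits 50% of the score)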
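The reward rules stated in the action list can be sketched as below. This is a minimal illustration, not the repository's code: the names `QUERY_BONUS`, `EARLY_CLOSE_FACTOR`, `step_reward`, and `final_score` are hypothetical, and the real logic in `server/sre_environment.py` may be structured differently.

```python
# Hypothetical sketch of the reward rules described in the action list.
# All names here are assumptions made for illustration.

QUERY_BONUS = 0.02         # bonus for inspecting the faulty service
EARLY_CLOSE_FACTOR = 0.5   # close_incident forfeits 50% of the score


def step_reward(action: str, target: str, faulty_service: str) -> float:
    """Return the shaping reward for a single investigative action."""
    if action in ("query_logs", "query_metrics") and target == faulty_service:
        return QUERY_BONUS
    return 0.0


def final_score(score: float, closed_early: bool) -> float:
    """Apply the early-close penalty to an episode score."""
    return score * EARLY_CLOSE_FACTOR if closed_early else score
```

Under these assumptions, querying a healthy service earns nothing, so an agent is nudged toward the faulty service without being told which one it is.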
### Gymnasium interface (for RL training)
```
from gym_wrapper import SREGymEnv        # Gymnasium wrapper shipped in this repo
from stable_baselines3 import PPO

# A fixed task ID and seed keep episodes reproducible.
env = SREGymEnv("task_easy_1", seed=42)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```
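To inspect a single episode by hand rather than through PPO, the standard Gymnasium loop applies: `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. The sketch below uses a stand-in class, `StubEnv` (hypothetical, with made-up rewards), so that it runs without the repository installed. Assuming `SREGymEnv` follows the Gymnasium API, it could be passed to `rollout` in its place.

```python
class StubEnv:
    """Stand-in exposing the Gymnasium reset/step signatures (not SREGymEnv)."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return 0, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        reward = 0.02 if action == 1 else 0.0   # made-up reward for the demo
        terminated = False                      # no terminal state in the stub
        truncated = self.t >= self.max_steps    # stop at the step limit
        return self.t, reward, terminated, truncated, {}


def rollout(env, policy, seed=42):
    """Run one episode with `policy(obs) -> action` and return the total reward."""
    obs, info = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total
```

For example, `rollout(StubEnv(), lambda obs: 1)` plays five steps with a constant action and sums the rewards.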
## Architecture
```
sre_gym/
├── server/
│ ├── app.py — FastAPI + dashboard serving
│ ├── sre_environment.py — Core step/reset/state machine
│ ├── graders.py — Deterministic scoring (0.92-threshold fuzzy match)
│ ├── scenario_generator.py — Procedural scenario generation
│ └── static/
│ └── dashboard.html — Live incident command center
├── scenarios/ — 12 curated JSON scenarios
├── models.py — Pydantic Action/Observation/State
├── gym_wrapper.py — Gymnasium env (stable-baselines3 compatible)
├── train_ppo.py — PPO training + reward curve generation
├── agent_graph.py — Graph-aware BFS agent (topology traversal)
├── inference.py — LLM baseline script
├── baseline_deterministic.py — Rule-based baseline (no API keys)
├── client.py — Typed async client
├── openenv.yaml — OpenEnv specification
├── Dockerfile — Multi-stage Docker build
└── pyproject.toml — Hatchling build config
```
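`agent_graph.py` is described above as a graph-aware BFS agent. A minimal sketch of breadth-first traversal over a service dependency graph follows. The topology, the service names, and the `bfs_order` helper are illustrative assumptions and do not come from the repository.

```python
from collections import deque


def bfs_order(graph: dict[str, list[str]], start: str) -> list[str]:
    """Return services in breadth-first order, starting from `start`."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for dep in graph.get(node, []):
            if dep not in seen:      # visit each service only once
                seen.add(dep)
                queue.append(dep)
    return order


# Hypothetical topology: each service maps to the services it depends on.
deps = {
    "gateway": ["auth", "orders"],
    "auth": ["user-db"],
    "orders": ["payments", "user-db"],
    "payments": [],
    "user-db": [],
}

print(bfs_order(deps, "gateway"))
# ['gateway', 'auth', 'orders', 'user-db', 'payments']
```

Visiting services level by level means an agent checks the direct dependencies of the alerting service before going deeper, which suits cascades where the root cause sits a few hops downstream.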
### Scoring architecture
```
submit_postmortem()
│
├── grade_easy() → health(20%) + diagnosis(30%) + remediation(25%) + MTTR(25%)
├── grade_medium() → health(15%) + diagnosis(30%) + remediation(15%) + MTTR(40%)
│ red_herring_penalty(−40%)
└── grade_hard() → health(10%) + diagnosis(10%) + postmortem(80%)
[root_cause(25%) + affected(15%) + timeline(15%) + prevention(25%)]
│
└── compute_mttr_bonus() → score = base×0.80 + efficiency×0.20
(speed cannot compensate for wrong answer)
```
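The weights in the diagram can be read as plain weighted sums. The sketch below copies the percentages from the diagram. The function names, the assumption that each component is a value in [0, 1], and the way the pieces are combined are all assumptions. How `red_herring_penalty` is applied is not specified above, so it is left out.

```python
# Weights copied from the diagram; everything else is an illustrative assumption.
WEIGHTS = {
    "easy":   {"health": 0.20, "diagnosis": 0.30, "remediation": 0.25, "mttr": 0.25},
    "medium": {"health": 0.15, "diagnosis": 0.30, "remediation": 0.15, "mttr": 0.40},
    "hard":   {"health": 0.10, "diagnosis": 0.10, "postmortem": 0.80},
}


def weighted_score(tier: str, components: dict[str, float]) -> float:
    """Combine per-component scores (each assumed in [0, 1]) for a tier."""
    return sum(w * components.get(name, 0.0) for name, w in WEIGHTS[tier].items())


def with_mttr_bonus(base: float, efficiency: float) -> float:
    """score = base x 0.80 + efficiency x 0.20, as written in the diagram."""
    return base * 0.80 + efficiency * 0.20
```

With this formula a wrong answer (`base = 0`) resolved at perfect speed (`efficiency = 1`) scores at most 0.20, which matches the note that speed cannot compensate for a wrong answer.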
## Running tests
```
pip install -e ".[dev]"
pytest tests/ -v
python agent_graph.py # integration smoke test
python gym_wrapper.py # gymnasium spec check
```
## License
MIT
🚨 SRE-Gym
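A deterministic RL environment for production incident response, with dependency-graph traversal, cascading-failure simulation, and adversarial distractors. The project bills itself as the only SRE benchmark with a `submit_postmortem` mechanism that does not rely on an LLM judge.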
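Grading free text without an LLM judge needs a deterministic similarity rule. The architecture listing mentions a 0.92-threshold fuzzy match in `graders.py`. The sketch below shows one way such a check could work, using `difflib.SequenceMatcher` from the standard library. The choice of similarity measure, the normalization step, and the function names are assumptions, since the repository's actual metric is not stated here.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.92  # threshold quoted in the architecture listing


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting does not affect the match."""
    return " ".join(text.lower().split())


def fuzzy_match(submitted: str, expected: str, threshold: float = THRESHOLD) -> bool:
    """Return True when the two strings are similar enough (assumed metric)."""
    ratio = SequenceMatcher(None, normalize(submitted), normalize(expected)).ratio()
    return ratio >= threshold
```

Because the ratio is computed purely from the two strings, the same submission always receives the same verdict, which is the property a deterministic grader needs.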