devhemanthac-commits/Incident-Response-Triage-OpenENV-Environment


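A simulated test platform for evaluating how well AI agents triage production incidents, with 18 failure scenarios of increasing difficulty and a fine-grained scoring scheme.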


Incident Response Triage

An AI evaluation environment in which the agent acts as the on-call SRE and triages production incidents in real time.

Features • Quick Start • How It Works • Scenarios • Grading Rubric • API • Baseline Agent

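## Why This Project

SRE/DevOps teams handle **thousands of alerts** every day. The core skill sounds simple but is very hard in practice: given raw metrics, logs, and context, an engineer under time pressure must classify severity correctly, route to the right team, decide whether to escalate, and explain the reasoning clearly. This environment simulates exactly that decision loop. An agent that performs well here can be applied directly to production **on-call automation tools**.

```
Alert Fires ──> Agent Observes ──> Agent Triages ──> Environment Grades
                      ^                                      |
                      |     (max 2 attempts per episode)     |
                      └─────────── Feedback Loop ────────────┘
```

## Core Features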

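### Cascading Failures

Mis-triaging a hard scenario triggers follow-on cascading alerts, just as in production. Database connection pool exhaustion crashes the payment API. A DNS failure takes down the whole platform. Each cascade applies a `-0.20` penalty, so wrong routing has real consequences.

### Time-Series Metrics

Every metric is a 5-point time series (the past 5 minutes), not a single snapshot. The agent must tell a trend (memory leak: 60% -> 78%) from a spike (a momentary CPU burst). Reasoning that references the trend earns a `+0.05` bonus.

### False-Positive Detection

Three "cry wolf" scenarios: a nightly backup, a staging load test, and a canary deployment. In each, the high metrics are expected behavior. Correctly triaging them as low priority despite the alarming numbers earns a `+0.10` bonus.

### Specialist Team Routing

Every scenario has one optimal team plus alternate teams. Routing to an alternate team earns partial credit (`0.15` vs `0.30`); a wrong team scores `0.0`. Picking the right team matters because wrong routing wastes response time.

### Multi-Step Episodes with Adaptation

Each episode allows at most 2 attempts. After step 1, the agent sees a consequence observation (for example, "The escalated team found the database at its connection limit"). An agent that adjusts its answer based on this feedback earns a `+0.05` bonus.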
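To make the trend-versus-spike distinction concrete, here is a minimal sketch of how an agent might label a 5-point series before writing its reasoning. The helper name `classify_series`, the three labels, and the 10-point threshold are illustrative assumptions, not part of the environment or its grader.

```python
from typing import Sequence


def classify_series(points: Sequence[float], threshold: float = 10.0) -> str:
    """Label a short metric series as 'trend', 'spike', or 'flat'.

    Hypothetical helper: the threshold and labels are assumptions made
    for illustration, not values used by the grading engine.
    """
    deltas = [later - earlier for earlier, later in zip(points, points[1:])]
    net_rise = points[-1] - points[0]

    # Sustained climb: never decreases and ends well above where it started.
    if all(d >= 0 for d in deltas) and net_rise >= threshold:
        return "trend"

    # Transient burst: the peak stands well above both ends of the window.
    peak = max(points)
    if peak - points[0] >= threshold and peak - points[-1] >= threshold:
        return "spike"

    return "flat"


memory_leak = [60.0, 62.0, 65.0, 70.0, 78.0]  # the 60% -> 78% example above
cpu_burst = [40.0, 42.0, 95.0, 44.0, 41.0]    # made-up momentary CPU burst

print(classify_series(memory_leak))  # trend
print(classify_series(cpu_burst))    # spike
```

A label like this could then be quoted in the `reasoning` field, since reasoning that references trends is what earns the bonus.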

## Quick Start

### Local Setup

```
# Clone the repository
git clone https://github.com/devhemanthac-commits/Incident-Response-Triage-OpenENV-Environment.git
cd Incident-Response-Triage-OpenENV-Environment

# Install dependencies
pip install -r requirements.txt

# Start the server
python app.py
# Server runs at http://localhost:5000
```

### Docker

```
docker build -t incident-triage .
docker run --rm -p 5000:5000 incident-triage
```

### Verify It Works

```
# Health check
curl http://localhost:5000/health
# {"status": "ok"}

# Reset to a scenario
curl -s -X POST http://localhost:5000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy-1"}' | python -m json.tool

# Submit a triage decision
curl -s -X POST http://localhost:5000/step \
  -H "Content-Type: application/json" \
  -d '{
    "severity": "P2",
    "team": "database",
    "escalate": false,
    "confidence": 0.9,
    "reasoning": "Disk at 95% on postgres-primary, trending upward from 88%"
  }' | python -m json.tool
```

## How It Works

```
┌─────────────────────────────────────────────────────────────────────┐
│                          EPISODE LIFECYCLE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  POST /reset {"task_id": "hard-1"}                                  │
│                 │                                                   │
│                 ▼                                                   │
│  ┌──────────────────────────────────────────┐                       │
│  │  OBSERVATION                             │                       │
│  │  - alert_type, service_name              │                       │
│  │  - metrics (5-point time series)         │                       │
│  │  - logs_snippet (5-10 lines)             │                       │
│  │  - related_alerts, dependencies          │                       │
│  │  - recent_deployments                    │                       │
│  └──────────────┬───────────────────────────┘                       │
│                 │                                                   │
│                 ▼                                                   │
│  ┌──────────────────────────────────────────┐                       │
│  │  AGENT DECISION                          │                       │
│  │  POST /step {                            │                       │
│  │    severity, team, escalate,             │                       │
│  │    confidence, reasoning                 │                       │
│  │  }                                       │                       │
│  └──────────────┬───────────────────────────┘                       │
│                 │                                                   │
│                 ▼                                                   │
│  ┌──────────────────────────────────────────┐                       │
│  │  GRADING ENGINE                          │                       │
│  │  severity (0.35) + routing (0.30)        │                       │
│  │  escalation(0.15) + reasoning(0.10)      │                       │
│  │  calibration(0.10)                       │                       │
│  │  + trend bonus (+0.05)                   │                       │
│  │  + FP bonus (+0.10)                      │                       │
│  │  + adaptation (+0.05)                    │                       │
│  │  - cascade penalty (-0.20)               │                       │
│  └──────────────┬───────────────────────────┘                       │
│                 │                                                   │
│         ┌───────┴───────┐                                           │
│         │               │                                           │
│   score >= 0.95   score < 0.95                                      │
│   OR attempt=2    AND attempt=1                                     │
│         │               │                                           │
│         ▼               ▼                                           │
│       DONE        RETRY (step 2)                                    │
│                   + feedback                                        │
│                   + consequence obs                                 │
│                   + cascading alerts                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Scenarios

### 18 scenarios across 4 difficulty tiers

| Tier | Count | Target Score | Design Intent |
|---|---|---|---|
| Easy | 5 | ~0.90 | Clear single-signal alerts |
| Medium | 5 | ~0.75 | Ambiguous context, deploy red herrings |
| Hard | 5 | ~0.65 | Cascading failures, misleading correlations |
| False-positive | 3 | ~0.85 | Expected behavior disguised as alerts |

Easy scenarios (5)

| ID | Incident | Alert Type | Service | Expected |
|---|---|---|---|---|
| `easy-1` | INC-0001 | `disk_full` | postgres-primary | P2 / database |
| `easy-2` | INC-0002 | `5xx_errors` | payment-api | P0 / backend |
| `easy-3` | INC-0003 | `ssl_cert_expiry` | cdn-edge | P1 / infra |
| `easy-4` | INC-0004 | `cpu_spike` | user-auth | P3 / backend |
| `easy-5` | INC-0005 | `brute_force` | user-auth | P1 / security |

Medium scenarios (5)

| ID | Incident | Alert Type | Service | Expected |
|---|---|---|---|---|
| `medium-1` | INC-0006 | `5xx_errors` | checkout-service | P2 / backend |
| `medium-2` | INC-0007 | `memory_leak` | recommendation-engine | P3 / backend |
| `medium-3` | INC-0008 | `slow_queries` | postgres-primary | P2 / database |
| `medium-4` | INC-0009 | `packet_loss` | service-mesh | P2 / network |
| `medium-5` | INC-0010 | `stale_cache` | cdn-edge | P3 / frontend |

Hard scenarios (5) -- with cascading failures

| ID | Incident | Root Cause | Cascade Effect |
|---|---|---|---|
| `hard-1` | INC-0011 | DB pool exhaustion | Payment API crash |
| `hard-2` | INC-0012 | DNS resolution failure | Total platform outage |
| `hard-3` | INC-0013 | Crypto-mining malware | Control plane compromise |
| `hard-4` | INC-0014 | Shared library memory leak | Auth service down |
| `hard-5` | INC-0015 | Upstream Stripe outage | *(no cascade)* |

False-positive scenarios (3)

| ID | Incident | Apparent Symptom | Actual Cause |
|---|---|---|---|
| `false-positive-1` | INC-0016 | Memory spike to 89% | Nightly backup job (finishes in 5 min) |
| `false-positive-2` | INC-0017 | CPU at 90% | Staging load test (non-production) |
| `false-positive-3` | INC-0018 | 5% error rate | Canary deploy on 1% traffic (auto-rollback ready) |
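The scenario IDs above follow a regular `tier-n` pattern, so the reset/step lifecycle from How It Works can be driven by a short client loop. The sketch below is one possible client, not the repository's `inference.py`: the names `task_ids`, `http_post`, `run_episode`, and `cautious_policy` are hypothetical, and the transport is injected so the loop can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable, Dict, List

Post = Callable[[str, dict], dict]


def task_ids() -> List[str]:
    """Build the 18 scenario IDs from the tier names and counts listed above."""
    tiers = (("easy", 5), ("medium", 5), ("hard", 5), ("false-positive", 3))
    return [f"{tier}-{n}" for tier, count in tiers for n in range(1, count + 1)]


def http_post(base_url: str) -> Post:
    """Return a JSON POST helper bound to one server (standard library only)."""
    def post(path: str, payload: dict) -> dict:
        request = urllib.request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return post


def run_episode(post: Post, task_id: str, policy: Callable[[dict], dict],
                max_attempts: int = 2) -> float:
    """Reset to one scenario, then step until done or attempts run out."""
    observation = post("/reset", {"task_id": task_id})
    result: Dict = {}
    for _ in range(max_attempts):
        result = post("/step", policy(observation))
        if result["done"]:
            break
        # The retry observation carries the feedback from the first attempt.
        observation = result["observation"]
    return result["reward"]["score"]


def cautious_policy(observation: dict) -> dict:
    """Placeholder policy (assumption): always files a P2 for the backend team."""
    alert = observation.get("alert_type", "unknown")
    return {
        "severity": "P2",
        "team": "backend",
        "escalate": False,
        "confidence": 0.5,
        "reasoning": f"Placeholder triage for a {alert} alert; no trend analysis yet",
    }


# Against a running server (not executed here):
#   post = http_post("http://localhost:5000")
#   for task_id in task_ids():
#       print(task_id, run_episode(post, task_id, cautious_policy))
```

Passing the transport in as a function keeps the loop testable with a stub; a real agent would replace `cautious_policy` with a call to its model.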

## Grading Rubric

### Base Score (max 1.00)

```
Component      Max    Rule
─────────────────────────────────────────────────────────
Severity       0.35   exact = 0.35 | off-by-one = 0.15 | else = 0.0
Routing        0.30   optimal team = 0.30 | alt team = 0.15 | else = 0.0
Escalation     0.15   exact match only
Reasoning      0.10   0.05 (length > 30 chars) + 0.05 (key indicators)
Calibration    0.10   high confidence when correct, low when wrong
─────────────────────────────────────────────────────────
Base Total     1.00
```

### Bonuses and Penalties

```
Modifier                Value    Condition
─────────────────────────────────────────────────────────
Trend analysis bonus    +0.05    Reasoning references metric trends
False positive bonus    +0.10    Correct P3/P4 on false-positive scenarios
Adaptation bonus        +0.05    Improvement from step 1 to step 2
Cascade penalty         -0.20    Per cascading failure triggered
─────────────────────────────────────────────────────────
Final score capped to [0.0, 1.0]
```

## Observation Schema

The agent receives this observation at every step:

```
{
  "task_id": "hard-1",
  "incident_id": "INC-0011",
  "step_number": 0,
  "alert_type": "connection_pool_exhaustion",
  "service_name": "order-service",
  "error_message": "FATAL: too many clients already (max 100)",
  "metrics": {
    "cpu_percent": [45.0, 52.0, 68.0, 82.0, 95.0],
    "memory_percent": [60.0, 62.0, 65.0, 70.0, 78.0],
    "error_rate": [0.1, 0.5, 2.0, 8.0, 15.0],
    "latency_p99_ms": [120, 180, 450, 1200, 5000],
    "disk_percent": [45.0, 45.0, 45.0, 45.0, 45.0],
    "connections_active": [80, 85, 92, 98, 100],
    "timestamps": ["T-4min", "T-3min", "T-2min", "T-1min", "T-now"]
  },
  "logs_snippet": "2024-01-17 09:15:02 ERROR order-service: connection pool exhausted...",
  "related_alerts": ["payment-api latency > 5s", "order-service error rate > 10%"],
  "service_dependencies": ["payment-api", "inventory-service", "postgres-primary"],
  "recent_deployments": [],
  "time_of_day": "2024-01-17T09:15:30Z",
  "feedback": "",
  "new_alert": ""
}
```

## Action Schema

The agent must return:

```
{
  "severity": "P0|P1|P2|P3|P4",
  "team": "backend|frontend|infra|database|security|network",
  "escalate": true,
  "confidence": 0.85,
  "reasoning": "Connection pool exhausted at 100/100, error rate spiking 0.1% -> 15%..."
}
```

### Severity Guide

| Level | Meaning | Examples |
|---|---|---|
| **P0** | Full outage / active security breach | 100% error rate, crypto-mining attack |
| **P1** | Severe degradation / data at risk | DB connection pool exhaustion, SSL certificate expiry |
| **P2** | Significant impact / SLO breach | Disk full, post-deploy errors |
| **P3** | Minor degradation / warning threshold | Memory leak (early), single-pod issue |
| **P4** | Informational / expected behavior | Nightly backup, staging load test |

## API Reference

### `GET /health`

```
curl http://localhost:5000/health
```

```
{"status": "ok"}
```

### `POST /reset`

Initializes the environment or resets it to a specific scenario.

```
curl -X POST http://localhost:5000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy-1", "seed": 42}'
```

| Parameter | Type | Required | Description |
|---|---|---|---|
| `task_id` | string | No | Specific scenario (e.g. `"hard-3"`). Chosen at random if omitted. |
| `seed` | int | No | Random seed for reproducibility. Default: `42`. |
| `session_id` | string | No | Isolates concurrent sessions. Default: `"default"`. |

**Returns:** `TriageObservation` (the ground-truth answer is not exposed)

### `POST /step`

Submits a triage action and receives graded feedback.

```
curl -X POST http://localhost:5000/step \
  -H "Content-Type: application/json" \
  -d '{
    "severity": "P2",
    "team": "database",
    "escalate": false,
    "confidence": 0.9,
    "reasoning": "Disk at 95%, trending upward from 88% over 5 minutes"
  }'
```

| Parameter | Type | Required | Description |
|---|---|---|---|
| `severity` | string | Yes | `P0` through `P4` |
| `team` | string | Yes | `backend\|frontend\|infra\|database\|security\|network` |
| `escalate` | boolean | Yes | Whether to escalate |
| `confidence` | float | Yes | `0.0` to `1.0` |
| `reasoning` | string | Yes | Explanation of the triage decision |

**Returns:**

```
{
  "observation": { "..." },
  "reward": {
    "score": 0.95,
    "severity_score": 0.35,
    "routing_score": 0.30,
    "escalation_score": 0.15,
    "reasoning_score": 0.10,
    "calibration_score": 0.10,
    "cascade_penalty": 0.0,
    "false_positive_bonus": 0.0,
    "trend_bonus": 0.05,
    "adaptation_bonus": 0.0,
    "feedback": "Perfect triage!"
  },
  "done": true,
  "info": {
    "attempt": 1,
    "cascade_triggered": false,
    "cascade_penalty_total": 0.0
  }
}
```

## Running the Baseline Agent

The bundled `inference.py` supports **OpenAI** and **Google Gemini** through OpenAI-compatible endpoints.

```
# Option 1: OpenAI
export OPENAI_API_KEY=sk-...
export MODEL_NAME=gpt-4o-mini

# Option 2: Google Gemini
export GEMINI_API_KEY=AIzaSy...
# Defaults to gemini-2.5-flash

# Run
export API_BASE_URL=http://localhost:5000
python inference.py
```

**Output protocol:**

```
[START]
[STEP] task_id=easy-1 score=0.950 severity=0.35 routing=0.30 escalation=0.15 cascade=0.00 fp_bonus=0.00 trend=0.05
[STEP] task_id=easy-2 score=0.983 ...
...
[END] overall_avg=0.8234 episodes=18
```

### Mock Mode (no API key required)

```
export MOCK_LLM=true
python inference.py
```

Mock mode uses a seeded RNG (`seed=42`) to make reproducible random decisions, which suits end-to-end pipeline testing without API costs.

## Running the Deterministic Test Agent

`test_agent.py` contains **hard-coded optimal decisions** for all 18 scenarios. No LLM is required.

```
# Start the server first
python app.py &

# Run the test agent
python test_agent.py
```

**Expected output:**

```
====================================================================================================
MOCK AGENT TEST -- 18 SCENARIOS
====================================================================================================
[EASY]
easy-1 INC-0001 score=1.000 PASS sev=0.35 route=0.30 esc=0.15 trend=0.05 fp=0.00
easy-2 INC-0002 score=0.983 PASS ...
GROUP AVG: 0.988
[MEDIUM]
GROUP AVG: 0.975
[HARD]
GROUP AVG: 0.993
[FALSE-POSITIVE]
GROUP AVG: 1.000
====================================================================================================
OVERALL AVG: 0.9877 (18 scenarios)
====================================================================================================
```

## Project Structure

```
.
├── app.py              # Flask REST API server (3 endpoints)
├── environment.py      # Core grading engine with cascading failures
├── models.py           # Pydantic v2 data contracts
├── data.py             # 18 incident scenario definitions
├── inference.py        # LLM baseline agent (OpenAI / Gemini)
├── test_agent.py       # Deterministic mock agent (no LLM)
├── test_verify.py      # Feature verification tests (10 tests)
├── test_api.py         # HTTP API integration tests
├── openenv.yaml        # OpenEnv specification (v2.0.0)
├── requirements.txt    # Python dependencies
├── Dockerfile          # Production container image
├── log.md              # Development & testing log
└── STATUS.md           # Project status report
```

### Architecture

```
        ┌──────────────────┐
        │    Flask API     │
        │     (app.py)     │
        └────────┬─────────┘
                 │
        ┌────────▼─────────┐
        │  IncidentTriage  │
        │    Env Engine    │
        │ (environment.py) │
        └───┬─────────┬────┘
            │         │
   ┌────────▼──┐   ┌──▼────────┐
   │ Pydantic  │   │ Scenario  │
   │  Models   │   │   Data    │
   │(models.py)│   │ (data.py) │
   └───────────┘   └───────────┘
```

## Verification and Testing

```
# 1. Import check
python -c "from models import *; from environment import *; print('imports OK')"

# 2. Feature verification (10 tests)
python test_verify.py

# 3. API integration tests
python app.py &
python test_api.py

# 4. Full 18-scenario baseline
python test_agent.py

# 5. Reproducibility check (run 3 times; scores must match)
python test_agent.py && python test_agent.py && python test_agent.py
# All runs produce identical scores with seed=42
```

## Baseline Results

| Tier | Avg Score | Perfect Scores | Status |
|---|---|---|---|
| Easy (5) | **0.988** | 2/5 | Excellent |
| Medium (5) | **0.975** | 1/5 | Excellent |
| Hard (5) | **0.993** | 3/5 | Near-perfect |
| False-Positive (3) | **1.000** | 3/3 | Perfect |
| **Overall (18)** | **0.9877** | **10/18** | **Production-ready** |

## Tech Stack

| Component | Technology |
|---|---|
| Language | Python 3.11+ |
| Data validation | Pydantic v2 |
| API framework | Flask |
| LLM client | OpenAI SDK (Gemini-compatible) |
| Containerization | Docker (python:3.11-slim) |
| Reproducibility | Deterministic seeding (seed=42) |
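As a closing illustration of the Grading Rubric, the sketch below reproduces the arithmetic for the two components whose rules are fully specified (severity and routing) and for the modifiers. It is not the code in `environment.py`: the function names are hypothetical, and the reasoning and calibration components are passed in as plain numbers because their exact rules are not spelled out in this README.

```python
SEVERITIES = ["P0", "P1", "P2", "P3", "P4"]


def severity_score(predicted: str, expected: str) -> float:
    """0.35 for an exact match, 0.15 when one level off, otherwise 0.0."""
    gap = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return {0: 0.35, 1: 0.15}.get(gap, 0.0)


def routing_score(team: str, optimal: str, alternates: tuple = ()) -> float:
    """0.30 for the optimal team, 0.15 for an alternate, otherwise 0.0."""
    if team == optimal:
        return 0.30
    return 0.15 if team in alternates else 0.0


def final_score(base: float, *, trend: bool = False, false_positive: bool = False,
                adapted: bool = False, cascades: int = 0) -> float:
    """Apply the bonuses and the per-cascade penalty, then clamp to [0.0, 1.0]."""
    total = base
    total += 0.05 if trend else 0.0
    total += 0.10 if false_positive else 0.0
    total += 0.05 if adapted else 0.0
    total -= 0.20 * cascades
    return max(0.0, min(1.0, total))


# Worked example on a hypothetical scenario whose expected answer is
# P1 / database: severity one level off, right team, correct escalation
# (0.15), full reasoning credit (0.10), no calibration credit, one cascade.
base = severity_score("P2", "P1") + routing_score("database", "database") + 0.15 + 0.10
print(round(base, 2))                           # 0.7
print(round(final_score(base, cascades=1), 2))  # 0.5
```

The example shows why a single cascade is expensive: it removes more than the partial credit an off-by-one severity still earns.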

_Built for the OpenEnv Competition to test AI agents on real-world SRE incident triage._

