yourcoder0/Incident-Response

GitHub: yourcoder0/Incident-Response

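An AI-driven production incident response simulation environment built on Docker and FastAPI. It trains agents to carry out realistic SRE response workflows under pressure.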

Stars: 0 | Forks: 0

````markdown
---
title: IncidentResponseEnv
emoji: 🚨
colorFrom: red
colorTo: red
sdk: docker
pinned: false
license: apache-2.0
tags:
  - openenv
  - reinforcement-learning
  - incident-response
  - sre
---

# 🚨 IncidentResponseEnv: AI-Driven Production Incident Response

## 🧠 What Makes This Environment Different

IncidentResponseEnv simulates the high-stakes world of **production incident response**, where AI agents must act as on-call SRE engineers, making time-critical decisions under pressure.

Unlike toy environments, agents must:

- **Prioritise correctly** under cascading failures (wrong order = real damage)
- **Follow escalation protocols** (wrong team = wasted time)
- **Communicate with customers** while simultaneously mitigating
- **Write postmortems** for regulatory and operational accountability
- **Avoid catastrophic actions** (promoting a replica prematurely causes data corruption)

## 🎯 Why This Environment?

Production incidents are **high-stakes, time-sensitive, policy-constrained decisions** that AI agents are increasingly being deployed to handle. This environment trains agents on the exact failure modes that cause real business damage:

- Wrong escalation path wastes 10+ minutes during an outage
- Flushing a poisoned cache before disabling writes causes immediate re-poisoning
- Missing customer communication during downtime violates SLA contracts
- Promoting a database replica prematurely causes irreversible data corruption

These are **not toy decisions**: they reflect real SRE runbooks used at scale.

## 🚀 Quick Start

### Local (Python)

```bash
git clone https://huggingface.co/spaces//incident-response-env
cd incident-response-env
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860
python inference.py --mock
```

### Docker

```bash
docker build -t incident-response-env .
docker run -p 7860:7860 \
  -e API_BASE_URL=https://api.openai.com/v1 \
  -e MODEL_NAME=gpt-4o-mini \
  -e HF_TOKEN=sk-... \
  incident-response-env
```

## 🌐 API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | Root health check |
| GET | `/health` | Health monitoring |
| POST | `/reset` | Reset episode `{"task_id": "..."}` |
| POST | `/step` | Take action `{"action": {...}}` |
| GET | `/state` | Full serialised state |
| GET | `/tasks` | List all task IDs |
| GET | `/action_space` | Action schema |
| GET | `/obs_space` | Observation schema |

## 📦 Project Structure

```
incident_response_env/
├── env/
│   ├── environment.py       # Main IncidentResponseEnv class
│   ├── models.py            # Typed Pydantic schemas
│   └── reward_function.py   # 7-component shaped reward
├── tasks/
│   └── task_definitions.py  # 3 tasks: easy → hard
├── graders/
│   └── graders.py           # Deterministic episode graders
server/
└── app.py                   # FastAPI HTTP server
inference.py                 # Baseline agent script
openenv.yaml                 # OpenEnv spec metadata
client.py                    # Typed HTTP client
```

## 🔭 Observation Space

```python
class Observation(BaseModel):
    task_id: str
    step: int
    max_steps: int
    incident_id: str
    alerts: List[Alert]                 # fired alerts with metrics
    service_status: Dict[str, str]      # service -> healthy|degraded|down
    recent_deployments: List[Dict]      # recent deploys with commit info
    runbook: Dict[str, str]             # available runbook steps
    knowledge_base: Dict[str, str]      # policy lookups
    last_action_result: Optional[str]
    assigned_severity: Optional[str]    # sev1|sev2|sev3|sev4
    tags: List[str]
    done: bool
    info: Dict[str, Any]
```

## ⚡ Action Space

```python
class Action(BaseModel):
    action_type: ActionType             # One of 9 types below
    query: Optional[str]                # investigate
    escalation_team: Optional[str]      # escalate: database|networking|security|payments|platform|management
    escalation_reason: Optional[str]    # escalate
    runbook_step: Optional[str]         # mitigate: key from runbook
    mitigation_note: Optional[str]      # mitigate
    message: Optional[str]              # communicate
    audience: Optional[str]             # communicate: customers|team|stakeholders|management
    deployment_id: Optional[str]        # rollback
    rollback_reason: Optional[str]      # rollback
    resolution_code: Optional[str]      # resolve: fixed|rolled_back|mitigated|false_alarm
    resolution_note: Optional[str]      # resolve
    postmortem: Optional[str]           # resolve (required for hard task)
    tags: Optional[List[str]]           # tag
    summary_text: Optional[str]         # summarize
```

**Action types:** `investigate`, `escalate`, `mitigate`, `communicate`, `rollback`, `resolve`, `tag`, `summarize`, `request_info`

**Terminal action:** `resolve` ends the episode immediately.

## 🏆 Reward Function

All rewards are deterministic (no LLM calls). Scores always in `[0.0, 1.0]`.

| Component | Weight | Description |
|-----------|--------|-------------|
| `severity_accuracy` | 0.15 | Correct SEV level assigned |
| `investigation_quality` | 0.20 | Queried relevant metrics/logs |
| `mitigation_quality` | 0.20 | Applied correct runbook step |
| `communication_quality` | 0.15 | Notified correct audience |
| `escalation_accuracy` | 0.15 | Paged correct team |
| `resolution_quality` | 0.10 | Correct code + postmortem |
| `efficiency_bonus` | 0.05 | Resolved under step budget |

**Penalties:**

| Violation | Penalty |
|-----------|---------|
| Wrong escalation team | −0.10 |
| Wrong action order (flush before disable) | −0.15 |
| Missing required field | −0.05 |
| Premature replica promotion | −0.20 |

## 📋 Tasks

### Task 1 - Easy: Database Connection Spike (max 4 steps)

**Scenario:** User service DB connection pool at 95% following deployment DEPLOY-441.

**Objectives:**

1. Investigate the recent deployment
2. Apply mitigation (rollback or increase pool size)
3. Resolve with correct code

**Pass threshold:** 0.60 | **Baseline expected:** ~0.80

### Task 2 - Medium: Payment Service Degradation (max 5 steps)

**Scenario:** Checkout error rate at 2.1% after Stripe SDK v4 migration. Payment gateway latency 4200ms.

**Objectives:**

1. Escalate to **payments team** (not security, not database)
2. Communicate to customers via status page
3. Investigate and mitigate
4. Resolve

**Pass threshold:** 0.55 | **Baseline expected:** ~0.75

**Key failure mode:** Escalating to wrong team (security/database) = −0.10 penalty.

### Task 3 - Hard: Cascading Failure / Full Site Outage (max 6 steps)

**Scenario:** CDN at 0% availability + DB replication lag 45s + cache poisoning, all simultaneously triggered by DEPLOY-455 aggressive cache prefetching.

**Objectives:**

1. Assign SEV1
2. **Disable cache writes FIRST**, before flushing (critical order)
3. Flush poisoned cache
4. Escalate to platform team
5. Notify management (SEV1 requirement)
6. Communicate outage to customers
7. Resolve with 20+ word postmortem

**Pass threshold:** 0.50 | **Baseline expected:** ~0.65

**Key failure modes:**

- Flushing cache before disabling writes = −0.15 (re-poisoning)
- Promoting replica prematurely = −0.20 (data corruption)
- Not notifying management = −0.05

## 📊 Baseline Scores (mock mode)

| Task | Difficulty | Baseline Score | Passed |
|------|-----------|---------------|--------|
| task_easy_db_spike | Easy | ~0.80 | ✅ |
| task_medium_payment_degradation | Medium | ~0.75 | ✅ |
| task_hard_cascading_failure | Hard | ~0.65 | ✅ |
| **Average** | | **~0.73** | **3/3** |

## 🔧 Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://api.openai.com/v1` |
| `MODEL_NAME` | Model identifier | `gpt-4o-mini` |
| `HF_TOKEN` | API key | (required) |
| `PORT` | Server port | `7860` |

## 🚀 Key Innovations

- **Order-sensitive grading**: Task 3 detects whether `disable_cache_writes` came before `flush_cache` by scanning action history
- **Catastrophic action detection**: `promote_replica` during active incident = −0.20 penalty
- **7-component shaped reward**: agents receive training signal even from partial solutions
- **Real SRE runbooks**: tasks based on actual incident response patterns
- **Policy-gated escalation**: wrong team escalation penalised deterministically

## 📄 License

Apache 2.0
````
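
The README's reward tables and its order-sensitive grading rule can be illustrated with a short sketch. This is not the repository's `reward_function.py`; it is a minimal example under stated assumptions. The weights and penalty sizes are copied from the README tables, and the runbook step names `disable_cache_writes`, `flush_cache`, and `promote_replica` come from the Key Innovations list. Everything else is hypothetical: the function names, the penalty keys, the choice to subtract penalties from the weighted sum and clamp to `[0.0, 1.0]`, and the simplification that any `promote_replica` step counts as premature.

```python
from typing import Dict, List

# Component weights copied from the README's Reward Function table.
WEIGHTS: Dict[str, float] = {
    "severity_accuracy": 0.15,
    "investigation_quality": 0.20,
    "mitigation_quality": 0.20,
    "communication_quality": 0.15,
    "escalation_accuracy": 0.15,
    "resolution_quality": 0.10,
    "efficiency_bonus": 0.05,
}

# Penalty sizes copied from the README's Penalties table.
# The dictionary keys are hypothetical labels chosen for this sketch.
PENALTIES: Dict[str, float] = {
    "wrong_escalation_team": 0.10,
    "wrong_action_order": 0.15,
    "missing_required_field": 0.05,
    "premature_replica_promotion": 0.20,
}


def flushed_before_disabling(runbook_steps: List[str]) -> bool:
    """Return True if flush_cache occurs with no earlier disable_cache_writes."""
    writes_disabled = False
    for step in runbook_steps:
        if step == "disable_cache_writes":
            writes_disabled = True
        elif step == "flush_cache" and not writes_disabled:
            return True
    return False


def detect_violations(runbook_steps: List[str]) -> List[str]:
    """Scan the mitigation history for the two order/safety violations."""
    violations: List[str] = []
    if flushed_before_disabling(runbook_steps):
        violations.append("wrong_action_order")
    # Simplification (assumption): any promote_replica step is treated as premature.
    if "promote_replica" in runbook_steps:
        violations.append("premature_replica_promotion")
    return violations


def total_reward(components: Dict[str, float], violations: List[str]) -> float:
    """Weighted sum of component scores minus penalties, clamped to [0.0, 1.0].

    Assumption: how the real grader combines components and penalties is not
    documented in the README; subtract-and-clamp is one plausible reading.
    """
    weighted = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    penalty = sum(PENALTIES[v] for v in violations)
    return min(1.0, max(0.0, weighted - penalty))


if __name__ == "__main__":
    history = ["flush_cache", "disable_cache_writes"]  # wrong order for Task 3
    perfect = {name: 1.0 for name in WEIGHTS}
    print(detect_violations(history), total_reward(perfect, detect_violations(history)))
```

Scanning the history with a single flag keeps the check deterministic and linear in the number of steps, which matches the README's claim that grading involves no LLM calls.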
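Tags: AI agents, API service, Docker, HuggingFace, Python, SLA management, SRE, Uvicorn, bias filtering, alert handling, online customer service, security defense evaluation, real-time decision-making, contingency planning, open source, reinforcement learning, fault simulation, no backdoors, disaster recovery, production environment, request interception, reverse-engineering tools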