yourcoder0/Incident-Response
GitHub: yourcoder0/Incident-Response
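An AI-driven production incident response simulation environment built on Docker and FastAPI, which trains agents to carry out realistic SRE response workflows under pressure.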
```markdown
---
title: IncidentResponseEnv
emoji: 🚨
colorFrom: red
colorTo: red
sdk: docker
pinned: false
license: apache-2.0
tags:
- openenv
- reinforcement-learning
- incident-response
- sre
---
# 🚨 IncidentResponseEnv — AI-Driven Production Incident Response
## 🧠 What Makes This Environment Different
IncidentResponseEnv simulates the high-stakes world of **production incident response** — where AI agents must act as on-call SRE engineers, making time-critical decisions under pressure.
Unlike toy environments, agents must:
- **Prioritise correctly** under cascading failures (wrong order = real damage)
- **Follow escalation protocols** (wrong team = wasted time)
- **Communicate with customers** while simultaneously mitigating
- **Write postmortems** for regulatory and operational accountability
- **Avoid catastrophic actions** (promoting a replica prematurely causes data corruption)
## 🎯 Why This Environment?
Production incidents demand **high-stakes, time-sensitive, policy-constrained decisions**, and AI agents are increasingly being deployed to make them. This environment trains agents on the exact failure modes that cause real business damage:
- Wrong escalation path wastes 10+ minutes during an outage
- Flushing a poisoned cache before disabling writes causes immediate re-poisoning
- Missing customer communication during downtime violates SLA contracts
- Promoting a database replica prematurely causes irreversible data corruption
These are **not toy decisions** — they reflect real SRE runbooks used at scale.
## 🚀 Quick Start
### Local (Python)
```
git clone https://huggingface.co/spaces//incident-response-env
cd incident-response-env
pip install -r requirements.txt
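# Start the API server (runs in the foreground)
uvicorn server.app:app --host 0.0.0.0 --port 7860
# In a second terminal, run the baseline agent in mock mode
python inference.py --mock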
```
### Docker
```
docker build -t incident-response-env .
docker run -p 7860:7860 \
-e API_BASE_URL=https://api.openai.com/v1 \
-e MODEL_NAME=gpt-4o-mini \
-e HF_TOKEN=sk-... \
incident-response-env
```
## 🌐 API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | Root health check |
| GET | `/health` | Health monitoring |
| POST | `/reset` | Reset episode `{"task_id": "..."}` |
| POST | `/step` | Take action `{"action": {...}}` |
| GET | `/state` | Full serialised state |
| GET | `/tasks` | List all task IDs |
| GET | `/action_space` | Action schema |
| GET | `/obs_space` | Observation schema |
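As a rough sketch of how a client might call these endpoints, the snippet below builds the JSON bodies for `/reset` and `/step` as described in the table and posts them with the Python standard library. The base URL and the helper names (`build_reset_payload`, `build_step_payload`, `post_json`) are assumptions for illustration, and the response layout is not documented here; the repository's own `client.py` may look different.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumed: server started as in Quick Start


def build_reset_payload(task_id: str) -> dict:
    """Body for POST /reset, following the endpoint table."""
    return {"task_id": task_id}


def build_step_payload(action: dict) -> dict:
    """Body for POST /step: the action is wrapped under the "action" key."""
    return {"action": action}


def post_json(path: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON body and decode the JSON reply (hypothetical helper)."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a running server (not executed here):
#   post_json("/reset", build_reset_payload("task_easy_db_spike"))
#   post_json("/step", build_step_payload(
#       {"action_type": "investigate", "query": "recent deployments"}))
```

Keeping payload construction separate from the network call makes the request shapes easy to check without a live server.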
## 📦 Project Structure
```
incident_response_env/
├── env/
│   ├── environment.py       # Main IncidentResponseEnv class
│   ├── models.py            # Typed Pydantic schemas
│   └── reward_function.py   # 7-component shaped reward
├── tasks/
│   └── task_definitions.py  # 3 tasks: easy → hard
├── graders/
│   └── graders.py           # Deterministic episode graders
├── server/
│   └── app.py               # FastAPI HTTP server
├── inference.py             # Baseline agent script
├── openenv.yaml             # OpenEnv spec metadata
└── client.py                # Typed HTTP client
```
## 🔭 Observation Space
```
class Observation(BaseModel):
    task_id: str
    step: int
    max_steps: int
    incident_id: str
    alerts: List[Alert]               # fired alerts with metrics
    service_status: Dict[str, str]    # service -> healthy|degraded|down
    recent_deployments: List[Dict]    # recent deploys with commit info
    runbook: Dict[str, str]           # available runbook steps
    knowledge_base: Dict[str, str]    # policy lookups
    last_action_result: Optional[str]
    assigned_severity: Optional[str]  # sev1|sev2|sev3|sev4
    tags: List[str]
    done: bool
    info: Dict[str, Any]
```
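To illustrate how an agent might read the `service_status` field, here is a minimal sketch that lists the services needing attention, most urgent first. The function name, the ranking, and the service names in the usage note are hypothetical; only the `healthy|degraded|down` values come from the schema.

```python
from typing import Dict, List

# Lower rank = more urgent; status values follow the service_status comment.
STATUS_RANK = {"down": 0, "degraded": 1, "healthy": 2}


def unhealthy_services(service_status: Dict[str, str]) -> List[str]:
    """Return non-healthy services: `down` first, then `degraded`, alphabetical within each."""
    affected = [name for name, status in service_status.items() if status != "healthy"]
    # Unknown status strings are treated as most urgent (rank 0) in this sketch.
    return sorted(
        affected,
        key=lambda name: (STATUS_RANK.get(service_status[name], 0), name),
    )
```

For example, `unhealthy_services({"cdn": "down", "cache": "degraded", "auth": "healthy"})` returns `["cdn", "cache"]`.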
## ⚡ Action Space
```
class Action(BaseModel):
    action_type: ActionType           # One of 9 types below
    query: Optional[str]              # investigate
    escalation_team: Optional[str]    # escalate: database|networking|security|payments|platform|management
    escalation_reason: Optional[str]  # escalate
    runbook_step: Optional[str]       # mitigate: key from runbook
    mitigation_note: Optional[str]    # mitigate
    message: Optional[str]            # communicate
    audience: Optional[str]           # communicate: customers|team|stakeholders|management
    deployment_id: Optional[str]      # rollback
    rollback_reason: Optional[str]    # rollback
    resolution_code: Optional[str]    # resolve: fixed|rolled_back|mitigated|false_alarm
    resolution_note: Optional[str]    # resolve
    postmortem: Optional[str]         # resolve (required for hard task)
    tags: Optional[List[str]]         # tag
    summary_text: Optional[str]       # summarize
```
**Action types:** `investigate`, `escalate`, `mitigate`, `communicate`, `rollback`, `resolve`, `tag`, `summarize`, `request_info`
**Terminal action:** `resolve` ends the episode immediately.
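Because every field except `action_type` is optional in the schema, an agent can easily submit an action without the fields its type needs. The sketch below checks for that before sending. The `REQUIRED_FIELDS` mapping is an assumption read off the schema comments, and the environment's actual rules about which fields are mandatory may differ.

```python
from typing import Dict, List

# Assumed mapping from action type to the fields it needs, inferred from the
# comments in the Action schema; the real environment may require fewer or more.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "investigate": ["query"],
    "escalate": ["escalation_team", "escalation_reason"],
    "mitigate": ["runbook_step"],
    "communicate": ["message", "audience"],
    "rollback": ["deployment_id", "rollback_reason"],
    "resolve": ["resolution_code", "resolution_note"],
    "tag": ["tags"],
    "summarize": ["summary_text"],
    "request_info": [],
}


def missing_fields(action: dict) -> List[str]:
    """List the assumed-required fields that are absent or empty for this action."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [name for name in REQUIRED_FIELDS[action_type] if not action.get(name)]
```

An `escalate` action that names a team but gives no reason would be reported as missing `escalation_reason`.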
## 🏆 Reward Function
All rewards are computed deterministically (no LLM calls). Scores always fall in `[0.0, 1.0]`.
| Component | Weight | Description |
|-----------|--------|-------------|
| `severity_accuracy` | 0.15 | Correct SEV level assigned |
| `investigation_quality` | 0.20 | Queried relevant metrics/logs |
| `mitigation_quality` | 0.20 | Applied correct runbook step |
| `communication_quality` | 0.15 | Notified correct audience |
| `escalation_accuracy` | 0.15 | Paged correct team |
| `resolution_quality` | 0.10 | Correct code + postmortem |
| `efficiency_bonus` | 0.05 | Resolved under step budget |
**Penalties:**
| Violation | Penalty |
|-----------|---------|
| Wrong escalation team | −0.10 |
| Wrong action order (flush before disable) | −0.15 |
| Missing required field | −0.05 |
| Premature replica promotion | −0.20 |
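One plausible way to combine the two tables is a weighted sum of component scores minus any penalties, clamped to `[0.0, 1.0]`. The sketch below does that. The weights and penalty sizes come from the tables, but the violation keys, the clamping step, and the function name are assumptions; the actual `reward_function.py` may combine them differently.

```python
WEIGHTS = {
    "severity_accuracy": 0.15,
    "investigation_quality": 0.20,
    "mitigation_quality": 0.20,
    "communication_quality": 0.15,
    "escalation_accuracy": 0.15,
    "resolution_quality": 0.10,
    "efficiency_bonus": 0.05,
}

# Hypothetical keys for the four violations in the penalty table.
PENALTIES = {
    "wrong_escalation_team": 0.10,
    "wrong_action_order": 0.15,
    "missing_required_field": 0.05,
    "premature_replica_promotion": 0.20,
}


def total_reward(components, violations=()):
    """Weighted sum of component scores (each in [0, 1]) minus penalties, clamped to [0, 1]."""
    weighted = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    penalty = sum(PENALTIES[name] for name in violations)
    return min(1.0, max(0.0, weighted - penalty))
```

The weights sum to 1.00, so a perfect episode with no violations scores 1.0, and the same episode with a wrong action order would score 0.85 under these assumptions.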
## 📋 Tasks
### Task 1 — Easy: Database Connection Spike (max 4 steps)
**Scenario:** User service DB connection pool at 95% following deployment DEPLOY-441.
**Objectives:**
1. Investigate the recent deployment
2. Apply mitigation (rollback or increase pool size)
3. Resolve with correct code
**Pass threshold:** 0.60 | **Baseline expected:** ~0.80
### Task 2 — Medium: Payment Service Degradation (max 5 steps)
**Scenario:** Checkout error rate at 2.1% after Stripe SDK v4 migration. Payment gateway latency 4200ms.
**Objectives:**
1. Escalate to **payments team** (not security, not database)
2. Communicate to customers via status page
3. Investigate and mitigate
4. Resolve
**Pass threshold:** 0.55 | **Baseline expected:** ~0.75
**Key failure mode:** Escalating to wrong team (security/database) = −0.10 penalty.
### Task 3 — Hard: Cascading Failure / Full Site Outage (max 6 steps)
**Scenario:** CDN at 0% availability + DB replication lag 45s + cache poisoning — all simultaneously triggered by DEPLOY-455 aggressive cache prefetching.
**Objectives:**
1. Assign SEV1
2. **Disable cache writes FIRST** — before flushing (critical order)
3. Flush poisoned cache
4. Escalate to platform team
5. Notify management (SEV1 requirement)
6. Communicate outage to customers
7. Resolve with 20+ word postmortem
**Pass threshold:** 0.50 | **Baseline expected:** ~0.65
**Key failure modes:**
- Flushing cache before disabling writes = −0.15 (re-poisoning)
- Promoting replica prematurely = −0.20 (data corruption)
- Not notifying management = −0.05
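The ordering rule can be checked by scanning the sequence of mitigation steps. The following is a minimal sketch of such a check, assuming each penalty applies at most once per episode. The step names match those mentioned elsewhere in this README, while the function name and the once-per-episode rule are assumptions.

```python
from typing import List


def task3_penalties(runbook_steps: List[str]) -> float:
    """Sum the Task 3 penalties triggered by an ordered list of runbook steps."""
    penalty = 0.0
    if "flush_cache" in runbook_steps:
        first_flush = runbook_steps.index("flush_cache")
        # Penalise a flush that happens before writes were ever disabled.
        if "disable_cache_writes" not in runbook_steps[:first_flush]:
            penalty += 0.15
    if "promote_replica" in runbook_steps:
        # Replica promotion during the incident counts as catastrophic.
        penalty += 0.20
    return penalty
```

Under these assumptions, `["disable_cache_writes", "flush_cache"]` incurs no penalty, while `["flush_cache", "disable_cache_writes"]` incurs 0.15.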
## 📊 Baseline Scores (mock mode)
| Task | Difficulty | Baseline Score | Passed |
|------|-----------|---------------|--------|
| task_easy_db_spike | Easy | ~0.80 | ✅ |
| task_medium_payment_degradation | Medium | ~0.75 | ✅ |
| task_hard_cascading_failure | Hard | ~0.65 | ✅ |
| **Average** | | **~0.73** | **3/3** |
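The average and pass count in the table can be reproduced from the per-task pass thresholds listed in the Tasks section. This is a small worked sketch; the dictionary layout and the `summarize_scores` helper are illustrative and not part of the repository.

```python
# Pass thresholds and approximate baseline scores as stated in this README.
THRESHOLDS = {
    "task_easy_db_spike": 0.60,
    "task_medium_payment_degradation": 0.55,
    "task_hard_cascading_failure": 0.50,
}
BASELINE_SCORES = {
    "task_easy_db_spike": 0.80,
    "task_medium_payment_degradation": 0.75,
    "task_hard_cascading_failure": 0.65,
}


def summarize_scores(scores, thresholds):
    """Return (average score, number of tasks at or above their pass threshold)."""
    average = sum(scores.values()) / len(scores)
    passed = sum(1 for task, score in scores.items() if score >= thresholds[task])
    return average, passed


average, passed = summarize_scores(BASELINE_SCORES, THRESHOLDS)
print(f"average={average:.2f} passed={passed}/{len(BASELINE_SCORES)}")
```

The mean of 0.80, 0.75 and 0.65 is about 0.733, which matches the ~0.73 in the table, and all three scores clear their thresholds.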
## 🔧 Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://api.openai.com/v1` |
| `MODEL_NAME` | Model identifier | `gpt-4o-mini` |
| `HF_TOKEN` | API key | (required) |
| `PORT` | Server port | `7860` |
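A script could read these variables along the following lines. This is only a sketch under the defaults in the table: the `load_settings` name and the returned dictionary keys are hypothetical, and `inference.py` may handle configuration differently.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented variables, applying the table's defaults."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN has no default in the table, so fail early.
        raise RuntimeError("HF_TOKEN is required")
    return {
        "api_base_url": env.get("API_BASE_URL", "https://api.openai.com/v1"),
        "model_name": env.get("MODEL_NAME", "gpt-4o-mini"),
        "hf_token": token,
        "port": int(env.get("PORT", "7860")),
    }
```

Passing a plain mapping instead of relying on `os.environ` keeps the function easy to exercise in tests.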
## 🚀 Key Innovations
- **Order-sensitive grading** — Task 3 detects whether `disable_cache_writes` came before `flush_cache` by scanning action history
- **Catastrophic action detection** — `promote_replica` during active incident = −0.20 penalty
- **7-component shaped reward** — agents receive training signal even from partial solutions
- **Real SRE runbooks** — tasks based on actual incident response patterns
- **Policy-gated escalation** — wrong team escalation penalised deterministically
## 📄 License
Apache 2.0
```
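Tags: AI agents, API service, Docker, HuggingFace, Python, SLA management, SRE, Uvicorn, bias filtering, alert handling, online customer service, security defense evaluation, real-time decision-making, emergency response plans, open source, reinforcement learning, fault simulation, backdoor-free, disaster recovery, production environment, request interception, reverse-engineering tools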