md-saqhib/incident-response-env

GitHub: md-saqhib/incident-response-env

一个符合 OpenEnv 标准的强化学习环境，用于训练 LLM 智能体在模拟环境中安全地诊断和修复生产系统故障。

Stars: 0 | Forks: 0

## 标题：IncidentResponseEnv emoji: 🎯 colorFrom: blue colorTo: green sdk: docker app_file: app/main.py pinned: false # 🎯 IncidentResponseEnv：生产事故响应训练 **一个符合 OpenEnv 标准的 RL 环境，LLM 智能体在此学习诊断和修复生产事故——安全、可衡量且可扩展。** [![在线演示](https://img.shields.io/badge/🌐%20Live%20Demo-HuggingFace%20Spaces-orange)](https://huggingface.co/spaces/Saqhibb/incident-response-env) [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/) [![FastAPI](https://img.shields.io/badge/FastAPI-0.115.0-009688)](https://fastapi.tiangolo.com/) [![兼容 OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-green)](https://github.com/raun/openenv) [![测试通过](https://img.shields.io/badge/Tests-Passing-brightgreen)](./test_env.py) ## 🚀 立即在线体验 **无需安装！立即测试 API：** ``` # 检查 API 是否正在运行 curl https://Saqhibb-incident-response-env.hf.space/health # 启动新 incident curl -X POST https://Saqhibb-incident-response-env.hf.space/reset \ -H "Content-Type: application/json" \ -d '{"task_id": "single_service_down"}' # 采取行动 curl -X POST https://Saqhibb-incident-response-env.hf.space/step \ -H "Content-Type: application/json" \ -d '{"action": "investigate", "target": "payment-service", "context": "Check memory usage"}' ``` **[👉 在 HuggingFace Spaces 打开完整交互式演示](https://huggingface.co/spaces/Saqhibb/incident-response-env)** ## 📋 这是什么？ IncidentResponseEnv 是一个**生产级 RL 环境**，其中： 1. **LLM 智能体接收实时告警**，了解生产问题（服务宕机、指标劣化、日志激增） 2. **智能体采取行动**来调查、诊断和修复问题 3. **智能体获得即时奖励反馈**（0.0–1.0），基于正确性和效率 4. **在模拟环境中进行安全训练**——不会对生产环境造成损害 ### 为什么这很重要 - **降低事故响应成本** 90%（相比传统培训） - **无限扩展训练**——随时、随地练习任何场景 - **系统化构建智能体能力**，进度可衡量 - **利用标准化基准研究 ML 中的事故响应** - **符合 OpenEnv 标准**——兼容 Llama、Mistral、GPT-4 及任何 LLM 框架 ## 🎮 工作原理：3 个渐进式任务在难度递增的任务中训练你的智能体： ### 1️⃣ **简单：单服务宕机** ``` 🚨 Scenario: payment-service crashes due to OutOfMemoryError ⏱️ Time: 300 seconds | Steps: max 10 📊 Difficulty: 🟢 Easy ``` - **问题**：一个服务出现 100% 错误率，影响下游服务 - **提示**：指标中的高内存使用率会指引你 - **解决方案**：诊断 OOM（内存溢出），重启服务 - **奖励**：+0.40 诊断，+0.35 修复，+0.15 时间奖励，+0.10 效率 **平均解决时间**：3-5 步 ### 2️⃣ **中等：级联故障** ``` 🚨 Scenario: Database connection pool exhaustion ⏱️ Time: 300 seconds | Steps: max 15 📊 Difficulty: 🟡 Medium ``` - **问题**：postgres-db 连接池限制为 100，且已全满 - **转折**：故障级联到多个服务，症状偏离根本原因 - **挑战**：必须识别池耗尽，尽管在 api-gateway、order-service 中看到错误 - **奖励**：多服务诊断的奖励略高 **平均解决时间**：6-10 步 ### 3️⃣ **困难：静默内存泄漏** ``` 🚨 Scenario: AnalyticsReportCache memory leak ⏱️ Time: 300 seconds | Steps: max 20 📊 Difficulty: 🔴 Hard ``` - **问题**：没有 CRITICAL 告警！内存在 6 小时内静默劣化 - **挑战**：必须在没有明显告警的情况下识别指标趋势 - **所需专业知识**：理解系统行为、趋势分析 - **奖励**：识别隐藏问题的奖励最高 **平均解决时间**：10-15 步 ## 📊 奖励结构智能体接收鼓励渐进式进步的细粒度反馈： ``` Correct Diagnosis → +0.40 (one-time, max once) Correct Fix → +0.35 (one-time, max once) Time Efficiency Bonus → +0.15 × (time_remaining / time_budget) Action Efficiency Bonus → +0.10 × (1 - wrong_actions / total_actions) Wrong Action Penalty → -0.10 per incorrect action ___________________________ MAX TOTAL: 1.0 (perfect episode) MIN TOTAL: 0.0 (clipped) ``` 每一步的**部分奖励**意味着智能体可以看到学习进度，无需等待回合结束。 ## 🔌 API 端点所有端点均为 **OpenEnv 标准**，适用于任何 LLM 智能体： ### `POST /reset` 启动一个新的任务回合 ``` curl -X POST https://Saqhibb-incident-response-env.hf.space/reset \ -H "Content-Type: application/json" \ -d '{"task_id": "single_service_down"}' ``` **响应**：完整初始状态（alerts、metrics、logs、services） ### `POST /step` 执行一个动作并获取下一个观测值 + 奖励 ``` curl -X POST https://Saqhibb-incident-response-env.hf.space/step \ -H "Content-Type: application/json" \ -d '{"action": "investigate", "target": "payment-service", "context": "Memory analysis"}' ``` **响应**：observation、reward、done 标志、progress 信息 ### `GET /state` 获取当前状态（不执行动作） ``` curl https://Saqhibb-incident-response-env.hf.space/state ``` ### `GET /tasks` 列出所有可用的事故 ``` curl https://Saqhibb-incident-response-env.hf.space/tasks ``` ### `GET /health` 健康检查 ``` curl https://Saqhibb-incident-response-env.hf.space/health ``` ## 💻 Python 集成 ### 最快上手：直接 API 调用 ``` import requests # Reset response = requests.post( "https://Saqhibb-incident-response-env.hf.space/reset", json={"task_id": "single_service_down"} ) state = response.json() # Step response = requests.post( "https://Saqhibb-incident-response-env.hf.space/step", json={"action": "investigate", "target": "payment-service", "context": ""} ) result = response.json() print(f"Reward: {result['reward']}, Done: {result['done']}") ``` ### 全功能客户端 ``` from incident_response_client import IncidentResponseEnv env = IncidentResponseEnv("https://Saqhibb-incident-response-env.hf.space") state = env.reset("single_service_down") for step in range(10): action = {"action": "investigate", "target": "payment-service", "context": ""} result = env.step(action) print(f"Step {step}: Reward {result['reward']}, Done {result['done']}") if result['done']: break ``` **查看 [USAGE_EXAMPLES.md](USAGE_EXAMPLES.md) 获取完整的 Python/cURL 示例** ← 完整集成指南 ## 🎓 完整示例：解决事故 ``` #!/bin/bash BASE="https://Saqhibb-incident-response-env.hf.space" # 1. RESET - 启动新 incident echo "🚀 Starting incident..." RESET=$(curl -s -X POST $BASE/reset -H "Content-Type: application/json" \ -d '{"task_id": "single_service_down"}') echo $RESET | jq '.alerts[] | select(.severity == "critical")' # 2. INVESTIGATE - 收集信息 echo "🔍 Investigating payment-service..." STEP1=$(curl -s -X POST $BASE/step -H "Content-Type: application/json" \ -d '{"action": "investigate", "target": "payment-service", "context": "Check memory"}') echo "Reward: $(echo $STEP1 | jq .reward)" # 3. DIAGNOSE - 识别 root cause echo "🔧 Diagnosing..." STEP2=$(curl -s -X POST $BASE/step -H "Content-Type: application/json" \ -d '{"action": "diagnose", "target": "payment-service", "context": "OutOfMemoryError"}') echo "Reward: $(echo $STEP2 | jq .reward), Progress: $(echo $STEP2 | jq '.info.progress')" # 4. FIX - 应用解决方案 echo "✅ Fixing..." STEP3=$(curl -s -X POST $BASE/step -H "Content-Type: application/json" \ -d '{"action": "fix", "target": "payment-service", "context": "Restart"}') echo "Reward: $(echo $STEP3 | jq .reward), Done: $(echo $STEP3 | jq .done)" ``` ## 🏗️ 架构 ``` ┌─────────────────────────────────────────────────────────────┐ │ LLM Agent (Llama, Mistral, GPT-4, etc.) │ │ Uses OpenEnv standard reset/step interface │ └──────────────────┬──────────────────────────────────────────┘ │ ↓ ┌──────────────────────────────────────────────────────────────┐ │ FastAPI REST API │ │ POST /reset, POST /step, GET /state, GET /tasks │ │ Port: 7860 (HF Spaces) or 8000 (local) │ └──────────────────┬───────────────────────────────────────────┘ │ ↓ ┌──────────────────────────────────────────────────────────────┐ │ IncidentResponseEnv (Core Orchestrator) │ │ - Manages episode state │ │ - Routes actions to tasks │ │ - Calculates rewards │ └──────────────────┬───────────────────────────────────────────┘ │ ┌──────────┼──────────┐ ↓ ↓ ↓ ┌────────┬─────────┬──────────┐ │ Easy │ Medium │ Hard │ │ Task │ Task │ Task │ └────────┴─────────┴──────────┘ ↓ ↓ ↓ Alerts, Metrics, Logs, Services Status Reward Feedback ``` **类型安全设计**：所有请求/响应均由 Pydantic 模型验证 ## 💻 本地开发 ### 安装与运行 ``` # 1. Clone 并进入目录 git clone cd incident-response-env # 2. 安装依赖 pip install -r requirements.txt # 3. 启动 server uvicorn app.main:app --host 0.0.0.0 --port 8000 # 4. 在另一个终端测试 python example_agent_interaction.py # 或 curl http://localhost:8000/health ``` ### 运行测试 ``` pytest test_env.py test_api.py test_inference.py -v ``` 所有测试通过 ✅ ### 部署到 HF Spaces ``` # 初始化 git (如果尚未初始化) git init git remote add origin https://huggingface.co/spaces//incident-response-env # Commit 并 push git add . git commit -m "Initial commit" git push -u origin main ``` HF Spaces 自动构建 Docker 镜像并部署 🚀 ## 📚 文档 | 文档 | 用途 | |----------|---------| | **[PITCH.md](PITCH.md)** | 简洁的问题/解决方案/价值陈述 | | **[USAGE_EXAMPLES.md](USAGE_EXAMPLES.md)** | 完整 API 示例（Python、cURL、LLM 集成） | | **[QUICK_START.md](QUICK_START.md)** | 5 分钟设置指南 | | **[openenv.yaml](openenv.yaml)** | 正式 OpenEnv 规范 | | **[BUILD_SUMMARY.md](BUILD_SUMMARY.md)** | 完整技术概览 | ## 🏆 项目状态 | 组件 | 状态 | 备注 | |-----------|--------|-------| | 代码 | ✅ 完成 | 4,000+ 行，完全测试 | | API | ✅ 在线 | https://Saqhibb-incident-response-env.hf.space | | Docker | ✅ 已部署 | 在 HF Spaces 上自动构建 | | 测试 | ✅ 全部通过 | test_env.py, test_api.py, test_inference.py | | 文档 | ✅ 完成 | 完整使用指南和示例 | ## 🤖 配合 LLM 智能体使用 ### 示例：通过 HuggingFace API 使用 Llama 2 ``` import requests import json HF_API = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf" HF_TOKEN = "hf_your_token" ENV_URL = "https://Saqhibb-incident-response-env.hf.space" # Reset 环境 env_state = requests.post(f"{ENV_URL}/reset", json={"task_id": "single_service_down"}).json() # 询问 LLM 该怎么做 prompt = f"""You are an incident response expert. Current alerts: {json.dumps(env_state['alerts'][:2])} What action should you take? Respond with JSON: {{"action": "investigate", "target": "..."}}""" response = requests.post( HF_API, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json={"inputs": prompt} ) # 执行操作 action = json.loads(response.json()[0]['generated_text'].split("```json")[-1].split("```")[0]) result = requests.post(f"{ENV_URL}/step", json=action).json() print(f"Reward: {result['reward']}") ``` 查看 [USAGE_EXAMPLES.md](USAGE_EXAMPLES.md) 获取完整的 LLM 集成示例 ✓ ## 📈 基准测试 IncidentResponseEnv 为以下场景提供**标准化基准**： - **LLM 智能体事故响应性能** - **比较不同模型架构**（Llama vs Mistral vs GPT） - **衡量训练效率**（收敛速度、样本效率） - **理解复杂场景中的多步推理** ## 🎯 关键特性 ✅ **OpenEnv 标准** - 兼容任何 LLM 框架（Hugging Face、Ollama 等） ✅ **3 个渐进式任务** - 简单、中等、困难难度级别 ✅ **类型安全 API** - Pydantic 验证确保正确性 ✅ **在线演示** - 在 HF Spaces 上立即测试 ✅ **完整文档** - API 示例、Python 集成、LLM 设置 ✅ **生产就绪** - Docker 容器化、已测试、已部署 ✅ **研究级** - 适用于 ML 基准测试和评估 ## 📄 引用如果你在研究中使用 IncidentResponseEnv，请引用： ``` @software{IncidentResponseEnv2024, title={IncidentResponseEnv: An OpenEnv-Compliant Production Incident Response Training Environment}, author={Saqhibb}, year={2024}, url={https://huggingface.co/spaces/Saqhibb/incident-response-env} } ``` ## 📞 支持 - **在线 API**：https://Saqhibb-incident-response-env.hf.space/health - **GitHub**：[Repository URL] - **有问题？** 查看 [USAGE_EXAMPLES.md](USAGE_EXAMPLES.md) 或提交 issue ## 📜 许可证 MIT License - 详情见 LICENSE 文件 ## 🙏 致谢构建工具： - [FastAPI](https://fastapi.tiangolo.com/) - 现代异步 Web 框架 - [Pydantic](https://pydantic-settings.readthedocs.io/) - 类型安全模型 - [HuggingFace Spaces](https://huggingface.co/spaces) - 免费云端部署 - [OpenEnv](https://github.com/raun/openenv) - 标准 RL 环境接口 **准备好训练你的第一个智能体了吗？** 🚀 1. [试用在线演示](https://huggingface.co/spaces/Saqhibb/incident-response-env) 2. [阅读使用示例](USAGE_EXAMPLES.md) 3. [开始与你的 LLM 集成](USAGE_EXAMPLES.md#integration-with-llm-agents) 祝你好运！ 🎯 ## 🏃 快速开始（5 分钟） ### 1. 安装依赖 ``` pip install -r requirements.txt ``` ### 2. 启动服务器 ``` uvicorn app.main:app --host 0.0.0.0 --port 8000 ``` ### 3. 测试（新终端） ``` # 选项 A：运行示例 walkthrough python example_agent_interaction.py # 选项 B：使用 curl curl -X POST http://localhost:8000/reset \ -H "Content-Type: application/json" \ -d '{"task_id": "single_service_down"}' ``` ### 4. 运行 LLM 智能体（可选） ``` export HF_TOKEN=hf_your_token_here python inference.py ``` 详细设置请参阅 [QUICK_START.md](QUICK_START.md) ## 🔌 API 端点 | Method | Path | 用途 | |--------|------|---------| | `POST` | `/reset` | 启动新回合 | | `POST` | `/step` | 执行动作 | | `GET` | `/state` | 获取当前状态 | | `GET` | `/tasks` | 列出所有任务 | | `GET` | `/health` | 健康检查 | ### 示例：解决事故 ``` # 1. 启动 episode curl -X POST http://localhost:8000/reset \ -d '{"task_id": "single_service_down"}' # 2. 调查 payment-service curl -X POST http://localhost:8000/step \ -d '{"action_type": "investigate", "target": "payment-service"}' # 3. 诊断问题 curl -X POST http://localhost:8000/step \ -d '{"action_type": "diagnose", "target": "OutOfMemoryError"}' # 4. 修复它 curl -X POST http://localhost:8000/step \ -d '{"action_type": "fix", "target": "payment-service"}' # 如果所有步骤正确，Reward = 1.0！ ✅ ``` ## 📊 奖励结构 - **正确诊断**：+0.40（一次性奖励） - **正确修复**：+0.35（一次性奖励） - **时间奖励**：+0.15 × (剩余时间 / 时间预算) - **效率奖励**：+0.10 × (1 - 错误动作数 / 总动作数) - **错误动作惩罚**：每次错误 -0.10 - **范围**：[0.0, 1.0]，带截断智能体随着进展获得**部分奖励**，鼓励增量学习。 ## 🏗️ 架构 ``` LLM Agent (Llama, Mistral, GPT, etc.) ↓ REST API (/reset, /step, /state) ↓ IncidentResponseEnv (core environment class) ↓ Task (SingleServiceDown, CascadingFailure, MemoryLeak) ↓ Reward Calculator ↓ Agent Feedback (observation + reward) ``` 使用 Pydantic 模型的**类型安全接口**确保正确性。 ## 🐳 Docker 部署 ``` # 构建 image docker build -t incident-response-env . # 本地运行 docker run -p 7860:7860 incident-response-env # 或者部署到 HF Spaces (自动) git push hf main ``` ## 📁 文件结构 ``` incident-response-env/ ├── app/ │ ├── main.py # FastAPI server (6 endpoints) │ ├── env.py # IncidentResponseEnv class │ ├── models.py # Pydantic data models │ └── tasks/ # 3 incident scenarios │ ├── base.py │ ├── task_easy.py │ ├── task_medium.py │ └── task_hard.py ├── inference.py # LLM agent (HuggingFace) ├── requirements.txt # Dependencies ├── Dockerfile # Docker config ├── openenv.yaml # OpenEnv spec └── test_*.py # Unit tests (all passing) ``` ## ✅ 测试所有测试通过： ``` python -m pytest test_env.py test_api.py test_inference.py -v ``` - ✅ **test_env.py**：核心环境（reset、step、state） - ✅ **test_api.py**：所有 6 个端点 - ✅ **test_inference.py**：LLM 智能体脚本 ## 🎓 了解更多 - **想了解 OpenEnv？** → 参阅 [OpenEnv Course](https://github.com/raun/openenv-course) - **想了解部署细节？** → 参阅 [BUILD_SUMMARY.md](BUILD_SUMMARY.md) - **想提交参加黑客松？** → 参阅 [HACKATHON_SUBMISSION_ROADMAP.md](HACKATHON_SUBMISSION_ROADMAP.md)

标签：AIOps, API集成, AV绕过, Docker, FastAPI, HuggingFace Spaces, LLM训练, OpenEnv, Python, 仿真模拟, 可观测性, 大模型智能体, 安全防御评估, 强化学习环境, 故障诊断, 无后门, 生产事故响应, 站点可靠性工程, 系统恢复, 自动化运维, 请求拦截, 运维自动化, 逆向工具