yashwanthprabhu07/incident-response-agent

GitHub: yashwanthprabhu07/incident-response-agent

An AI-powered DevOps incident response agent with persistent memory. It targets the repeated failures and long MTTR caused by organizational forgetting.

Stars: 0 | Forks: 0

```
I N C I D E N T
R E S P O N S E   A G E N T  -  P O W E R E D   B Y   H I N D S I G H T
```
**An AI-powered DevOps agent that remembers every incident, learns from every resolution, and gets smarter with every outage.**
[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org) [![FastAPI](https://img.shields.io/badge/FastAPI-0.110+-009688?style=for-the-badge&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com) [![Groq](https://img.shields.io/badge/Groq-LLM-F55036?style=for-the-badge&logo=groq&logoColor=white)](https://groq.com) [![Hindsight](https://img.shields.io/badge/Hindsight-Memory-6C3DE0?style=for-the-badge)](https://hindsight.vectorize.io) [![License](https://img.shields.io/badge/License-MIT-22C55E?style=for-the-badge)](LICENSE)

## 🚨 The Problem

Every DevOps team faces the same nightmare:

- Production goes down at 2 a.m.
- The on-call engineer has never seen this error before
- They spend 45 minutes digging through Slack history, Confluence docs, and old Jira tickets
- They finally discover that *someone already fixed this exact problem three months ago*
- Total downtime: **2 hours**. Total cost: **thousands of dollars**.

**The root cause is not a bug. It is organizational forgetting.**

## 💡 The Solution

**Incident Response Agent** is an AI SRE that never forgets. Every time an incident occurs, it:

1. **Retains**: persists the incident, its root cause, and its resolution
2. **Recalls**: surfaces similar past incidents as soon as a new incident occurs
3. **Reflects**: synthesizes patterns across all incidents to proactively prevent future failures

It is not a chatbot. It is an **operational brain that accumulates knowledge over time**.

## ✨ Demo: Before vs. After

### ❌ Without Memory (Generic LLM)

```
User: Our Kubernetes pods are in CrashLoopBackOff after a config update.

LLM: CrashLoopBackOff usually means the container is crashing repeatedly.
     Check your logs with kubectl logs . Common causes include
     misconfiguration, missing dependencies, or resource limits...
```

*Generic and unhelpful. You already knew all of this.*

### ✅ With Hindsight Memory (Our Agent)

```
User: Our Kubernetes pods are in CrashLoopBackOff after a config update.

Agent: 🔍 Recalling similar past incidents...

MATCH FOUND - INC004 (18 min resolution):
"Payment service pods entered CrashLoopBackOff after config map update.
 Root cause: missing PAYMENT_API_KEY environment variable."

📋 DIAGNOSIS:
Most Likely Cause: Missing env var in config map after your update.

Immediate Actions:
1. kubectl get configmap -o yaml → check PAYMENT_API_KEY is present
2. kubectl describe pod → look for "env var not found"
3. kubectl rollout restart deployment/payment-service after fix

Estimated Resolution: ~18 minutes (based on past incident)
Prevention: Add config validation step to your CI/CD pipeline
```

*Specific, actionable, and grounded in real organizational memory.*

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       USER / ENGINEER                       │
└─────────────────────────┬───────────────────────────────────┘
                          │ describes incident
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                       FASTAPI BACKEND                       │
│                      api/main.py :8000                      │
└──────┬──────────────────┬──────────────────┬────────────────┘
       │                  │                  │
       ▼                  ▼                  ▼
┌──────────────┐  ┌───────────────┐  ┌──────────────────────┐
│   RUNBOOKS   │  │ GROQ LLM API  │  │  HINDSIGHT MEMORY    │
│ runbooks.py  │  │ llama-3.3-70b │  │  vectorize.io        │
│              │  │               │  │                      │
│ • database   │  │  Diagnosis    │  │ retain()  ←─ store   │
│ • kubernetes │  │  Generation   │  │ recall()  ←─ search  │
│ • memory     │  │               │  │ reflect() ←─ learn   │
│ • cpu/disk   │  └───────────────┘  └──────────────────────┘
│ • ssl/network│
└──────────────┘
```

## 🧠 How Hindsight Memory Works

Hindsight provides three core memory operations that drive this agent:

| Operation | What it does | When it is used |
|-----------|--------------|-----------------|
| `retain()` | Stores incident details as persistent memory | After every incident report or resolution |
| `recall()` | Searches past incidents by semantic similarity | Before diagnosing any new incident |
| `reflect()` | Synthesizes patterns across all stored incidents | Weekly ops reviews, proactive prevention |

### Memory Flow

```
New Incident Reported
        │
        ▼
recall(incident)  ──→ Top 5 similar past incidents
        │
        ▼
LLM Diagnosis     ──→ Root cause + actions + timeline
        │
        ▼
retain(result)    ──→ Stored for future recall
        │
        ▼ (Weekly)
reflect()         ──→ "Database issues spike on Fridays after deploy"
```

## 📁 Project Structure

```
incident-response-agent/
│
├── 📁 agent/
│   ├── incident_agent.py         # Main agent loop (CLI interface)
│   ├── hindsight_tools.py        # retain / recall / reflect wrappers
│   └── runbooks.py               # 8 domain-specific runbooks
│
├── 📁 api/
│   └── main.py                   # FastAPI REST API (5 endpoints)
│
├── 📁 data/
│   └── synthetic_incidents.json  # 10 realistic past incidents
│
├── 📁 frontend/                  # Dashboard (coming soon)
│
├── .env                          # API keys (not committed)
├── requirements.txt              # Python dependencies
└── README.md                     # You are here
```
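The retain / recall / reflect loop described above can be illustrated with a small, self-contained sketch. This is not the project's code and does not use the real Hindsight client: `MemoryBank`, `Incident`, and `diagnose` are hypothetical names, recall is approximated with word overlap (Jaccard similarity) rather than semantic search, and the LLM step is replaced by a formatted string. The `INC900` record is made-up sample data.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Incident:
    incident_id: str
    description: str
    root_cause: str
    resolution_minutes: int


def _tokens(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,:;()\"'").lower() for w in text.split())
    return {w for w in words if w}


@dataclass
class MemoryBank:
    """Hypothetical in-memory stand-in for a Hindsight memory bank."""

    incidents: list[Incident] = field(default_factory=list)

    def retain(self, incident: Incident) -> None:
        """Store an incident so later queries can find it."""
        self.incidents.append(incident)

    def recall(self, query: str, top_k: int = 5) -> list[Incident]:
        """Return up to top_k incidents ranked by word overlap with the query."""
        query_tokens = _tokens(query)

        def score(incident: Incident) -> float:
            incident_tokens = _tokens(incident.description)
            union = query_tokens | incident_tokens
            return len(query_tokens & incident_tokens) / len(union) if union else 0.0

        ranked = sorted(self.incidents, key=score, reverse=True)
        return [inc for inc in ranked if score(inc) > 0][:top_k]

    def reflect(self) -> list[tuple[str, int]]:
        """Summarize patterns: root causes ordered by how often they occurred."""
        return Counter(inc.root_cause for inc in self.incidents).most_common()


def diagnose(bank: MemoryBank, description: str) -> str:
    """Recall similar incidents, then build a diagnosis (the LLM call is stubbed out)."""
    matches = bank.recall(description)
    if not matches:
        return "No similar past incidents found."
    best = matches[0]
    return (
        f"Similar to {best.incident_id}: {best.root_cause} "
        f"(resolved in ~{best.resolution_minutes} min)"
    )


bank = MemoryBank()
bank.retain(Incident(
    "INC004",
    "Payment service pods entered CrashLoopBackOff after config map update",
    "missing PAYMENT_API_KEY environment variable",
    18,
))
bank.retain(Incident(  # made-up record, for contrast only
    "INC900",
    "Disk usage reached 95 percent on the logging node",
    "log rotation disabled",
    40,
))

print(diagnose(bank, "Our Kubernetes pods are in CrashLoopBackOff after a config update."))
```

Word overlap is only there to keep the sketch dependency-free. A real memory service would presumably match on meaning, so a query phrased with different words could still find the right incident.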
## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- [Groq API Key](https://console.groq.com) (free)
- [Hindsight Cloud account](https://ui.hindsight.vectorize.io) (free)

### 1. Clone the repository

```
git clone https://github.com/YOUR_USERNAME/incident-response-agent.git
cd incident-response-agent
```

### 2. Set up a virtual environment

```
python -m venv venv

# Windows
venv\Scripts\activate

# Mac/Linux
source venv/bin/activate
```

### 3. Install dependencies

```
pip install fastapi uvicorn hindsight-client groq python-dotenv requests
```

### 4. Configure environment variables

Create a `.env` file in the project root:

```
GROQ_API_KEY=your_groq_api_key_here
HINDSIGHT_BASE_URL=https://api.hindsight.vectorize.io
HINDSIGHT_BANK_ID=incident-memory
HINDSIGHT_API_KEY=your_hindsight_cloud_key_here
```

### 5. Run the agent

```
cd agent
python incident_agent.py
```

### 6. Seed past incidents into memory

```
Enter choice (1-4): 1
```

This loads 10 realistic past incidents into Hindsight memory, giving the agent an initial knowledge base.

### 7. Diagnose your first incident

```
Enter choice (1-4): 2
```

Describe any production incident and watch the agent recall related past incidents and generate a targeted diagnosis.

## 🌐 REST API

Start the API server:

```
cd api
uvicorn main:app --reload --port 8000
```

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Health check |
| `/seed` | POST | Load all past incidents into memory |
| `/diagnose` | POST | Diagnose a new incident |
| `/recall` | POST | Search past incident memories |
| `/patterns` | GET | Get patterns via `reflect()` |

### Example API call

```
curl -X POST http://localhost:8000/diagnose \
  -H "Content-Type: application/json" \
  -d '{"description": "Kubernetes pods in CrashLoopBackOff after config update"}'
```

## 📊 Judging Criteria Coverage

| Criterion | Weight | How we address it |
|-----------|--------|-------------------|
| **Innovation** | 30% | An AI agent with persistent memory that accumulates knowledge, not a stateless chatbot |
| **Use of Hindsight memory** | 25% | All 3 operations (`retain`, `recall`, `reflect`) with a clear before/after comparison |
| **Technical implementation** | 20% | Python agent + FastAPI backend + Hindsight Cloud + Groq LLM |
| **User experience** | 15% | CLI with a clear menu, emoji feedback, and structured diagnosis reports |
| **Real-world impact** | 10% | Every engineering team loses hours to repeated incidents, and this agent targets exactly that |

## 🗺️ Roadmap

- [ ] React dashboard with incident timeline visualization
- [ ] Slack/PagerDuty webhook integration to trigger diagnosis automatically
- [ ] Auto-generated weekly review reports
- [ ] Multi-team memory banks (per service/squad)
- [ ] Incident severity prediction based on memory patterns

## 🛠️ Tech Stack

| Layer | Technology |
|-------|------------|
| Memory | [Hindsight by Vectorize](https://hindsight.vectorize.io) |
| LLM | [Groq](https://groq.com) (llama-3.3-70b-versatile) |
| Backend | [FastAPI](https://fastapi.tiangolo.com) |
| Agent | Python 3.10+ |
| HTTP client | requests |

## 👥 Team

Built with care by **Team [Your Team Name]** for **Hindsight Hackathon (HackWithChennai 2026)**

## 📄 License

MIT License. See [LICENSE](LICENSE) for details.
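For scripted use, the `curl` call to `/diagnose` shown in the REST API section can be mirrored from Python. The sketch below uses only the standard library and sends the same `description` payload. The helper names are hypothetical, and the response is assumed to be JSON, since the README does not document its shape.

```python
import json
import urllib.request


def build_diagnose_request(
    description: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the same POST /diagnose request as the curl example."""
    payload = json.dumps({"description": description}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/diagnose",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def diagnose(
    description: str, base_url: str = "http://localhost:8000", timeout: float = 30.0
):
    """Send the request and decode the body (assumed, not confirmed, to be JSON)."""
    request = build_diagnose_request(description, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API server running on port 8000, calling `diagnose("Kubernetes pods in CrashLoopBackOff after config update")` should return whatever the endpoint responds with. Building the request separately from sending it means the payload can be inspected without a live server.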

**If your production system has ever gone down twice for the same reason, this agent was built for you.** ⭐ If you find it useful, please star this repo!

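Tags: AI agent, API service, API integration, AV bypass, FastAPI, Hindsight, LLM, Python, Python3.10, SRE, Unmanaged PE, bias filtering, MTTR reduction, observability, backend development, vector database, overnight operations, open-source license, failure recovery, no backdoor, AIOps, model inference, production monitoring, automated alerting, operations automation, memory engine, reverse-engineering tools, learning from issues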