Gaurishkumar/ai-incident-response-team

GitHub: Gaurishkumar/ai-incident-response-team

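The platform uses a multi-agent AI pipeline to automatically analyze production incident logs and metrics, helping engineers quickly pinpoint the root cause of a failure and get remediation recommendations.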

Stars: 0 | Forks: 0

# AI Incident Response & DevOps Copilot

An event-driven platform that uses a multi-agent AI pipeline to analyze production incidents automatically. When an engineer submits an incident with logs and metrics, the system runs three sequential AI agents (log analysis, root-cause identification, and remediation recommendations) and returns the structured analysis to the frontend in real time.

Built as a production-grade portfolio project, it demonstrates microservice architecture, asynchronous messaging, LangGraph agent orchestration, and end-to-end observability.

## Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                        BROWSER (Next.js)                         │
│               Tailwind CSS + ShadcnUI + TypeScript               │
└──────────────────────────┬───────────────────────────────────────┘
                           │ REST + WebSocket
┌──────────────────────────▼───────────────────────────────────────┐
│                       SPRING BOOT BACKEND                        │
│       Auth │ REST APIs │ Business Logic │ Event Publisher        │
│                                                                  │
│  Incident creation: atomic 3-table insert (incidents +           │
│  incident_metrics + incident_logs) - publishes only after        │
│  transaction commits (TransactionSynchronization.afterCommit)    │
│                                                                  │
│  PostgreSQL (read + write)          Redis (sessions + cache)     │
└──────────────────────────┬───────────────────────────────────────┘
                           │ AMQP: {message_id, incident_id}
┌──────────────────────────▼───────────────────────────────────────┐
│                             RABBITMQ                             │
│ incident.analysis.queue │ incident.results.queue │ incident.dlq  │
└──────────┬───────────────────────────────────────┬───────────────┘
           │ consume                               │ consume results
┌──────────▼──────────────────┐      ┌─────────────▼───────────────┐
│     FASTAPI AI SERVICE      │      │ SPRING BOOT result handler  │
│                             │      │                             │
│  1. Read incident from DB   │      │  Writes: agent_runs,        │
│     (read-only PG user)     │      │          incident_analysis, │
│  2. Build LangGraph state   │◄─────│          recommendations    │
│  3. Run 3-agent pipeline    │      │                             │
│  4. Publish results         │─────►│  Pushes WebSocket to UI     │
└──────────┬──────────────────┘      └─────────────────────────────┘
           │
┌──────────▼──────────────────────────────────┐
│             LANGGRAPH PIPELINE              │
│                                             │
│  Log Analysis Agent  ──► raw logs           │
│          │                                  │
│  Root Cause Agent    ──► logs + metrics     │──► Gemini 2.5 Flash
│          │                                  │
│  Recommendation Agent──► root cause         │
└─────────────────────────────────────────────┘
```

## Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 14 + TypeScript | SSR, routing, polling, WebSocket |
| UI | Tailwind CSS + ShadcnUI | Component library |
| Backend | Spring Boot 3.x (Java 21) | REST API, authentication, event publishing |
| AI Service | FastAPI (Python 3.11) | Agent orchestration, RabbitMQ consumer |
| Agent Framework | LangGraph | Stateful, sequential 3-agent pipeline |
| LLM | Gemini 2.5 Flash | Log analysis, root-cause analysis, recommendations |
| Message Broker | RabbitMQ 3 | Asynchronous reference-only messaging + DLQ |
| Database | PostgreSQL 16 | 7 normalized tables, Flyway migrations |
| Cache | Redis 7 | JWT sessions, result cache, stats cache |
| Observability | Prometheus + Grafana | 3 dashboards, 4 alert rules |
| Containers | Docker + Docker Compose | Orchestration of 8 services |

## Key Engineering Decisions

**Reference-only messaging**: Spring Boot publishes only `{incident_id}` to RabbitMQ, never the full payload. FastAPI fetches the data directly from PostgreSQL over a read-only connection. This keeps messages small and removes the risk of data going stale in transit.

**Publish after commit**: The RabbitMQ publish is registered with `TransactionSynchronizationManager` as a `TransactionSynchronization` whose `afterCommit()` callback runs only once the transaction has committed, so FastAPI never queries the database before the incident rows are visible. This removes the race condition in which the message arrives before the transaction has fully committed.

**Read-only database user**: FastAPI connects to PostgreSQL as `appreader`, a user that has only SELECT privileges, and only on three specific tables. At the database level it has no INSERT, UPDATE, or DELETE privileges.

**Pipeline resilience**: If one agent fails, the pipeline keeps going. Results are published with `analysis_status: PARTIAL`, and the frontend shows the output of the agents that succeeded.

**At-least-once delivery**: FastAPI acknowledges a RabbitMQ message only after it has successfully published the results back. A `@Scheduled` retry job in Spring Boot republishes any incident that has been stuck in PENDING for more than 3 minutes.

## Features

**Incident submission**
- 10-field structured form (including title, severity, environment, affected services, logs, and metrics)
- Client-side and server-side validation with field-level error messages
- P1 / P2 / P3 / P4 severity classification

**AI analysis pipeline** (about 25–60 seconds end to end)
- Log Analysis Agent: identifies errors, anomalies, and recurring patterns
- Root Cause Agent: synthesizes logs and metrics into a causal chain with a confidence score
- Recommendation Agent: generates 3–5 prioritized remediation actions, each assigned to a specific team

**Real-time updates**
- Agent execution timeline showing each agent's status, start time, and duration
- WebSocket push when analysis completes; polling every 3 seconds as a fallback

**Dashboard**
- Summary cards: total incidents, active incidents, resolved today, average analysis duration
- Sortable, filterable incident table with severity badges and status indicators

**Observability**
- Grafana dashboards: Incident Overview, Agent Performance, System Health
- Prometheus metrics for per-agent execution time, LLM inference duration, and pipeline status
- Alerts: high agent failure rate, DLQ backlog, database connection pool saturation, slow API responses

## Prerequisites

| Tool | Version |
|---|---|
| Java (Amazon Corretto recommended) | 21 |
| Maven | 3.9+ |
| Node.js (LTS) | 20 |
| Python | 3.11 |
| Docker Desktop | Latest |

## Running Locally

**1. Clone and configure the environment files**

```
git clone https://github.com/Gaurishkumar/ai-incident-response-team.git
cd ai-incident-response-team
```

Copy the example files and fill in your values:

```
cp infrastructure/.env.example infrastructure/.env
cp backend/.env.example backend/.env
cp ai-service/.env.example ai-service/.env
cp frontend/.env.local.example frontend/.env.local
```

Values to set:
- `infrastructure/.env`: set the passwords for PostgreSQL, RabbitMQ, and Grafana
- `backend/.env`: set `JWT_SECRET` (generate one with `openssl rand -hex 32`) and use the same passwords as above
- `ai-service/.env`: set `GEMINI_API_KEY` (available for free at [aistudio.google.com](https://aistudio.google.com)) and use the same passwords as above

**2. Build the service images**

```
cd infrastructure
docker compose build
```

**3. Start the infrastructure first**

```
docker compose up postgres redis rabbitmq -d
```

Wait until all three services are healthy (check with `docker compose ps`).

**4. Start the application services**

```
docker compose up spring-boot fastapi frontend -d
```

Wait until Spring Boot logs `Started BackendApplication`. On first startup, Flyway creates all 7 database tables automatically.

**5. Start the monitoring stack**

```
docker compose up prometheus grafana -d
```

**6. Open the application**

| Service | URL | Credentials |
|---|---|---|
| Frontend | http://localhost:3000 | Register a new account |
| Grafana | http://localhost:3001 | Set in `infrastructure/.env` |
| RabbitMQ Management | http://localhost:15672 | Set in `infrastructure/.env` |
| Prometheus | http://localhost:9090 | No authentication required |
| Spring Boot API | http://localhost:8080 | n/a |
| FastAPI | http://localhost:8000/health | n/a |

## Project Structure

```
/
├── frontend/                  Next.js 14 application (4 pages, 33 components)
├── backend/                   Spring Boot 3.x (Java 21, Maven)
│   └── src/main/resources/db/migration/   Flyway SQL migrations
├── ai-service/                FastAPI + LangGraph (Python 3.11)
│   └── app/
│       ├── agents/            log_analysis, root_cause, recommendation
│       ├── pipeline/          LangGraph graph definition
│       └── services/          gemini, rabbitmq, database
└── infrastructure/
    ├── docker-compose.yml
    ├── db/init.sql            PostgreSQL user setup
    ├── prometheus/            Scrape config + alert rules
    └── grafana/               Dashboard JSON + provisioning
```

## Database Schema

7 normalized tables across two ownership domains:

| Table | Owner | Purpose |
|---|---|---|
| `users` | Spring Boot | Authentication and identity |
| `incidents` | Spring Boot | Core incident metadata |
| `incident_metrics` | Spring Boot | CPU, memory, error rate, response time |
| `incident_logs` | Spring Boot | Raw log content (kept separate so incident queries stay lean) |
| `agent_runs` | Spring Boot (via the result handler) | Execution history for each of the 3 agents |
| `incident_analysis` | Spring Boot (via the result handler) | Root-cause analysis results |
| `recommendations` | Spring Boot (via the result handler) | Ranked remediation actions |

FastAPI has SELECT-only access to `incidents`, `incident_metrics`, and `incident_logs`. It cannot read or write any other table.

## API Overview

All endpoints except the auth routes require `Authorization: Bearer {token}`.

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/auth/register` | Create an account |
| POST | `/api/v1/auth/login` | Obtain a JWT token |
| POST | `/api/v1/auth/logout` | Invalidate the session |
| POST | `/api/v1/incidents` | Submit a new incident |
| GET | `/api/v1/incidents` | List incidents (supports pagination and filtering) |
| GET | `/api/v1/incidents/{id}` | Full incident details, including the analysis |
| GET | `/api/v1/incidents/{id}/status` | Lightweight polling endpoint (2 queries) |
| GET | `/api/v1/dashboard/stats` | Aggregate statistics (Redis-cached, 2-minute TTL) |
| WS | `/ws/incidents/{id}` | Push notification when analysis completes |

## Grafana Dashboards

**Incident Overview**: total incidents over time, average analysis duration, resolution rate, HTTP request rate broken down by endpoint

**Agent Performance**: per-agent execution time (p50/p95/p99), LLM inference duration, total agent failures, active pipeline count

**System Health**: database connection pool (active/idle/waiting), RabbitMQ queue depth, API error rate, DLQ depth

## Features Deferred to V2

Per the project specification, the following were intentionally left out of V1 scope:

- Jira ticket creation
- Slack / PagerDuty notifications
- CloudWatch / Datadog log ingestion
- Prometheus alert ingestion
- Settings UI for LLM provider configuration
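The pipeline resilience behavior described under the key engineering decisions (a failed agent does not stop the run, and the result is published as `PARTIAL`) can be illustrated with a minimal plain-Python sketch. It does not use the LangGraph API and is not the project's actual code: `run_pipeline`, `PipelineResult`, the demo agents, and the `COMPLETED` / `FAILED` / `SUCCESS` status names are all assumptions made for illustration. Only `PARTIAL` is named in the README.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
# Hypothetical agent signature: receives the shared state, returns keys to merge into it.
Agent = Callable[[State], State]


@dataclass
class PipelineResult:
    analysis_status: str  # "COMPLETED", "PARTIAL", or "FAILED" (only PARTIAL comes from the README)
    state: State
    agent_runs: List[Dict[str, str]] = field(default_factory=list)


def run_pipeline(state: State, agents: List[Tuple[str, Agent]]) -> PipelineResult:
    """Run the agents in order; a failing agent is recorded and the next one still runs."""
    runs: List[Dict[str, str]] = []
    for name, agent in agents:
        try:
            state = {**state, **agent(state)}
            runs.append({"agent": name, "status": "SUCCESS"})
        except Exception as exc:  # deliberately broad: no agent error may abort the pipeline
            runs.append({"agent": name, "status": "FAILED", "error": str(exc)})

    succeeded = sum(1 for run in runs if run["status"] == "SUCCESS")
    if succeeded == len(runs):
        status = "COMPLETED"
    elif succeeded == 0:
        status = "FAILED"
    else:
        status = "PARTIAL"
    return PipelineResult(status, state, runs)


# Demo agents standing in for the three LLM-backed agents.
def log_analysis(state: State) -> State:
    return {"log_findings": [line for line in state["logs"].splitlines() if "ERROR" in line]}


def root_cause(state: State) -> State:
    raise TimeoutError("LLM call timed out")  # simulated failure


def recommendation(state: State) -> State:
    cause = state.get("root_cause", "unknown")  # tolerate a missing upstream result
    return {"recommendations": [f"Investigate {cause} root cause"]}


if __name__ == "__main__":
    result = run_pipeline(
        {"logs": "ERROR db down\nINFO ok"},
        [("log_analysis", log_analysis), ("root_cause", root_cause), ("recommendation", recommendation)],
    )
    print(result.analysis_status, result.agent_runs)
```

In this sketch a downstream agent has to cope with a missing upstream key (here via `state.get`), which is the price of continuing after a failure.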

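The at-least-once delivery rule (acknowledge the RabbitMQ message only after the results have been published back) can be sketched as a small handler. This is an illustration under stated assumptions, not the project's consumer: `handle_delivery` and its `analyze`, `publish_result`, `ack`, and `nack` callables are hypothetical stand-ins for the pipeline and the broker client, injected so the ordering is easy to see and test.

```python
import json
from typing import Any, Callable, Dict


def handle_delivery(
    body: bytes,
    analyze: Callable[[int], Dict[str, Any]],          # hypothetical: runs the agent pipeline
    publish_result: Callable[[Dict[str, Any]], None],  # hypothetical: publishes to the results queue
    ack: Callable[[], None],                           # hypothetical: acknowledges this delivery
    nack: Callable[[], None],                          # hypothetical: rejects this delivery
) -> bool:
    """Process one reference-only message; ack only after the result was published."""
    try:
        incident_id = json.loads(body)["incident_id"]
        result = analyze(incident_id)
        publish_result({"incident_id": incident_id, **result})
    except Exception:
        # Anything failing before the publish completes leaves the message
        # unacknowledged, so the broker can redeliver or dead-letter it.
        nack()
        return False
    ack()  # reached only when publish_result returned without raising
    return True


if __name__ == "__main__":
    events = []
    handle_delivery(
        b'{"incident_id": 7}',
        analyze=lambda incident_id: {"analysis_status": "PARTIAL"},
        publish_result=events.append,
        ack=lambda: events.append("ack"),
        nack=lambda: events.append("nack"),
    )
    print(events)
```

Because a crash between the publish and the ack leads to redelivery, the same result may be published more than once, so the consumer of the results queue would need to handle duplicates (for example by keying writes on the incident ID).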
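Tags: AIOps, JS file enumeration, LangGraph, Spring Boot, microservice architecture, search engine queries, intelligent assistant, test cases, automated attacks, custom request headers, request interception, reverse-engineering tools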