shubhamaher8/OpsAgent

GitHub: shubhamaher8/OpsAgent

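An AI-based autonomous incident response agent. After detecting a production failure it diagnoses the root cause, applies a fix, and notifies Slack, reducing on-call involvement for common alerts.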

Stars: 0 | Forks: 0

# 🚨 AI Incident Response Agent

An autonomous agent that detects ECS service incidents, diagnoses the root cause with AI, fixes the problem automatically, and sends an RCA report to Slack.

## 🔄 How It Works

```mermaid
flowchart LR
    A["🔔 CloudWatch Alarm"] --> B["📨 SNS Topic"]
    B --> C["⚡ Lambda Agent"]
    C --> D["📊 Fetch Logs & Metrics"]
    D --> E["🧠 AI Diagnosis"]
    E --> F{"Confidence > 0.75?"}
    F -->|Yes| G["🛠️ Auto Fix<br/>Restart / Scale"]
    F -->|No| H["📢 Escalate to Human"]
    G --> I["💬 Slack Report"]
    H --> I
    style A fill:#ff6b6b,stroke:#333,color:#fff
    style C fill:#4ecdc4,stroke:#333,color:#fff
    style E fill:#ffe66d,stroke:#333,color:#000
    style G fill:#95e1d3,stroke:#333,color:#000
    style I fill:#a8d8ea,stroke:#333,color:#000
```

## 📁 Project Structure

```
ai-incident-agent/
├── lambda_handler.py   # Lambda entry point: parses SNS events
├── agent.py            # Core logic: diagnose, act, report
├── tools.py            # AWS integrations (logs, metrics, ECS)
├── prompts.py          # LLM prompt templates
├── slack.py            # Slack notifications
├── Dockerfile          # Lambda container image
├── requirements.txt    # Python dependencies
└── .env                # Environment variables (not committed)
```

## ⚡ Quick Start

### 1. Clone and Set Up

```bash
git clone https://github.com/shubhamaher8/OpsAgent.git
cd OpsAgent/ai-incident-agent
cp .env.example .env
# Edit .env and fill in your API keys
```

### 2. Environment Variables

```
GOOGLE_API_KEY=your-gemini-key    # From Google AI Studio
SLACK_WEBHOOK_URL=https://hooks.slack.com/...
AWS_REGION=us-east-1
ECS_CLUSTER=your-cluster-name
LOG_GROUP_PREFIX=/ecs/
```

### 3. Build and Deploy

```bash
# Build
docker build --platform linux/amd64 --provenance=false --sbom=false -t ai-incident-agent .

# Push to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com
docker tag ai-incident-agent:latest <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/ai-incident-agent:latest
docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/ai-incident-agent:latest
```

Then create a Lambda function from the ECR image in the AWS Console.

## 📱 Slack Notifications

**✅ Auto-resolved:**

```
🚨 Incident: payment-service | P2
Root Cause: Memory leak from unclosed DB connection
Action: restart (confidence: 0.91)
Status: ✅ Resolved
```

**⚠️ Escalated:**

```
🚨 Incident: payment-service | P2
Root Cause: Intermittent network timeout
Action: None — confidence too low (0.62)
Status: ⚠️ Escalated to on-call engineer
```

## 🔒 Safety

- Only the `restart` and `scale` actions are allowed
- Confidence must be > 0.75 for an action to run automatically
- Low confidence → escalate to a human
- Every failure → Slack notification

## Author

**Shubham Aher**

## License

MIT License. See the [LICENSE](LICENSE) file.
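
## 🧩 Example: Decision Gate (Illustrative)

The flow and safety rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `agent.py` or `lambda_handler.py`: the `Diagnosis` dataclass and the `parse_alarm` and `decide` functions are hypothetical names introduced here. The sketch assumes the Lambda receives a standard SNS event whose `Message` field holds the CloudWatch alarm as a JSON string. The allowed actions and the 0.75 threshold come from the Safety section.

```python
import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"restart", "scale"}  # allow-list from the Safety section
CONFIDENCE_THRESHOLD = 0.75             # auto-fix only when strictly above this


@dataclass
class Diagnosis:
    """Hypothetical shape of the AI diagnosis; the real agent may differ."""
    root_cause: str
    action: str
    confidence: float


def parse_alarm(event: dict) -> dict:
    """Extract the CloudWatch alarm payload from an SNS-triggered Lambda event."""
    # SNS delivers the alarm as a JSON string in Records[0].Sns.Message.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    return {
        "alarm_name": alarm.get("AlarmName"),
        "state": alarm.get("NewStateValue"),
        "reason": alarm.get("NewStateReason"),
    }


def decide(diagnosis: Diagnosis) -> str:
    """Return 'auto_fix' or 'escalate' according to the safety rules."""
    if diagnosis.action not in ALLOWED_ACTIONS:
        return "escalate"  # never run an action outside the allow-list
    if diagnosis.confidence > CONFIDENCE_THRESHOLD:
        return "auto_fix"
    return "escalate"      # low confidence goes to a human
```

With the two Slack examples above, `decide(Diagnosis("Memory leak", "restart", 0.91))` would take the auto-fix path, while a confidence of 0.62 would escalate. A confidence of exactly 0.75 also escalates, because the rule is a strict inequality.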

Tags: AIOps, AWS, DPI, Serverless, modular design, automated remediation, request interception, operations, reverse-engineering tools