Wil1112002/Intelligent-SRE-Platform

GitHub: Wil1112002/Intelligent-SRE-Platform

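An end-to-end site reliability engineering platform on AWS EKS that integrates automated infrastructure provisioning, full-stack observability, and AI-driven incident triage and response built on the Claude API.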


# Intelligent SRE Platform

An end-to-end site reliability engineering (SRE) platform on AWS EKS that demonstrates infrastructure provisioning, observability, AI-assisted incident response, and operational tooling.

## Architecture

```mermaid
graph TB
    subgraph AWS
        subgraph VPC
            subgraph EKS Cluster
                subgraph ns-default ["default namespace"]
                    API[api-service]
                    Worker[worker-service]
                end
                subgraph ns-monitoring ["monitoring namespace"]
                    Prom[Prometheus]
                    Graf[Grafana]
                    Loki[Loki]
                    PT[Promtail]
                    AM[Alertmanager]
                end
                subgraph ns-sre-agent ["sre-agent namespace"]
                    Agent[AI Triage Agent]
                    DB[(SQLite)]
                end
                Ingress[NGINX Ingress]
            end
        end
        ECR[ECR]
        S3[S3 - Logs/State]
    end

    User[User / SRE] -->|srekit CLI| Agent
    User -->|browser| Graf
    API -->|metrics| Prom
    Worker -->|metrics| Prom
    PT -->|logs| Loki
    Prom -->|alerts| AM
    AM -->|webhook| Agent
    Agent -->|query metrics| Prom
    Agent -->|query logs| Loki
    Agent -->|triage| Claude[Claude API]
    Agent -->|remediate| API
    Agent -->|remediate| Worker
    Agent -->|store| DB
    Graf -->|datasource| Prom
    Graf -->|datasource| Loki
    Ingress --> API
    Ingress --> Worker
    Ingress --> Graf
```

## Prerequisites

- An AWS account with the AWS CLI configured
- Terraform >= 1.5
- kubectl
- Helm >= 3.12
- Python >= 3.12
- Docker

## Quick Start

```
# Clone the repo
git clone https://github.com/YOUR_USER/intelligent-sre-platform.git
cd intelligent-sre-platform

# Create an S3 bucket for Terraform state
aws s3 mb s3://intelligent-sre-platform-tfstate --region us-east-1

# Provision the EKS cluster
cd terraform
terraform init
terraform apply
cd ..

# Configure kubectl
aws eks update-kubeconfig --name intelligent-sre-dev --region us-east-1

# Deploy the observability stack
bash k8s/observability/install.sh

# Build the service images (push them to ECR, or deploy locally)
docker build -t api-service services/api-service/
docker build -t worker-service services/worker-service/
docker build -t ai-agent ai-agent/

# Deploy the services with Helm
helm upgrade --install api-service k8s/apps/api-service/ -n default
helm upgrade --install worker-service k8s/apps/worker-service/ -n default

# Create the API key secret and deploy the AI agent
kubectl create secret generic anthropic-api-key \
  --from-literal=api-key=YOUR_KEY -n sre-agent
helm upgrade --install ai-agent k8s/apps/ai-agent/ -n sre-agent

# Install the CLI
pip install -e srekit/

# Start generating traffic
bash services/load-generator.sh
```

## srekit CLI Usage

```
# Cluster health overview
srekit scan
srekit scan --namespace monitoring --output json

# Security and best-practice audit
srekit audit
srekit audit --namespace default --export report.md

# Incident management
srekit incident list
srekit incident list --severity critical --status open
srekit incident get 42

# AI-powered queries
srekit incident ask "what caused the most incidents this week?"

# Weekly report generation
srekit report
```

## Project Structure

```
.
├── terraform/              # EKS cluster, VPC, IAM (Terraform)
├── k8s/
│   ├── apps/               # Helm charts: api-service, worker-service, ai-agent
│   ├── observability/      # Prometheus, Grafana, Loki Helm values + dashboards
│   └── alertmanager/       # AlertmanagerConfig routing
├── services/
│   ├── api-service/        # FastAPI sample service A
│   ├── worker-service/     # FastAPI sample service B
│   └── load-generator.sh   # Traffic generator script
├── ai-agent/               # Alert webhook receiver + Claude triage logic
├── srekit/                 # Python CLI (Typer + Rich)
├── docs/
│   ├── architecture.md
│   └── runbooks/
└── .github/workflows/      # CI + deploy pipelines
```

## Observability

| Tool | Purpose | Access |
|------|---------|--------|
| Prometheus | Metrics collection + alerting | `kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090` |
| Grafana | Dashboards + visualization | `kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80` |
| Loki | Log aggregation | Via the Grafana data source |
| Alertmanager | Alert routing | `kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093` |

### Preconfigured Dashboards

- **Service Overview**: request rate, error rate, p50/p95/p99 latency, worker memory
- **Infrastructure**: node CPU/memory/disk, pod count, restart count, network I/O
- **Incident View**: error rate and log panels side by side

### Alert Rules

| Alert | Condition | Severity |
|-------|-----------|----------|
| HighErrorRate | Error rate > 5% over 2 minutes | Warning |
| VeryHighErrorRate | Error rate > 15% over 2 minutes | Critical |
| HighLatency | p99 latency > 1s over 5 minutes | Warning |
| PodCrashLooping | More than 3 restarts in 10 minutes | Critical |
| MemoryLeak | Memory growth > 20% over 10 minutes | Warning |

## AI Triage Agent

When an alert fires, the AI agent:

1. Receives the webhook from Alertmanager
2. Queries Prometheus for the recent error rate, p99 latency, and pod restart count
3. Queries Loki for recent error logs
4. Sends the structured context to the Claude API for root-cause analysis
5. Saves the triage result to SQLite
6. Optionally performs safe remediation (rolling restart, scale-up)

## License

MIT
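
## Sketch: Error-Rate Thresholds

To make the two error-rate rows of the alert rules table concrete, here is a minimal Python sketch of which alerts would fire for a given error rate. The thresholds come from the table. The function name `alerts_for_error_rate` and the rule list are hypothetical, and the repository's actual rules are presumably defined as Prometheus alerting rules rather than in Python.

```python
# Thresholds taken from the alert rules table; the names below are illustrative only.
ERROR_RATE_RULES = [
    ("HighErrorRate", 0.05, "warning"),
    ("VeryHighErrorRate", 0.15, "critical"),
]


def alerts_for_error_rate(errors: int, total: int) -> list[tuple[str, str]]:
    """Return (alert name, severity) for every rule whose threshold is exceeded.

    Rules are evaluated independently, so a very high error rate
    triggers both the warning and the critical alert.
    """
    if total <= 0:
        return []  # no traffic in the window, nothing to evaluate
    rate = errors / total
    return [(name, sev) for name, threshold, sev in ERROR_RATE_RULES if rate > threshold]
```

For example, `alerts_for_error_rate(20, 100)` matches both rules, while a rate of exactly 5% matches neither because the comparison is strict.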
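
## Sketch: Triage Flow

Steps 1 through 5 of the AI triage agent can be sketched in plain Python as below. This is an illustration under stated assumptions, not the repository's implementation: the function names, the `incidents` table schema, and the three injected callables (`fetch_metrics`, `fetch_logs`, `analyze`) are all hypothetical stand-ins for the Prometheus query, the Loki query, and the Claude API call. The only external format relied on is the Alertmanager webhook payload, which carries an `alerts` list whose entries have `status` and `labels` fields. Remediation (step 6) is left out.

```python
import json
import sqlite3
from typing import Callable


def firing_from_webhook(payload: dict) -> list[dict]:
    """Step 1: keep only the firing alerts from an Alertmanager webhook payload."""
    return [a for a in payload.get("alerts", []) if a.get("status") == "firing"]


def build_context(alert: dict, metrics: dict, logs: list[str]) -> dict:
    """Combine alert labels, metrics, and logs into one structured context."""
    labels = alert.get("labels", {})
    return {
        "alert": labels.get("alertname", "unknown"),
        "severity": labels.get("severity", "unknown"),
        "metrics": metrics,
        "recent_error_logs": logs[-20:],  # cap the log excerpt (arbitrary limit)
    }


def triage(
    payload: dict,
    fetch_metrics: Callable[[dict], dict],    # hypothetical: wraps a Prometheus query
    fetch_logs: Callable[[dict], list[str]],  # hypothetical: wraps a Loki query
    analyze: Callable[[dict], str],           # hypothetical: wraps the Claude API call
    conn: sqlite3.Connection,
) -> list[int]:
    """Run steps 1-5 for each firing alert and return the stored row ids."""
    # Hypothetical schema; the real agent's table layout is not documented here.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incidents ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, alert TEXT NOT NULL, "
        "severity TEXT NOT NULL, context TEXT NOT NULL, analysis TEXT NOT NULL)"
    )
    ids = []
    for alert in firing_from_webhook(payload):
        context = build_context(alert, fetch_metrics(alert), fetch_logs(alert))  # steps 2-3
        analysis = analyze(context)  # step 4
        cur = conn.execute(  # step 5
            "INSERT INTO incidents (alert, severity, context, analysis) VALUES (?, ?, ?, ?)",
            (context["alert"], context["severity"], json.dumps(context), analysis),
        )
        ids.append(cur.lastrowid)
    conn.commit()
    return ids
```

Passing the three integrations in as callables keeps the flow testable without a cluster: stub functions and an in-memory database (`sqlite3.connect(":memory:")`) are enough to exercise it end to end.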