Wil1112002/Intelligent-SRE-Platform
GitHub: Wil1112002/Intelligent-SRE-Platform
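An end-to-end site reliability engineering platform built on AWS EKS. It combines automated infrastructure provisioning, full-stack observability, and AI-driven incident triage and response using the Claude API.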
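# Intelligent SRE Platform
An end-to-end site reliability engineering (SRE) platform on AWS EKS that demonstrates infrastructure provisioning, observability, AI-assisted incident response, and operational tooling.
## Architecture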
```mermaid
graph TB
    subgraph AWS
        subgraph VPC
            subgraph EKS Cluster
                subgraph ns-default ["default namespace"]
                    API[api-service]
                    Worker[worker-service]
                end
                subgraph ns-monitoring ["monitoring namespace"]
                    Prom[Prometheus]
                    Graf[Grafana]
                    Loki[Loki]
                    PT[Promtail]
                    AM[Alertmanager]
                end
                subgraph ns-sre-agent ["sre-agent namespace"]
                    Agent[AI Triage Agent]
                    DB[(SQLite)]
                end
                Ingress[NGINX Ingress]
            end
        end
        ECR[ECR]
        S3[S3 - Logs/State]
    end
    User[User / SRE] -->|srekit CLI| Agent
    User -->|browser| Graf
    API -->|metrics| Prom
    Worker -->|metrics| Prom
    PT -->|logs| Loki
    Prom -->|alerts| AM
    AM -->|webhook| Agent
    Agent -->|query metrics| Prom
    Agent -->|query logs| Loki
    Agent -->|triage| Claude[Claude API]
    Agent -->|remediate| API
    Agent -->|remediate| Worker
    Agent -->|store| DB
    Graf -->|datasource| Prom
    Graf -->|datasource| Loki
    Ingress --> API
    Ingress --> Worker
    Ingress --> Graf
```
## Prerequisites
- An AWS account with the AWS CLI configured
- Terraform >= 1.5
- kubectl
- Helm >= 3.12
- Python >= 3.12
- Docker
## Quick Start
```
# Clone the repo
git clone https://github.com/YOUR_USER/intelligent-sre-platform.git
cd intelligent-sre-platform

# Create an S3 bucket for Terraform state
aws s3 mb s3://intelligent-sre-platform-tfstate --region us-east-1

# Provision the EKS cluster
cd terraform
terraform init
terraform apply
cd ..

# Configure kubectl
aws eks update-kubeconfig --name intelligent-sre-dev --region us-east-1

# Deploy the observability stack
bash k8s/observability/install.sh

# Build service images and push them to ECR, or deploy locally (only the build commands are shown)
docker build -t api-service services/api-service/
docker build -t worker-service services/worker-service/
docker build -t ai-agent ai-agent/

# Deploy the services with Helm
helm upgrade --install api-service k8s/apps/api-service/ -n default
helm upgrade --install worker-service k8s/apps/worker-service/ -n default

# Create the API key secret and deploy the AI agent
kubectl create secret generic anthropic-api-key \
  --from-literal=api-key=YOUR_KEY -n sre-agent
helm upgrade --install ai-agent k8s/apps/ai-agent/ -n sre-agent

# Install the CLI
pip install -e srekit/

# Start generating traffic
bash services/load-generator.sh
```
## srekit CLI Usage
```
# Cluster health overview
srekit scan
srekit scan --namespace monitoring --output json
# Security and best-practice audit
srekit audit
srekit audit --namespace default --export report.md
# Incident management
srekit incident list
srekit incident list --severity critical --status open
srekit incident get 42
# AI-powered queries
srekit incident ask "what caused the most incidents this week?"
# Weekly report generation
srekit report
```
## Project Structure
```
.
├── terraform/                # EKS cluster, VPC, IAM (Terraform)
├── k8s/
│   ├── apps/                 # Helm charts: api-service, worker-service, ai-agent
│   ├── observability/        # Prometheus, Grafana, Loki Helm values + dashboards
│   └── alertmanager/         # AlertmanagerConfig routing
├── services/
│   ├── api-service/          # FastAPI sample service A
│   ├── worker-service/       # FastAPI sample service B
│   └── load-generator.sh     # Traffic generator script
├── ai-agent/                 # Alert webhook receiver + Claude triage logic
├── srekit/                   # Python CLI (Typer + Rich)
├── docs/
│   ├── architecture.md
│   └── runbooks/
└── .github/workflows/        # CI + deploy pipelines
```
## Observability
| Tool | Purpose | Access |
|------|---------|--------|
| Prometheus | Metrics collection and alerting | `kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090` |
| Grafana | Dashboards and visualization | `kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80` |
| Loki | Log aggregation | Through the Grafana data source |
| Alertmanager | Alert routing | `kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093` |
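
With the Prometheus port-forward from the table running, metrics can also be read programmatically through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch using only the Python standard library; the helper names (`build_query_url`, `parse_instant_vector`, `query`) are illustrative and are not taken from this repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Prometheus is reachable through the port-forward shown in the table.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


def parse_instant_vector(payload: dict) -> dict[str, float]:
    """Flatten an instant-vector response into {"label=value,...": sample}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    samples: dict[str, float] = {}
    for series in payload["data"]["result"]:
        key = ",".join(f"{k}={v}" for k, v in sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the sample value arrives as a string
        samples[key] = float(value)
    return samples


def query(promql: str, base_url: str = PROMETHEUS_URL) -> dict[str, float]:
    """Run an instant query over HTTP and return the parsed samples."""
    with urlopen(build_query_url(promql, base_url), timeout=10) as response:
        return parse_instant_vector(json.load(response))
```

Calling `query("up")` against a live cluster returns one entry per scraped target; the exact label sets depend on how the stack is configured.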
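### Preconfigured Dashboards
- **Service Overview**: request rate, error rate, p50/p95/p99 latency, worker memory
- **Infrastructure**: node CPU/memory/disk, pod count, restart count, network I/O
- **Incident View**: error rate and log panels side by side
### Alert Rules
| Alert | Condition | Severity |
|-------|-----------|----------|
| HighErrorRate | Error rate > 5% over 2 minutes | Warning |
| VeryHighErrorRate | Error rate > 15% over 2 minutes | Critical |
| HighLatency | p99 latency > 1s over 5 minutes | Warning |
| PodCrashLooping | More than 3 restarts in 10 minutes | Critical |
| MemoryLeak | Memory growth > 20% over 10 minutes | Warning |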
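
To make the thresholds in the table concrete, here is a small Python sketch that evaluates them against a snapshot of already-computed values. It only illustrates the conditions; the `ServiceSnapshot` type and `firing_alerts` function are hypothetical, and the rules in the cluster are evaluated by Prometheus rather than by code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSnapshot:
    """Pre-aggregated values; each window matches the table above."""
    error_rate_2m: float      # fraction of failed requests over 2 minutes (0.0 to 1.0)
    p99_latency_5m_s: float   # p99 latency over 5 minutes, in seconds
    restarts_10m: int         # pod restarts in the last 10 minutes
    memory_growth_10m: float  # fractional memory growth over 10 minutes


def firing_alerts(s: ServiceSnapshot) -> list[tuple[str, str]]:
    """Return (alert name, severity) for every rule whose threshold is exceeded."""
    rules = [
        ("HighErrorRate", "warning", s.error_rate_2m > 0.05),
        ("VeryHighErrorRate", "critical", s.error_rate_2m > 0.15),
        ("HighLatency", "warning", s.p99_latency_5m_s > 1.0),
        ("PodCrashLooping", "critical", s.restarts_10m > 3),
        ("MemoryLeak", "warning", s.memory_growth_10m > 0.20),
    ]
    return [(name, severity) for name, severity, breached in rules if breached]
```

Because the two error-rate rules are independent, a 20% error rate triggers both `HighErrorRate` and `VeryHighErrorRate` in this sketch; whether the warning is suppressed in practice depends on the Alertmanager routing, which is not described here.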
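## AI Triage Agent
When an alert fires, the AI agent:
1. Receives the webhook from Alertmanager
2. Queries Prometheus for the recent error rate, p99 latency, and pod restart count
3. Queries Loki for recent error logs
4. Sends the structured context to the Claude API for root-cause analysis
5. Saves the triage result to SQLite
6. Optionally performs safe remediation (rolling restart, scale-up)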
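
Steps 1 through 5 above can be sketched roughly as follows. This is not the agent's actual code: the `incidents` table schema and the function names are assumptions, and the Prometheus, Loki, and Claude calls are passed in as plain callables so the flow can be run without any network access. The optional remediation in step 6 is left out.

```python
import json
import sqlite3
from typing import Callable

Labels = dict[str, str]


def firing(webhook: dict) -> list[dict]:
    """Step 1: keep only the firing alerts from an Alertmanager webhook body."""
    return [a for a in webhook.get("alerts", []) if a.get("status") == "firing"]


def triage(
    webhook: dict,
    conn: sqlite3.Connection,
    fetch_metrics: Callable[[Labels], dict],    # stand-in for the Prometheus queries
    fetch_logs: Callable[[Labels], list[str]],  # stand-in for the Loki query
    analyze: Callable[[dict], str],             # stand-in for the Claude API call
) -> list[int]:
    """Run steps 1-5 for each firing alert and return the stored row ids."""
    # Hypothetical schema; the real agent's tables may look different.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incidents ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, alertname TEXT, severity TEXT, "
        "context TEXT, analysis TEXT)"
    )
    ids: list[int] = []
    for alert in firing(webhook):
        labels = alert.get("labels", {})
        context = {
            "labels": labels,
            "annotations": alert.get("annotations", {}),
            "metrics": fetch_metrics(labels),  # step 2
            "logs": fetch_logs(labels),        # step 3
        }
        analysis = analyze(context)            # step 4
        cursor = conn.execute(                 # step 5
            "INSERT INTO incidents (alertname, severity, context, analysis) "
            "VALUES (?, ?, ?, ?)",
            (labels.get("alertname"), labels.get("severity"),
             json.dumps(context), analysis),
        )
        ids.append(cursor.lastrowid)
    conn.commit()
    return ids
```

Injecting the three callables keeps the orchestration testable: a unit test can pass lambdas that return canned metrics, logs, and analysis text together with an in-memory SQLite connection.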
## License
MIT