Perun-Engineering/sre-on-call

GitHub: Perun-Engineering/sre-on-call

A multi-agent system built on AWS Bedrock AgentCore that automatically investigates infrastructure alerts in Slack/Discord and produces structured incident analysis reports.

Stars: 3 | Forks: 0

# sre-on-call

A multi-agent bot that automatically investigates infrastructure alerts across Slack and Discord. When an alert is posted to a channel, the system acknowledges it, fans the investigation out in parallel to specialized agents, and posts a structured incident analysis as a threaded reply.

## Architecture

```
Slack/Discord webhook
        │
        ▼
  Lambda Adapter ── (signature verify, dedup, classify, ack) ──▶ chat platform
        │
        │ bedrock-agentcore.invoke_agent_runtime (JSON-RPC 2.0 / A2A)
        ▼
  Master Agent ── investigate_alert tool ──▶ Orchestrator
                                                  │
                           ┌──────────────────────┼──────────────────────┐
                           ▼                      ▼                      ▼
                 Slack Scanner   Discord Scanner   CloudWatch Logs   EKS
                 (4 specialized agents, configured via config.yaml)
```

- **Lambda Adapter**: receives Slack/Discord webhooks, verifies signatures, deduplicates via DynamoDB, **classifies the mention** (suppressing non-alert chatter before dispatch; see below), then invokes the Master Agent runtime.
- **Master Agent**: orchestrates the investigation by fanning out to the specialized agents, enforcing deadlines, and posting the incident report. See [`agents/master/README.md`](agents/master/README.md).
- **Specialized agents**: one per data source. Each has its own README:
  - [Slack Scanner](agents/slack_scanner/README.md): Slack channel history correlation.
  - [Discord Scanner](agents/discord_scanner/README.md): Discord channel history correlation.
  - [CloudWatch Logs](agents/cloudwatch_logs/README.md): Logs Insights queries.
  - [EKS](agents/eks/README.md): Kubernetes cluster state.

A Prometheus agent (`agents/prometheus/`) is committed but **not deployed and not part of the orchestrator's fan-out**: it is not listed in `config.yaml` and has no Terraform configuration.

For the full documentation index (deployment, testing, architecture, design specs), see [`docs/README.md`](docs/README.md).

All agents run on AWS Bedrock AgentCore Runtime, communicate over the A2A protocol (JSON-RPC 2.0 on the runtime's `/invocations` endpoint), and use the Strands Agents SDK with Claude Haiku 4.5. Each agent's tool surface is described declaratively: [`config.yaml`](config.yaml) lists the skills and MCP servers enabled for each agent, each skill is a `SKILL.md` bundle under `agents//skills//`, and `shared.a2a_factory` is the single entry point that loads the config, resolves skills, opens MCP connections, and starts the A2A server.

## Project structure

```
├── config.yaml                       # Per-agent skills + MCP servers (single source of truth)
├── lambda_adapter/                   # Lambda webhook ingestion
│   ├── handler.py                    # Lambda entry point
│   ├── intake.py                     # Dedup + classification gate + master agent invocation
│   ├── classifier.py                 # Alert-vs-chatter classification (heuristics + optional LLM)
│   └── dedup.py                      # DynamoDB deduplication store
├── agents/
│   ├── master/                       # Master orchestration agent
│   │   ├── tools.py                  # investigate_alert (single tool, fire-and-forget)
│   │   ├── orchestrator.py           # InvestigationOrchestrator: fan-out + deadlines
│   │   ├── report_formatter.py       # Incident report assembly
│   │   ├── skills//SKILL.md          # Skill bundles (frontmatter -> tool symbol)
│   │   ├── tests/test_tools.py       # Per-agent unit tests
│   │   └── agent_card.json
│   ├── slack_scanner/                # tools.py + skills/ + tests/ + agent_card.json
│   ├── discord_scanner/              # same layout
│   ├── cloudwatch_logs/              # same layout (also wires the aws_docs MCP)
│   ├── eks/                          # same layout (network_mode: VPC)
│   └── prometheus/                   # Not deployed; not in config.yaml
├── shared/                           # Cross-agent utilities
│   ├── models.py                     # AlertContext, AgentResult, Finding, AgentFailure, AgentMetadata, CommandRequest
│   ├── constants.py
│   ├── a2a_factory.py                # Loads config + skills + MCPs; A2AServer + uvicorn + /ping
│   ├── a2a_protocol.py               # JSON-RPC envelope build/extract helpers
│   ├── agent_telemetry.py            # Per-agent metadata footer (model, tokens, cost)
│   ├── config.py                     # ProjectConfig (Pydantic) + loader for config.yaml
│   ├── skill_loader.py               # SKILL.md parser + tool-symbol resolver
│   ├── mcp_loader.py                 # Context-managed MCPConnections handle
│   ├── platforms/                    # ChatPlatform per chat platform (Slack, Discord)
│   │   ├── __init__.py               # Protocol, WebhookEvent tagged union, deliver_with_retry, registry
│   │   ├── slack.py                  # SlackChatPlatform: signature, parse, ack, deliver
│   │   └── discord.py                # DiscordChatPlatform: signature, parse, ack, deliver
│   ├── channel_scan.py               # Shared channel-scanning algorithm
│   ├── channel_utils.py
│   ├── report_renderer.py            # MarkupDialect-driven section renderer (Slack mrkdwn, Discord MD)
│   ├── secrets.py                    # Secrets Manager ARN -> plaintext resolver (cached)
│   ├── time_utils.py                 # Investigation window + ISO timestamp helpers
│   ├── tool_result.py
│   ├── experiment.py
│   ├── experiment_store.py
│   ├── experiment_results_store.py
│   └── trace_store.py                # S3 + DDB per-investigation trace archive (fail-open)
├── tests/                            # Cross-cutting / shared unit tests
│   ├── integration/                  # Handler, orchestrator, A2A factory, synthetic webhook
│   └── property/                     # Hypothesis property-based tests
├── modules/sre-on-call/              # Reusable module (no provider/backend)
│   ├── versions.tf                   # Provider requirements only
│   ├── variables.tf                  # Inputs (incl. config_path, source_root)
│   ├── networking.tf                 # EKS-VPC reference + agent SG
│   ├── ecr.tf                        # ECR repos for the 5 agent images
│   ├── dynamodb.tf                   # Dedup table
│   ├── dynamodb_experiments.tf       # A/B experiment tables
│   ├── secrets.tf                    # Slack/Discord secret containers
│   ├── lambda.tf                     # Lambda function + URL
│   ├── iam.tf                        # Lambda + agent IAM roles
│   ├── iam_agentcore.tf              # AgentCore-specific IAM
│   ├── agentcore.tf                  # 5 aws_bedrockagentcore_agent_runtime resources
│   ├── traces.tf                     # S3 trace bucket + DDB index + KMS CMK + IAM grants
│   └── observability.tf              # CloudWatch alarms + SNS topic for AgentCore
├── examples/complete/                # Reference root: provider + backend + module call
│   ├── main.tf                       # provider + module "sre_on_call"
│   ├── outputs.tf                    # Re-exports module outputs
│   └── moved.tf                      # State re-keying for the old flat root
├── scripts/
│   ├── build_and_push_agents.sh      # Build 5 linux/arm64 images and push to ECR
│   ├── hydrate_secrets.sh            # Push Slack/Discord secret values
│   ├── enable_observability.sh       # One-time CloudWatch Transaction Search enablement
│   └── synthetic_slack_webhook.py    # Send a signed synthetic alert to the Lambda URL
├── docs/
│   ├── README.md                     # Docs index
│   ├── deployment.md                 # Build, deploy, scoped testing
│   ├── testing.md                    # Synthetic + real Slack alert procedures
│   ├── architecture.d2               # Source for architecture.svg
│   ├── architecture.svg
│   ├── icons/                        # AWS + vendor icons used by the diagram
│   └── superpowers/                  # Living design specs and implementation plans
├── CONTEXT.md                        # Domain vocabulary
└── pyproject.toml
```

## Prerequisites

- **Python 3.12+**
- **Terraform >= 1.5** (only for infrastructure deployment)
- **Docker buildx** with `linux/arm64` support (the AgentCore runtime requires arm64)
- **AWS CLI** with SSO or static credentials for the target account

## Setup

```
git clone
cd sre-on-call
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
python -c "from shared.models import AlertContext; print('OK')"
```

## Running tests

```
pytest                                          # full suite
pytest -v                                       # verbose
pytest agents/eks/tests/test_tools.py           # one agent's unit tests
pytest tests/integration/test_orchestrator.py   # one integration test
pytest tests/property/                          # property-based tests only
```

Current count: **582 collected**, 582 passing. (The Prometheus tests still run and pass even though that agent is not deployed.)

### Test layout

- `agents//tests/test_tools.py`: per-agent unit tests for the tool surface.
- `tests/`: cross-cutting unit tests for shared modules (config, skill loader, MCP loader, channel utils, telemetry, dedup, parser, signature, time utils, report formatter, A/B experiments, postmortem command).
- `tests/integration/`: master orchestrator, Lambda handler, A2A factory, and the synthetic-webhook signature round trip.
- `tests/property/`: Hypothesis property-based tests for the parser, signature verifier, dedup, time utils, channel utils, report formatter, and CloudWatch Logs query helpers.

## Infrastructure deployment

See **[docs/deployment.md](docs/deployment.md)** for the full flow (image build → ECR repos → terraform apply → secret hydration). In short:

1. Configure your AWS profile and the required Terraform variables.
2. Run `terraform apply -target=module.sre_on_call.aws_ecr_repository.agents` to create the repositories.
3. Run `./scripts/build_and_push_agents.sh ` to build and push the 5 agent images.
4. Run `terraform apply -var "agent_image_tag="` to deploy everything else.
5. Run `./scripts/hydrate_secrets.sh` to populate the Slack/Discord secret values.

### Required Terraform variables

```
# examples/complete/terraform.tfvars
eks_cluster_name         = "eks-uat"   # existing cluster the EKS agent inspects
agent_container_registry = ".dkr.ecr..amazonaws.com"
```

Optional:

| Variable | Default | Purpose |
|----------|---------|---------|
| `aws_region` | `us-east-1` | |
| `environment` | `dev` | Resource name prefix |
| `project_name` | `sre-on-call` | Resource name prefix |
| `agent_image_tag` | `latest` | Pin the image tag at apply time |
| `model_id` | `us.anthropic.claude-haiku-4-5-20251001-v1:0` | Bedrock model used by all agents |
| `lambda_memory_size` | `256` | |
| `lambda_timeout` | `30` | |

### Configuring Slack

See **[docs/testing.md](docs/testing.md)** for the full Slack App + bot setup. The short version:

1. Set the Event Subscriptions URL to the deployed Lambda function URL.
2. Subscribe to the **`app_mention`** bot event only. The intake [classification gate](#alert-classification-gate) suppresses obvious non-alert chatter, but subscribing to broader events still wastes classifier work, so keep the subscription to `app_mention` only.
3. Bot scopes: `app_mentions:read`, `chat:write`, `channels:history`.
4. Hydrate the secrets with the real Bot Token (`xoxb-…`) and Signing Secret.

## Alert classification gate

Not every `@bot` mention is an alert: a casual "thanks!" should not trigger a full agent fan-out. The Lambda intake layer classifies each new mention before dispatch:

- **Tier 1, heuristics** (`lambda_adapter/classifier.py`, pure and deterministic): scans for alert markers (severity keywords, Alertmanager/Grafana formatting, dashboard/console links) and, conversely, for obvious chatter (greetings, acknowledgements, bare mentions). High-confidence verdicts win at this stage.
- **Tier 2, LLM** (optional, `CLASSIFIER_LLM_ENABLED=true`): a single Bedrock Converse turn (Haiku by default) decides the messages Tier 1 cannot settle.
- **Manual override**: include the word **`investigate`** in the mention to force an investigation regardless of the classification result.

The gate is **fail-open**: an ambiguous message, a disabled or failing LLM, or any unexpected error defaults to *investigate*, so a real page is never silently dropped. A mention blocked by the gate receives a one-line in-thread notice instead of a fan-out. Disable the whole gate with `ALERT_CLASSIFICATION_ENABLED=false` (Terraform: `enable_alert_classification = false`).

## Testing

See **[docs/testing.md](docs/testing.md)**. There are three modes:

- **Synthetic alert**: `scripts/synthetic_slack_webhook.py` builds a correctly signed `app_mention` payload and POSTs it to the Lambda URL. Good for quick smoke tests.
- **Real Slack alert**: invite the bot to a channel and trigger an end-to-end investigation with `@bot …`.
- **`/sre-snapshot` snapshot**: use the same script with `--command /sre-snapshot` (or run `/sre-snapshot` in any channel once the command is registered in Slack). It posts a top-level snapshot of cluster state, the top log groups by ingestion, and chat platform reachability.

## Environment variables

Terraform sets these on the Lambda function and the AgentCore runtimes; you normally do not need to set them by hand.

| Variable | Component | Description |
|----------|-----------|-------------|
| `SLACK_SIGNING_SECRET` | Lambda | ARN of the Secrets Manager secret holding the Slack signing secret |
| `SLACK_BOT_TOKEN` | Lambda, Master, Slack Scanner | ARN of the Secrets Manager secret holding the Slack bot OAuth token |
| `DISCORD_PUBLIC_KEY` | Lambda | ARN of the Secrets Manager secret holding the Discord application public key |
| `DISCORD_BOT_TOKEN` | Lambda, Master, Discord Scanner | ARN of the Secrets Manager secret holding the Discord bot token |
| `DEDUP_TABLE_NAME` | Lambda | DynamoDB dedup table name |
| `ALERT_CLASSIFICATION_ENABLED` | Lambda | Gate non-alert mentions out of the fan-out (default `true`; kill-switch) |
| `CLASSIFIER_LLM_ENABLED` | Lambda | Enable the Tier 2 LLM classifier for ambiguous mentions (default `false`) |
| `CLASSIFIER_MODEL_ID` | Lambda | Bedrock model for the Tier 2 classifier (falls back to `MODEL_ID`, then Haiku) |
| `EXPERIMENTS_TABLE_NAME` | Lambda | DynamoDB A/B experiment config table name |
| `MASTER_AGENT_RUNTIME_ARN` | Lambda | AgentCore runtime ARN of the master agent |
| `TRACES_BUCKET_NAME` | Lambda, Master | S3 bucket for the per-investigation trace archive (optional; tracing is disabled when unset) |
| `TRACES_TABLE_NAME` | Lambda, Master | DynamoDB index table for trace archive lookups |
| `SLACK_SCANNER_AGENT_RUNTIME_ARN` | Master | AgentCore runtime ARN of the Slack Scanner |
| `DISCORD_SCANNER_AGENT_RUNTIME_ARN` | Master | AgentCore runtime ARN of the Discord Scanner |
| `CLOUDWATCH_LOGS_AGENT_RUNTIME_ARN` | Master | AgentCore runtime ARN of CloudWatch Logs |
| `EKS_AGENT_RUNTIME_ARN` | Master | AgentCore runtime ARN of EKS |
| `MODEL_ID` | All agents | Bedrock model ID or cross-region inference profile |
| `EKS_CLUSTER_NAME` | EKS agent | Cluster the EKS agent inspects |
| `A2A_PORT` / `A2A_HOST` | All agents | A2A server bind port (9000) / host (0.0.0.0) |

For local development, where the agents run as plain HTTP A2A servers rather than on AgentCore, each `*_AGENT_RUNTIME_ARN` falls back to the matching `*_AGENT_URL` (for example `EKS_AGENT_URL=http://localhost:9005`). If unset, the orchestrator uses `localhost` defaults.
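
The fallback order described above can be sketched roughly as follows. This is an illustration only, not the orchestrator's actual code: the function name `resolve_agent_endpoint`, the `(kind, target)` return shape, and the `http://localhost:9000` default are assumptions (the README only states that a `localhost` default exists, and 9000 is borrowed from the documented `A2A_PORT`).

```python
import os
from typing import Mapping, Optional, Tuple

# Assumed placeholder: the README only says "localhost default", not which URL.
ASSUMED_LOCAL_DEFAULT = "http://localhost:9000"


def resolve_agent_endpoint(
    agent: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[str, str]:
    """Return ("arn", value) or ("url", value) for an agent such as "eks".

    Hypothetical helper mirroring the documented order:
    *_AGENT_RUNTIME_ARN, then *_AGENT_URL, then a localhost default.
    """
    env = os.environ if env is None else env
    prefix = agent.upper()

    arn = env.get(f"{prefix}_AGENT_RUNTIME_ARN")
    if arn:
        return ("arn", arn)  # deployed: invoke through AgentCore

    url = env.get(f"{prefix}_AGENT_URL")
    if url:
        return ("url", url)  # local dev: plain HTTP A2A server

    return ("url", ASSUMED_LOCAL_DEFAULT)  # nothing set: assumed default
```

With `EKS_AGENT_URL=http://localhost:9005` exported and no ARN set, `resolve_agent_endpoint("eks")` would return the URL variant, which is the local-development path.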

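The fail-open behaviour of the alert classification gate can be illustrated with a small sketch. Everything here is hypothetical: the marker patterns, the `Verdict` enum, and the `classify_mention` signature are invented for illustration and are not taken from `lambda_adapter/classifier.py`; the optional `llm` callable merely stands in for the Tier 2 Bedrock call.

```python
import re
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    INVESTIGATE = "investigate"
    SUPPRESS = "suppress"


# Illustrative patterns only; the real heuristics are not shown in the README.
ALERT_MARKERS = re.compile(r"\b(critical|firing|sev[0-9]|alertmanager|grafana)\b", re.I)
CHATTER = re.compile(r"^(thanks|thank you|hi|hello|ok|got it)[\s!.]*$", re.I)
MENTION = re.compile(r"<@[^>]+>")  # Slack/Discord style user mention
OVERRIDE = re.compile(r"\binvestigate\b", re.I)


def classify_mention(
    text: str,
    enabled: bool = True,
    llm: Optional[Callable[[str], bool]] = None,
) -> Verdict:
    """Decide whether a mention should trigger an investigation (fail-open)."""
    if not enabled:
        return Verdict.INVESTIGATE  # kill-switch: gate disabled
    try:
        body = MENTION.sub("", text).strip()
        if OVERRIDE.search(body):
            return Verdict.INVESTIGATE  # manual override wins
        if ALERT_MARKERS.search(body):
            return Verdict.INVESTIGATE  # Tier 1: looks like an alert
        if not body or CHATTER.match(body):
            return Verdict.SUPPRESS  # Tier 1: bare mention or chatter
        if llm is not None:
            # Tier 2: the callable returns True when it judges an alert
            return Verdict.INVESTIGATE if llm(body) else Verdict.SUPPRESS
        return Verdict.INVESTIGATE  # ambiguous and no LLM: fail open
    except Exception:
        return Verdict.INVESTIGATE  # any unexpected error: fail open
```

The only paths that suppress are a confident chatter match and an explicit "not an alert" answer from the optional second tier; every other path, including an exception raised by the `llm` callable, falls through to an investigation.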
Tags: AWS Bedrock, EKS, Slack, SRE, bias filtering, alert triage, multi-agent, request interception, operations, reverse-engineering tools