jordann6/azure-incident-responder

GitHub: jordann6/azure-incident-responder

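An automated incident response pipeline built on Azure Monitor, an n8n workflow, and Claude Haiku, closing the loop end to end from alert detection through automated remediation and Slack notification.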


# Azure Incident Responder

The Azure counterpart to [aws-incident-responder](https://github.com/jordann6/aws-incident-responder). It is an automated incident response pipeline in which the runbook is an n8n workflow rather than glue code. An Azure Monitor metric alert on the target VM triggers n8n, running on Azure Container Apps, through an action group webhook. The workflow asks Claude Haiku for a plain-English incident summary, posts an incident card to Slack, restarts the VM through the Azure Management API, and then rechecks the CPU metric to decide whether to mark the incident resolved or escalate it.

## Architecture

![Architecture](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/9e5bd668ca230910.png)

## How it works

```
Target VM Percentage CPU >= 80% (5 min)
  -> Azure Monitor metric alert
  -> Action group (common alert schema webhook)
  -> n8n on Azure Container Apps (managed HTTPS FQDN)

n8n runbook:
  parse common alert schema
  -> Claude Haiku incident summary
  -> Slack incident card
  -> restart VM (Azure Management API, OAuth2 client credentials)
  -> wait 90s
  -> query Percentage CPU metric
       -> resolved    -> Slack resolved
       -> still high  -> Slack escalate
```

## Why this differs from the AWS build

The two projects are deliberately built on different toolchains to show range across cloud platforms, the same split as [aws-developer-platform](https://github.com/jordann6/aws-developer-platform) and [azure-developer-platform](https://github.com/jordann6/azure-developer-platform):

| Concern | AWS build | Azure build |
|---|---|---|
| Detection | CloudWatch alarm | Azure Monitor metric alert |
| Delivery | SNS HTTPS subscription (with confirmation handshake) | Action group webhook (direct, common alert schema) |
| TLS endpoint | ALB + ACM certificate + Route 53 subdomain | Container Apps managed FQDN, valid certificate out of the box |
| n8n hosting | ECS Fargate behind an ALB | Azure Container Apps |
| Remediation auth | Scoped IAM user + SigV4 | Entra service principal + OAuth2 client credentials |
| Remediation action | `RebootInstances`, verified via DescribeAlarms | VM `restart`, verified via the Percentage CPU metric |

Both keep the same shape: a visual, version-controlled runbook plus a Claude Haiku summary that makes the Slack card read like an on-call note.

## Components

| Layer | Resource | Role |
|---|---|---|
| Detection | **VM target** (Standard_B1s, Ubuntu 22.04) | Demo workload; CPU is spiked with `az vm run-command` during the demo |
| Detection | **Metric alert** | `Percentage CPU >= 80%` over a 5-minute window, evaluated every minute |
| Routing | **Action group** | Posts the common alert schema to the n8n webhook |
| Control plane | **Container Apps** | Runs `n8nio/n8n` on a managed HTTPS FQDN (the runbook engine) |
| Runbook | **Claude Haiku** | Generates the incident summary and a recommended next step |
| Runbook | **Slack** | Sends incident, resolved, and escalation messages to `#incidents` |
| Remediation | **Service principal** | `Virtual Machine Contributor` scoped to the target VM only |
| IaC | **Terraform** | RG, VNet, VM, Container Apps, alert, action group; azurerm remote state |

## Prerequisites

- Terraform >= 1.6 and the Azure CLI logged in to subscription `9c644a73-5dc1-4bfe-9e90-91865014cdd2`
- azurerm state backend (`rg-tfbackend-jordprojs` / `sttfbejordprojs8557` / `tfstate`)
- An Anthropic API key and a Slack bot token with the `chat:write` scope

## Deploy

The n8n secrets are generated by Terraform (`random_password`) and live only in the private state backend, so there is no pre-step:

```
cd terraform
terraform init
terraform apply
```

Note the outputs (`n8n_url`, `n8n_webhook_endpoint`, `target_vm_id`, `target_vm_name`, `alert_name`). Fetch the n8n password:

```
terraform output -raw n8n_basic_auth_password
```

### Create the scoped service principal

n8n restarts the VM under a least-privilege identity. That identity is created out of band, so its secret never lands in Terraform state:

```
az ad sp create-for-rbac \
  --name "incident-responder-n8n" \
  --role "Virtual Machine Contributor" \
  --scopes "$(terraform output -raw target_vm_id)"
```

### Configure n8n

1. Open `n8n_url` and log in as `admin` with the password above.
2. Import `workflows/incident-responder-azure.json`.
3. Add the Azure credential as **OAuth2 (client credentials)**:
   - Access Token URL: `https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token`
   - Client ID / Secret: from the `az ad sp` output (`appId` / `password`)
   - Scope: `https://management.azure.com/.default`
4. Add the Anthropic credential as HTTP Header Auth: header `x-api-key`, value set to your API key.
5. Add the Slack credential (bot token) and confirm the bot is in `#incidents`.
6. Activate the workflow. The action group already points at the webhook, so no further wiring is needed.

## Verify

Spike CPU on the target VM to trip the alert:

```
az vm run-command invoke \
  --name "$(terraform output -raw target_vm_name)" \
  --resource-group rg-incident-responder \
  --command-id RunShellScript \
  --scripts "sudo apt-get update -y && sudo apt-get install -y stress-ng && stress-ng --cpu 0 --timeout 600s"
```

Within the alert window the metric alert fires and the runbook runs end to end. Confirm the following:

- Slack `#incidents` shows the incident card with the Haiku summary, followed by a resolved or escalation message.
- The n8n execution log shows the full path, including the restart and the metric query.
- The VM shows a recent restart in the Azure portal activity log.

## Teardown

```
cd terraform
terraform destroy
az ad sp delete --id "$(az ad sp list --display-name incident-responder-n8n --query '[0].appId' -o tsv)"
```

Teardown is clean: a single resource group holds everything Terraform created, and the service principal is removed separately.

## Cost

Running cost is roughly `$0.05/hour` (one 0.5 vCPU Container App, one Standard_B1s VM, Log Analytics ingestion), or about `$1/day`. The build is meant to be deployed, demoed, and destroyed on the same day for well under a dollar in total.

## Hardening notes

- In production, the Container App would use a managed identity with federated credentials instead of a service principal secret, eliminating the long-lived client secret.
- n8n runs SQLite on the Container App's ephemeral storage. To persist data across revisions, mount Azure Files or point n8n at a managed Postgres.
- A production runbook would add an approval gate before the restart and a maximum retry budget before escalating.
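
## Runbook logic sketch

The runbook flow described above (parse the common alert schema, restart the VM, recheck the CPU metric, then resolve or escalate) can be approximated in a few lines of Python. This is a minimal sketch and is not taken from the workflow JSON: the names `Incident`, `parse_alert`, `restart_url`, `verdict`, and `SAMPLE_ALERT` are hypothetical, the `api-version` value is an assumption, and averaging the post-restart samples is only one plausible way to apply the 80% threshold.

```python
from __future__ import annotations

from dataclasses import dataclass

CPU_THRESHOLD = 80.0  # percent; mirrors the alert rule described in this README

# Trimmed example of an Azure Monitor common alert schema body (illustrative values).
SAMPLE_ALERT = {
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertRule": "cpu-high",
            "severity": "Sev2",
            "monitorCondition": "Fired",
            "alertTargetIDs": [
                "/subscriptions/0000/resourceGroups/rg-incident-responder"
                "/providers/Microsoft.Compute/virtualMachines/vm-target"
            ],
        }
    },
}


@dataclass
class Incident:
    rule: str
    severity: str
    condition: str
    target_id: str


def parse_alert(payload: dict) -> Incident:
    """Pull the fields the runbook needs out of a common alert schema body."""
    essentials = payload["data"]["essentials"]
    return Incident(
        rule=essentials["alertRule"],
        severity=essentials["severity"],
        condition=essentials["monitorCondition"],
        target_id=essentials["alertTargetIDs"][0],
    )


def restart_url(target_id: str, api_version: str = "2023-09-01") -> str:
    """Build the Azure Management API URL for a VM restart (sent as a POST).

    The api_version default is an assumption; use whichever version the
    workflow is configured with.
    """
    return f"https://management.azure.com{target_id}/restart?api-version={api_version}"


def verdict(cpu_samples: list[float], threshold: float = CPU_THRESHOLD) -> str:
    """Return 'resolved' if average post-restart CPU is below the threshold."""
    if not cpu_samples:
        return "escalate"  # no data after the restart is treated as unresolved
    average = sum(cpu_samples) / len(cpu_samples)
    return "resolved" if average < threshold else "escalate"
```

Treating an empty metric response as an escalation is a deliberately conservative choice in this sketch: a VM that reports no CPU data after a restart may not have come back up.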
