bubakry/chaos-testing

GitHub: bubakry/chaos-testing

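A chaos engineering project for incident response and reliability drills. It pairs controlled fault injection with a full set of operational document templates, so teams can practice the whole flow from alert to postmortem end to end, either locally or on AWS.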

Stars: 0 | Forks: 0

# chaos-testing

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/05/2b76b6ce46123756.svg)](https://github.com/bubakry/chaos-testing/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Terraform](https://img.shields.io/badge/Terraform-1.9-7B42BC?logo=terraform&logoColor=white)](https://www.terraform.io/)
[![AWS](https://img.shields.io/badge/AWS-ECS%20Fargate-FF9900?logo=amazonaws&logoColor=white)](https://aws.amazon.com/fargate/)
[![Prometheus](https://img.shields.io/badge/Prometheus-monitoring-E6522C?logo=prometheus&logoColor=white)](https://prometheus.io/)
[![Node.js](https://img.shields.io/badge/Node.js-20-339933?logo=node.js&logoColor=white)](https://nodejs.org/)

## Why this project exists

Most "deploy a service" projects stop once the health check passes. Real reliability work starts **after** the service is healthy: alerts have to fire, dashboards have to be readable, runbooks have to be followed, and the on-call engineer has to know what to do. This repository gives you a deliberately fragile API plus the operational scaffolding around it, so you can practice that loop end to end.

## Highlights

- **Fault-injection API**: error rate, latency, dependency outage, memory pressure, and CPU saturation can each be switched on with a simple `POST` request.
- **Two ways to run**: a local stack via `docker compose` for quick demos, or AWS ECS Fargate + ALB + CloudWatch alarms + SNS for a realistic drill.
- **Account-guarded Terraform**: with `EXPECTED_ACCOUNT_ID` set, apply fails if you are authenticated to the wrong AWS account.
- **Operational artifacts, not just code**: runbooks, an on-call workflow, an alert→playbook map, a postmortem template, a handoff template, and a communications template.
- **End-to-end automation**: `aws_full_automation.sh` deploys, generates baseline traffic, injects an error storm, checks alarms, recovers, and optionally destroys the whole stack.
- **Prometheus monitoring**: `prom-client` exposes request count, error count, and latency histogram metrics; alert rules and Alertmanager routing config are included in the repository.

## Game-day loop

```mermaid
flowchart LR
    Eng[On-call engineer] -- "POST /chaos/error-rate" --> API[Chaos API<br/>Node.js + prom-client]
    API -- "/metrics" --> Prom[Prometheus]
    Prom -- alert rules fire --> AM[Alertmanager]
    AM -- webhook --> Hook[Mock on-call webhook]
    Eng -- follows --> RB[Runbook<br/>runbooks/]
    Eng -- "POST /chaos/reset" --> API
    API -- recovery --> Prom
    Prom -- alert resolves --> AM
    Eng -- writes --> PM[Postmortem<br/>templates/]
```

## AWS reference architecture

```mermaid
flowchart LR
    GH[Local laptop] -- "docker buildx --platform linux/amd64" --> ECR[(ECR)]
    ECR --> ECS[ECS Fargate task<br/>chaos-api]
    Internet[Internet] --> ALB[ALB :80]
    ALB --> ECS
    ECS -- logs --> CWL[CloudWatch Logs]
    ECS -- metrics --> CW[CloudWatch Metrics]
    CW -- p95 latency / 5xx --> CWA[CloudWatch Alarms]
    CWA --> SNS[SNS topics]
    SNS --> Email[primary + secondary on-call email]
```

## Quick start

### Local mode

```
docker compose up --build

# In another terminal:
curl -s http://localhost:8080/healthz
curl -s http://localhost:8080/chaos/state

# Run the baseline scenario
BASE_URL=http://localhost:8080 ./scripts/chaos_scenarios.sh baseline
```

Prometheus is available at `http://localhost:9090` and Alertmanager at `http://localhost:9093`.

### AWS mode

```
export AWS_REGION=us-east-1
export PROJECT_NAME=chaos-game-day
export ENVIRONMENT=demo
export EXPECTED_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

./scripts/deploy_to_aws.sh        # builds, pushes to ECR, applies Terraform
./scripts/aws_full_automation.sh  # full game-day workflow
./scripts/destroy_aws_stack.sh    # tear it back down
```

A CloudShell variant (`scripts/cloudshell_full_automation.sh`) skips the local Docker build and pulls the image directly from an existing ECR repository. See [AWS_TEST_NOTES.md](AWS_TEST_NOTES.md) for the manual validation checklist and common failure cases.

## Chaos endpoints

| Method | Path | Effect |
| --- | --- | --- |
| `POST` | `/chaos/error-rate` | Injects random 5xx errors at the given percentage |
| `POST` | `/chaos/latency` | Adds a fixed delay (milliseconds) |
| `POST` | `/chaos/dependency` | Simulates a downstream dependency outage |
| `POST` | `/chaos/memory` | Holds on to memory blocks to push RSS up |
| `POST` | `/chaos/cpu` | Runs a CPU-burning busy loop |
| `POST` | `/chaos/reset` | Clears all chaos state |
| `GET` | `/chaos/state` | Shows the current chaos configuration |
| `GET` | `/healthz` | Liveness |
| `GET` | `/readyz` | Readiness |
| `GET` | `/api/orders` · `/api/payments` · `/api/notifications` | Sample business endpoints |
| `GET` | `/metrics` | Prometheus metrics output |

## Repository layout

```
chaos-testing/
├── app/                  # Node.js chaos API + Dockerfile
├── aws/                  # Terraform: ECR, ECS Fargate, ALB, CW alarms, SNS
├── scripts/              # Shell automation: deploy, run scenarios, alarm checks
├── prometheus/           # prometheus.yml + alert-rules.yml
├── alertmanager/         # alertmanager.yml routing
├── runbooks/             # on-call workflow, alert→playbook map, game-day checklist
├── templates/            # postmortem, handoff, incident comms templates
├── docker-compose.yml    # local stack
└── AWS_TEST_NOTES.md     # manual validation playbook
```

## Operational artifacts

- [`runbooks/on-call-workflow.md`](runbooks/on-call-workflow.md): what the on-call engineer should do when paged.
- [`runbooks/alert-playbook-map.md`](runbooks/alert-playbook-map.md): alert → response runbook lookup table.
- [`runbooks/game-day-checklist.md`](runbooks/game-day-checklist.md): before/during/after checklist for a planned drill.
- [`templates/postmortem-template.md`](templates/postmortem-template.md): blameless postmortem skeleton.
- [`templates/oncall-handoff-template.md`](templates/oncall-handoff-template.md): shift handoff.
- [`templates/incident-communications-template.md`](templates/incident-communications-template.md): internal and external communications during an active incident.

## Tech stack

- **Application**: Node.js 20, Express, prom-client
- **Local stack**: Docker, Docker Compose, Prometheus, Alertmanager
- **Cloud**: AWS ECR, ECS Fargate, Application Load Balancer, CloudWatch Alarms, SNS
- **IaC**: Terraform 1.9
- **Automation**: Bash, AWS CLI v2

## Why I built this

I wanted to practice the full operational loop, not just deployment. The goal was a compact project where I can deploy a service, inject faults, verify alerts, follow a runbook, and document the recovery the way I would for a real incident, whether on a laptop or in a real AWS account.

## License

[MIT](LICENSE).
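
### Illustrative sketch: how error-rate and latency injection could work

To make the `/chaos/error-rate`, `/chaos/latency`, and `/chaos/reset` endpoints more concrete, here is a minimal JavaScript sketch of the decision logic such endpoints might drive. It is not the repository's implementation: the state shape (`errorRatePercent`, `latencyMs`) and the function names (`createChaosState`, `setErrorRate`, `setLatency`, `resetChaos`, `planRequest`) are hypothetical and chosen only for illustration. The actual request body format of the endpoints is not documented here, so the sketch takes plain numbers.

```javascript
// Hypothetical in-memory chaos state. Field names are assumptions,
// not taken from the chaos-testing source code.
function createChaosState() {
  return { errorRatePercent: 0, latencyMs: 0 };
}

// Parse a numeric setting and clamp it to [min, max].
function clampNumber(value, min, max) {
  const n = Number(value);
  if (!Number.isFinite(n)) {
    throw new TypeError("value must be a finite number");
  }
  return Math.min(max, Math.max(min, n));
}

// What a POST to an error-rate endpoint might store: a percentage in 0..100.
function setErrorRate(state, percent) {
  state.errorRatePercent = clampNumber(percent, 0, 100);
  return state;
}

// What a POST to a latency endpoint might store: a fixed delay in ms (>= 0).
function setLatency(state, ms) {
  state.latencyMs = clampNumber(ms, 0, Number.MAX_SAFE_INTEGER);
  return state;
}

// What a POST to a reset endpoint might do: clear every injected fault.
function resetChaos(state) {
  state.errorRatePercent = 0;
  state.latencyMs = 0;
  return state;
}

// Decide what one incoming request should experience.
// `random` is injectable so the behaviour can be tested deterministically.
function planRequest(state, random = Math.random) {
  const fail = random() * 100 < state.errorRatePercent;
  return { status: fail ? 500 : 200, delayMs: state.latencyMs };
}

// Example: 25% error rate with 300 ms of added latency.
const demo = createChaosState();
setErrorRate(demo, 25);
setLatency(demo, 300);
console.log(planRequest(demo));
```

In an Express app, a middleware could call something like `planRequest` for each request, wait `delayMs` with `setTimeout`, and then either respond with the failing status or pass control to the real handler. Keeping the random source injectable makes the failure decision easy to unit test.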

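Tags: API integration, AWS, CISA project, Docker, DPI, ECS, ECS Fargate, GNU General Public License, MITM proxy, Node.js, Terraform, observability, infrastructure as code (IaC), security defense assessment, resilience testing, fault injection, chaos engineering, drills and postmortems, disaster recovery drills, monitoring and alerting, site reliability engineering (SRE), custom request headers, request interception, operations automation, high-availability architecture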