ThanhTamSys/incident-response-simulator

GitHub: ThanhTamSys/incident-response-simulator

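A chaos engineering simulation platform for incident response that combines fault injection, observability monitoring, SRE runbooks, and postmortem templates.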


# 21: Incident Response Simulator

![Status](https://img.shields.io/badge/Status-Completed-green?style=flat) ![Tech stack](https://img.shields.io/badge/Stack-Python_|_Docker_|_Prometheus_|_Streamlit-blue?style=flat)

## Architecture

```
[Target System - Docker Compose]
├── Nginx (reverse proxy, port 80)
├── Flask API (app, port 8000)
└── PostgreSQL (database, port 5432)

[Chaos Layer]
└── inject_chaos.py   ← CLI: --scenario

[Observability]
├── Prometheus + Grafana (SLI/SLO dashboards)
├── Alertmanager → Telegram
└── Loki (log search)

[Incident Management]
├── /runbooks/        ← how to diagnose each scenario
└── /postmortems/     ← filled-in incident reports
```

## Chaos Scenarios

| Scenario | Command | Effect |
|---|---|---|
| DB latency | `--scenario db_latency` | Runs `tc netem delay 500ms` on the DB port |
| OOM kill | `--scenario oom_kill` | `stress --vm-bytes 500M` → container restarts |
| Disk full | `--scenario disk_full` | `dd if=/dev/zero` fills the disk to 95% |
| API errors | `--scenario api_errors` | Returns 500 for 30% of requests |
| Network partition | `--scenario net_partition` | `iptables` blocks traffic between services |

## Quick Start

```
# Start the target system
docker compose up -d

# Start the observability stack
docker compose -f docker-compose.monitoring.yml up -d

# Inject chaos
python inject_chaos.py --scenario db_latency --duration 120

# View the Grafana dashboard
# http://localhost:3000

# Read the runbook
cat runbooks/db_latency.md

# Resolve the incident and write the postmortem
cp postmortems/template.md postmortems/incident-001.md
```

## Project Structure

```
21_incident_response_simulator/
├── target-system/
│   ├── docker-compose.yml            # Nginx + Flask + PostgreSQL
│   ├── nginx/nginx.conf
│   ├── app/main.py
│   └── app/Dockerfile
├── monitoring/
│   ├── docker-compose.monitoring.yml
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── rules/
│   │       ├── slo-rules.yaml        # SLO alert rules
│   │       └── infra-rules.yaml      # CPU, disk, container rules
│   ├── grafana/
│   │   └── dashboards/
│   │       └── slo-dashboard.json
│   └── alertmanager/
│       └── alertmanager.yml          # → Telegram
├── chaos/
│   ├── inject_chaos.py               # Main CLI tool
│   └── scenarios/
│       ├── db_latency.py
│       ├── oom_kill.py
│       ├── disk_full.py
│       ├── api_errors.py
│       └── net_partition.py
├── runbooks/
│   ├── db_latency.md
│   ├── oom_kill.md
│   ├── disk_full.md
│   ├── api_errors.md
│   └── net_partition.md
├── postmortems/
│   ├── template.md
│   ├── incident-001-db-latency.md    # Filled example
│   └── incident-002-oom-kill.md      # Filled example
└── README.md
```

## SLO Definitions

```
# Availability SLO: 99.5% over a 7-day window
# Latency SLO: p95 < 200ms
# Error Rate SLO: < 1% 5xx errors
```

## Runbook Template

```
## Runbook: [Scenario Name]

**Severity:** P1 / P2 / P3
**Detection:** Alert "[Alert Name]" fired

### Step 1: Confirm the problem
kubectl get pods -n app
kubectl logs --previous

### Step 2: Identify the root cause
...

### Step 3: Mitigate
...

### Step 4: Verify
...

### Step 5: Post-Incident
- Create postmortem within 24h
```

## Deliverables Checklist

- [x] `inject_chaos.py` CLI with 5 scenarios
- [x] Grafana SLO dashboard (availability, latency, error rate)
- [x] 5 runbooks in `/runbooks/`
- [x] Postmortem template + 2 filled-in examples
- [x] Alertmanager → Telegram alerts for all 5 scenarios
- [x] Demo video: inject → alert → resolve → postmortem
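
## Sketch: scenario dispatch in a CLI like `inject_chaos.py`

The README does not show the source of `inject_chaos.py`, so the following is only a minimal sketch of how a `--scenario` / `--duration` CLI could map the five scenario names to the commands named in the scenario table. Everything beyond the scenario names and the two flags is an assumption: the `SCENARIOS` mapping, the `plan` and `main` functions, the `eth0` interface, the `/tmp/chaos_fill` path, and the exact command arguments. The sketch only prints a dry-run plan and executes nothing.

```python
import argparse
import shlex

# Hypothetical mapping: scenario name -> shell commands it might run.
# Command details (interface, file path, port rule) are assumptions that
# expand on the short descriptions in the README table.
SCENARIOS = {
    "db_latency": ["tc qdisc add dev eth0 root netem delay 500ms"],
    "oom_kill": ["stress --vm 1 --vm-bytes 500M"],
    "disk_full": ["dd if=/dev/zero of=/tmp/chaos_fill bs=1M"],
    # Assumed to be toggled inside the Flask app, not via a shell command.
    "api_errors": [],
    "net_partition": ["iptables -A OUTPUT -p tcp --dport 5432 -j DROP"],
}


def build_parser():
    """Build a parser exposing the two flags used in the README quick start."""
    parser = argparse.ArgumentParser(description="Dry-run chaos planner (sketch)")
    parser.add_argument("--scenario", required=True, choices=sorted(SCENARIOS))
    parser.add_argument("--duration", type=int, default=60,
                        help="seconds to keep the fault active (default is assumed)")
    return parser


def plan(scenario):
    """Return the commands for a scenario as argv lists, without running them."""
    return [shlex.split(cmd) for cmd in SCENARIOS[scenario]]


def main(argv):
    args = build_parser().parse_args(argv)
    print(f"scenario={args.scenario} duration={args.duration}s (dry run)")
    for argv_list in plan(args.scenario):
        print("would run:", " ".join(argv_list))


# Example invocation mirroring the quick-start command.
main(["--scenario", "db_latency", "--duration", "120"])
```

Keeping the scenario table as data makes it easy to add a rollback command per scenario, which a real injector would need once `--duration` expires.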

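## Sketch: checking samples against the SLO targets

The three SLO targets stated in the README (99.5% availability over 7 days, p95 latency under 200 ms, under 1% 5xx errors) can be turned into simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the p95 uses the nearest-rank method (the README does not say which percentile method the dashboards use), and the project itself evaluates these through Prometheus rules rather than Python.

```python
# SLO targets taken from the README's SLO definitions.
SLO_AVAILABILITY = 0.995
SLO_WINDOW_DAYS = 7
SLO_P95_MS = 200.0
SLO_ERROR_RATE = 0.01


def error_budget_minutes(target=SLO_AVAILABILITY, window_days=SLO_WINDOW_DAYS):
    """Minutes of unavailability the window tolerates before the SLO is missed."""
    return (1.0 - target) * window_days * 24 * 60


def p95(latencies_ms):
    """Nearest-rank 95th percentile (assumed method)."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses with a 5xx status code."""
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def check_slos(latencies_ms, status_codes):
    """Compare one batch of samples against the latency and error-rate targets."""
    return {
        "latency_ok": p95(latencies_ms) < SLO_P95_MS,
        "error_rate_ok": error_rate(status_codes) < SLO_ERROR_RATE,
    }


# A batch resembling the api_errors scenario: 30% of requests return 500.
print(round(error_budget_minutes(), 1))
print(check_slos([100.0] * 100, [200] * 70 + [500] * 30))
```

With a 99.5% target over 7 days, the error budget works out to about 50 minutes, so a single 120-second injection consumes only a small share of it.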
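Tags: API integration, Docker, Kubernetes, SRE, bias filtering, observability, security defense assessment, failure drills, test cases, chaos engineering, copyright protection, custom request headers, request interception, operations monitoring, reverse engineering tools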