ThanhTamSys/incident-response-simulator
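An incident-response chaos engineering simulation platform that integrates fault injection, observability monitoring, SRE runbooks, and postmortem templates.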
# 21: Incident Response Simulator


## Architecture
```
[Target System — Docker Compose]
├── Nginx (reverse proxy, port 80)
├── Flask API (app, port 8000)
└── PostgreSQL (database, port 5432)
[Chaos Layer]
└── inject_chaos.py ← CLI: --scenario
[Observability]
├── Prometheus + Grafana (SLI/SLO dashboards)
├── Alertmanager → Telegram
└── Loki (log search)
[Incident Management]
├── /runbooks/ ← how to diagnose each scenario
└── /postmortems/ ← filled-in incident reports
```
## Chaos Scenarios
| Scenario | Command | Effect |
|---|---|---|
| DB latency | `--scenario db_latency` | Runs `tc netem delay 500ms` against the DB port |
| OOM kill | `--scenario oom_kill` | `stress --vm-bytes 500M` → container restarts |
| Disk full | `--scenario disk_full` | `dd if=/dev/zero` fills the disk to 95% |
| API errors | `--scenario api_errors` | Returns HTTP 500 for 30% of requests |
| Network partition | `--scenario net_partition` | `iptables` blocks traffic between services |
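
The table above maps each `--scenario` value to one underlying tool. A minimal sketch of how such a dispatcher could be wired with `argparse` is shown below. It is not the repository's actual `inject_chaos.py`: the interface name `eth0`, the output path `/tmp/fill.bin`, the exact tool flags, and the 60-second default duration are all assumptions made for illustration, and the sketch only prints the planned command instead of running it.

```python
import argparse
import shlex

# Hypothetical scenario -> shell command mapping. Only the scenario names and
# the tools (tc, stress, dd, iptables) come from the table; the interface,
# paths, and flags are illustrative assumptions.
SCENARIOS = {
    "db_latency": "tc qdisc add dev eth0 root netem delay 500ms",
    "oom_kill": "stress --vm 1 --vm-bytes 500M",
    "disk_full": "dd if=/dev/zero of=/tmp/fill.bin bs=1M count=1024",
    "api_errors": None,  # application-level fault, no shell command here
    "net_partition": "iptables -A OUTPUT -p tcp --dport 5432 -j DROP",
}


def build_parser():
    """Build a CLI parser that accepts --scenario and --duration."""
    parser = argparse.ArgumentParser(prog="inject_chaos.py")
    parser.add_argument("--scenario", required=True, choices=sorted(SCENARIOS))
    parser.add_argument(
        "--duration",
        type=int,
        default=60,  # assumed default; the README only shows --duration 120
        help="seconds to keep the fault active",
    )
    return parser


def plan(scenario):
    """Return the argv list for a scenario, or None for app-level faults."""
    command = SCENARIOS[scenario]
    return shlex.split(command) if command else None


def main(argv=None):
    """Parse arguments and print the planned command (never executes it)."""
    args = build_parser().parse_args(argv)
    planned = plan(args.scenario)
    if planned is None:
        print(f"{args.scenario}: application-level fault, no shell command")
    else:
        print(f"{args.scenario} for {args.duration}s: {' '.join(planned)}")
    return 0
```

Calling `main(["--scenario", "db_latency", "--duration", "120"])` prints the planned `tc` command. A real injector would also need root privileges, a way to target the right container, and a cleanup step once the duration expires.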
## Quick Start
```
# Run from the repository root
# Start the target system
docker compose -f target-system/docker-compose.yml up -d
# Start the observability stack
docker compose -f monitoring/docker-compose.monitoring.yml up -d
# Inject chaos
python chaos/inject_chaos.py --scenario db_latency --duration 120
# Open the Grafana dashboard
# http://localhost:3000
# Read the runbook
cat runbooks/db_latency.md
# Resolve the incident and write the postmortem
cp postmortems/template.md postmortems/incident-001.md
```
## Project Structure
```
21_incident_response_simulator/
├── target-system/
│   ├── docker-compose.yml              # Nginx + Flask + PostgreSQL
│   ├── nginx/nginx.conf
│   ├── app/main.py
│   └── app/Dockerfile
├── monitoring/
│   ├── docker-compose.monitoring.yml
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── rules/
│   │       ├── slo-rules.yaml          # SLO alert rules
│   │       └── infra-rules.yaml        # CPU, disk, container rules
│   ├── grafana/
│   │   └── dashboards/
│   │       └── slo-dashboard.json
│   └── alertmanager/
│       └── alertmanager.yml            # → Telegram
├── chaos/
│   ├── inject_chaos.py                 # Main CLI tool
│   └── scenarios/
│       ├── db_latency.py
│       ├── oom_kill.py
│       ├── disk_full.py
│       ├── api_errors.py
│       └── net_partition.py
├── runbooks/
│   ├── db_latency.md
│   ├── oom_kill.md
│   ├── disk_full.md
│   ├── api_errors.md
│   └── net_partition.md
├── postmortems/
│   ├── template.md
│   ├── incident-001-db-latency.md      # Filled example
│   └── incident-002-oom-kill.md        # Filled example
└── README.md
```
## SLO Definitions
```
# Availability SLO: 99.5% over 7 days
# Latency SLO: p95 < 200 ms
# Error rate SLO: < 1% 5xx errors
```
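
To make these targets concrete: a 99.5% availability target over 7 days leaves an error budget of about 50.4 minutes of full downtime (0.5% of 10,080 minutes). The sketch below works through that arithmetic and checks a batch of request data against the latency and error-rate targets. The function names and the nearest-rank percentile method are assumptions for illustration; the project itself evaluates its SLOs through Prometheus rules, which this does not reproduce.

```python
# SLO targets taken from the definitions above.
WINDOW_DAYS = 7
AVAILABILITY_TARGET = 0.995
P95_LATENCY_TARGET_MS = 200
ERROR_RATE_TARGET = 0.01


def error_budget_minutes(target=AVAILABILITY_TARGET, window_days=WINDOW_DAYS):
    """Minutes of full downtime the availability SLO allows in the window."""
    return (1 - target) * window_days * 24 * 60


def p95(samples):
    """95th percentile by the nearest-rank method (an assumed choice)."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_report(latencies_ms, total_requests, errors_5xx):
    """Check observed latency and 5xx counts against the two request SLOs."""
    return {
        "latency_ok": p95(latencies_ms) < P95_LATENCY_TARGET_MS,
        "error_rate_ok": errors_5xx / total_requests < ERROR_RATE_TARGET,
    }
```

For example, `slo_report([100] * 95 + [500] * 5, total_requests=1000, errors_5xx=300)` passes the latency check but fails the error-rate check, which is the pattern the `api_errors` scenario (30% of requests returning 500) would be expected to produce.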
## Runbook Template
```
## Runbook: [Scenario Name]
**Severity:** P1 / P2 / P3
**Detection:** Alert "[Alert Name]" fired
### Step 1: Confirm the problem
kubectl get pods -n app
kubectl logs <pod-name> -n app --previous
### Step 2: Identify the root cause
...
### Step 3: Mitigate
...
### Step 4: Verify
...
### Step 5: Post-incident
- Create postmortem within 24h
```
## Deliverables Checklist
- [x] `inject_chaos.py` CLI with 5 scenarios
- [x] Grafana SLO dashboard (availability, latency, error rate)
- [x] 5 runbooks in `/runbooks/`
- [x] Postmortem template + 2 filled-in examples
- [x] Alertmanager → Telegram alerts for all 5 scenarios
- [x] Demo video: inject → alert → resolve → postmortem