kudesiya/incident-response-toolkit

GitHub: kudesiya/incident-response-toolkit

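A lightweight cloud-native incident response toolkit that provides automatic severity classification, runbook generation, MTTR tracking, and postmortem templates, helping SRE teams standardize the full incident management workflow from alert to postmortem.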

Stars: 0 | Forks: 0

# incident-response-toolkit

Practical incident management tooling for cloud-native platforms.

This is a toolkit I kept building and refining while working as an SRE across Fintech, Healthcare, and EdTech. It is the set of tools I ended up rebuilding at every organization, so I decided to put it in one reusable place. All configs and examples are sanitized and generalized; nothing proprietary or company-specific is included.

This is not meant to replace PagerDuty or Jira. It covers the gap between your alerting tool and your ticketing system: severity classification, runbook generation, postmortems, and MTTR tracking from a simple JSON log.

## What's included

```
classifier/   severity engine: classify incidents by metrics or keywords
templates/    runbook, postmortem, comms, oncall handoff templates
scripts/      create_incident.py, mttr_calculator.py
examples/     sample incident log for the MTTR calculator
docs/         severity classification guide
```

## Quick start

```
git clone https://github.com/yourusername/incident-response-toolkit
cd incident-response-toolkit
pip install pyyaml

# copy and edit the config
cp config.yaml my-config.yaml

# dry run: show what would be created
python scripts/create_incident.py \
  --config my-config.yaml \
  --description "elevated 503s on auth service" \
  --error-rate 0.08 \
  --dry-run

# create a real incident runbook
python scripts/create_incident.py \
  --config my-config.yaml \
  --description "checkout API errors" \
  --error-rate 0.12 \
  --latency-p99 6000

# calculate MTTR from the incident log
python scripts/mttr_calculator.py \
  --log examples/sample_incidents.json \
  --since 2024-01-01
```

## Severity classifier

Classifies incidents as `CRITICAL / HIGH / MEDIUM / LOW` based on metrics or keywords. The rules live in `classifier/rules.yaml`, and you can adjust them to match your own SLOs. The defaults are a starting point, not an absolute standard.

```
from classifier.severity import classify

# by metrics
result = classify(error_rate=0.08, latency_p99_ms=5500)
print(result.level)                  # HIGH
print(result.matched_rule)           # error_rate >= 0.05 (got 0.08)
print(result.response_time_minutes)  # 30

# by keywords in the alert
result = classify(keywords=["authentication down", "login failing"])
print(result.level)                  # CRITICAL
```

Rules are evaluated top to bottom, and the first match wins. If nothing matches, the result defaults to LOW as a safe fallback.

## create_incident.py

Generates a pre-filled runbook file plus notification drafts (a Slack message and an email body) for a new incident.

```
python scripts/create_incident.py \
  --config config.yaml \
  --description "payment gateway timeouts" \
  --error-rate 0.06 \
  --output-dir active_incidents
```

Output:

```
active_incidents/
  INC-20240315-1423.md                  # runbook, ready to fill in
  INC-20240315-1423_notifications.txt   # pre-drafted Slack + email
```

Options:

```
--description    Short incident description
--error-rate     Current error rate (0.0–1.0)
--latency-p99    p99 latency in ms
--latency-p95    p95 latency in ms
--availability   Current availability (0.0–1.0)
--keywords       Keywords from alert text
--dry-run        Print output without writing files
--output-dir     Where to write runbooks (default: active_incidents)
```

## mttr_calculator.py

Calculates MTTR and MTTD from a JSON incident log file.

```
python scripts/mttr_calculator.py --log examples/sample_incidents.json
```

```
==================================================
INCIDENT METRICS REPORT
==================================================
Overall (8 incidents, 8 resolved)
  Avg MTTD   : 10.4 min
  Median MTTD: 7.0 min
  Avg MTTR   : 71.4 min
  Median MTTR: 56.5 min
  p95 MTTR   : 157.0 min

By Severity:
  CRITICAL  count=2  avg_mttr=90.5 min  median=90.5 min
  HIGH      count=3  avg_mttr=75.7 min  median=57.0 min
  MEDIUM    count=2  avg_mttr=47.5 min  median=47.5 min
  LOW       count=1  avg_mttr=40.0 min  median=40.0 min
```

Incident log format (see `examples/sample_incidents.json`):

```
{
  "id": "INC-20240115-1423",
  "severity": "HIGH",
  "description": "...",
  "started_at": "2024-01-15T14:10:00Z",
  "detected_at": "2024-01-15T14:23:00Z",
  "resolved_at": "2024-01-15T15:47:00Z",
  "root_cause": "...",
  "postmortem": true
}
```

## Templates

| Template | When to use |
|----------|-------------|
| `incident_runbook.md` | During a live incident: track the timeline, findings, and communications |
| `postmortem.md` | After resolution: blameless and focused on action items |
| `comms_template.md` | Slack/email drafts for stakeholder updates |
| `oncall_handoff.md` | Shift handoff: ongoing incidents and things to watch |

The postmortem template includes a "where we got lucky" section that most templates leave out. I have found it often carries the most valuable signal.

## Configuration

Copy `config.yaml` and adapt it to your team. Secrets (webhook URLs, email distribution lists) should be supplied through environment variables; the config file references only the name of the environment variable, never the value itself.

```
defaults:
  service: "my-platform"
  declared_by: "oncall-eng"
notifications:
  slack:
    critical_channel: "#incidents-critical"
    webhook_env_var: "SLACK_WEBHOOK_URL"
  email:
    critical_dl: "sre-critical@company.com"
```

## Requirements

```
python >= 3.8
pyyaml
```

No other dependencies. That is intentional: it should run anywhere without fiddling with a virtualenv.

## License

MIT
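
## Sketch: first-match rule evaluation

The severity classifier section says rules are evaluated top to bottom, the first match wins, and the fallback is LOW. The snippet below is a minimal sketch of that idea, not the toolkit's actual implementation. The rule schema, the `classify_sketch` name, and every threshold other than the `error_rate >= 0.05` / 30-minute HIGH rule shown in the README are assumptions made for illustration; the real `classifier/rules.yaml` may look different.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rules, ordered from most to least severe.
# Only the HIGH rule (error_rate >= 0.05, 30 min) mirrors the README example;
# the other thresholds and response times are invented for this sketch.
RULES = [
    {"level": "CRITICAL", "metric": "error_rate", "min": 0.10, "response_time_minutes": 15},
    {"level": "HIGH", "metric": "error_rate", "min": 0.05, "response_time_minutes": 30},
    {"level": "MEDIUM", "metric": "latency_p99_ms", "min": 2000, "response_time_minutes": 120},
]


@dataclass
class Result:
    level: str
    matched_rule: Optional[str]
    response_time_minutes: Optional[int]


def classify_sketch(**metrics: float) -> Result:
    """Return the first rule whose threshold is met, else LOW."""
    for rule in RULES:
        value = metrics.get(rule["metric"])
        if value is not None and value >= rule["min"]:
            description = f'{rule["metric"]} >= {rule["min"]} (got {value})'
            return Result(rule["level"], description, rule["response_time_minutes"])
    # Nothing matched: fall back to LOW, as the README describes.
    return Result("LOW", None, None)


result = classify_sketch(error_rate=0.08, latency_p99_ms=5500)
print(result.level)         # HIGH
print(result.matched_rule)  # error_rate >= 0.05 (got 0.08)
```

Because evaluation stops at the first match, rule order matters: placing a broad low-severity rule above a narrow high-severity one would shadow the latter.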
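
## Sketch: MTTD and MTTR from one log entry

To make the incident log format more concrete, here is a rough sketch of how detection and resolution times could be derived from the `started_at`, `detected_at`, and `resolved_at` fields. It is not the code in `scripts/mttr_calculator.py`. In particular, it assumes MTTR is measured from `started_at` to `resolved_at`; the script may instead measure from `detected_at`, and the README does not say which. The helper names are hypothetical.

```python
from datetime import datetime
from statistics import mean, median


def parse_ts(ts: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def minutes_between(start: str, end: str) -> float:
    return (parse_ts(end) - parse_ts(start)).total_seconds() / 60


def summarize(incidents):
    """Average and median MTTD/MTTR in minutes (None when no data)."""
    mttd = [minutes_between(i["started_at"], i["detected_at"])
            for i in incidents if i.get("detected_at")]
    # Assumption: MTTR runs from started_at to resolved_at.
    mttr = [minutes_between(i["started_at"], i["resolved_at"])
            for i in incidents if i.get("resolved_at")]
    return {
        "avg_mttd_min": mean(mttd) if mttd else None,
        "median_mttd_min": median(mttd) if mttd else None,
        "avg_mttr_min": mean(mttr) if mttr else None,
        "median_mttr_min": median(mttr) if mttr else None,
    }


# The single entry shown in the README's log format example.
incidents = [{
    "id": "INC-20240115-1423",
    "severity": "HIGH",
    "started_at": "2024-01-15T14:10:00Z",
    "detected_at": "2024-01-15T14:23:00Z",
    "resolved_at": "2024-01-15T15:47:00Z",
}]

summary = summarize(incidents)
print(summary["avg_mttd_min"])  # 13.0
print(summary["avg_mttr_min"])  # 97.0
```

Unresolved incidents are skipped for MTTR rather than counted as zero, so an open incident does not pull the average down.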
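
## Sketch: resolving secrets from environment variables

The configuration section says the config file holds only the name of an environment variable, never the secret itself. One way that lookup could work is sketched below. The `resolve_webhook` helper is hypothetical and not part of the toolkit, and the config is written as a plain dict (mirroring the README's YAML excerpt) so the example runs without PyYAML.

```python
import os
from typing import Mapping

# Mirrors the README's config.yaml excerpt.
config = {
    "notifications": {
        "slack": {
            "critical_channel": "#incidents-critical",
            "webhook_env_var": "SLACK_WEBHOOK_URL",
        }
    }
}


def resolve_webhook(cfg: dict, environ: Mapping[str, str] = os.environ) -> str:
    """Look up the Slack webhook URL via the env var named in the config."""
    var_name = cfg["notifications"]["slack"]["webhook_env_var"]
    url = environ.get(var_name)
    if not url:
        # Fail loudly rather than silently skipping notifications.
        raise RuntimeError(f"environment variable {var_name} is not set")
    return url


# Usage with an explicit mapping; in practice os.environ is the default.
fake_env = {"SLACK_WEBHOOK_URL": "https://example.invalid/hook"}
print(resolve_webhook(config, fake_env))  # https://example.invalid/hook
```

Keeping only the variable name in the file means the config can be committed or shared without leaking the webhook URL.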

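Tags: Homebrew installation, IT operations, MTTR calculation, Python, Runbook, Socks5 proxy, SRE, YAML, severity classification, post-incident analysis, incident management, bias filtering, healthcare, reliability engineering, alert management, online education, security library, mean time to recovery, library, emergency response, emergency handling, malware classification, incident postmortem, troubleshooting, no backdoor, service availability, disaster recovery, site reliability, operations automation, reverse engineering tools, fintech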