njseeber1/incident-response-toolkit


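incidentctl is a command-line tool that streamlines incident management for engineering teams, from incident creation through post-mortem, and makes MTTR straightforward to calculate.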


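# incidentctl

An end-to-end incident management CLI for engineering teams: create incidents, generate post-mortem templates, track resolution, and measure mean time to resolution (MTTR) across your on-call rotation.

Most incident management tools are either too heavyweight (Jira workflows, custom dashboards) or too ad hoc (a Slack thread that disappears). `incidentctl` sits in between: a lightweight, version-controllable, terminal-native workflow that keeps your incident records next to your codebase.

## Why Incident Management Tooling Matters

Every engineering team eventually learns the same hard lesson: the problem is not the incident itself, but the lack of a structured response process. When a P0 hits at 2 a.m., the last thing the on-call engineer should be doing is copying a post-mortem template out of Confluence or working out where to log the timeline. That cognitive load compounds under pressure and degrades the quality of both the response and the review.

Good incident management tooling reinforces three habits that separate high-performing teams from the rest:

1. **Consistent documentation** - Every incident gets a post-mortem template at creation time, not after resolution.
2. **Timestamped timelines** - What happened, and when, is recorded continuously during the incident rather than reconstructed from memory 48 hours later.
3. **Quantified MTTR** - You cannot improve what you do not measure. MTTR broken down by severity tells you where your next reliability investment should go.

`incidentctl` makes all three effortless.

## Features

- **`incidentctl new`** - Creates a YAML incident record and generates a pre-populated post-mortem template in a single command
- **`incidentctl list`** - Renders a clean Rich table of all incidents, with severity color coding and status
- **`incidentctl resolve`** - Marks an incident as resolved, records the timestamp, and calculates TTR automatically
- **`incidentctl report`** - Generates an MTTR summary with a severity breakdown and per-incident resolution times
- **YAML-based storage** - Incident data lives in plain-text files that can be version-controlled alongside your codebase
- **Jinja2 post-mortem templates** - Fully customizable Markdown templates with every required section pre-populated

## Installation

**From source (recommended for portfolio review):**

```
git clone https://github.com/njseeber1/incident-response-toolkit.git
cd incident-response-toolkit
pip install -e .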
```

**From PyPI (once published):**

```
pip install incidentctl
```

**Requirements:** Python 3.10 or later.

## Usage

### Create a New Incident

```
incidentctl new --severity P0 --title "Primary database cluster unresponsive" --owner sarah.chen
```

```
╭─ Incident Created ────────────────────────────────────────╮
│ ID:        INC-20260410-a3f2                              │
│ Severity:  P0                                             │
│ Title:     Primary database cluster unresponsive          │
│ Owner:     sarah.chen                                     │
│ Created:   2026-04-10T02:14:00 UTC                        │
│                                                           │
│ Incident file:  incidents/INC-20260410-a3f2.yaml          │
│ Post-mortem:    postmortems/INC-20260410-a3f2.md          │
╰───────────────────────────────────────────────────────────╯
```

This command immediately creates two files:

- `incidents/INC-20260410-a3f2.yaml` - A structured incident record, including the timeline
- `postmortems/INC-20260410-a3f2.md` - A pre-populated post-mortem template, ready to fill in

### List All Incidents

```
incidentctl list
```

```
╭──────────────────────────────────────────────────────────────────────────────────────────╮
│                                        Incidents                                         │
├──────────────────────┬──────────┬──────────┬────────────────────────────┬──────────┬────┤
│ ID                   │ Severity │ Status   │ Title                      │ Owner    │ .. │
├──────────────────────┼──────────┼──────────┼────────────────────────────┼──────────┼────┤
│ INC-20260410-db01    │ P0       │ resolved │ Primary database cluster.. │ sarah.ch │ .. │
│ INC-20260418-auth01  │ P1       │ resolved │ Auth service elevated err. │ marcus.o │ .. │
│ INC-20260423-api01   │ P2       │ open     │ API gateway p99 latency d. │ priya.na │ .. │
╰──────────────────────┴──────────┴──────────┴────────────────────────────┴──────────┴────╯
Total: 3 incident(s)
```

P0 rows are rendered in bold red, P1 in red, P2 in yellow, and P3 in cyan. Open incidents are shown in red and resolved incidents in green.

### Resolve an Incident

```
incidentctl resolve INC-20260423-api01 --note "Rolled back v2.14.1. Latency normalized."
```

```
╭─ Incident Resolved ───────────────────╮
│ ID:        INC-20260423-api01         │
│ Status:    resolved                   │
│ Resolved:  2026-04-23T11:02:00 UTC    │
│ TTR:       1h 21m                     │
╰───────────────────────────────────────╯
```

### Generate a Report

```
incidentctl report
```

```
╭─ Incident Report ────────────────────────╮
│ Total incidents:  3                      │
│ Open:             0                      │
│ Resolved:         3                      │
│ Avg MTTR:         1h 56m                 │
╰──────────────────────────────────────────╯

Breakdown by Severity
 Severity   Count   Bar
 P0         1       █
 P1         1       █
 P2         1       █

Time to Resolution
 ID                    Severity   Title                        TTR
 INC-20260410-db01     P0         Primary database cluster..   2h 33m
 INC-20260418-auth01   P1         Auth service elevated err.   1h 35m
 INC-20260423-api01    P2         API gateway p99 latency d.   1h 21m
```

Filter by severity:

```
incidentctl report --severity P0
```

## Project Structure

```
incident-response-toolkit/
├── incidentctl/
│   ├── __init__.py
│   └── cli.py                      # All CLI commands (new, list, resolve, report)
├── templates/
│   ├── incident.yaml.j2            # Jinja2 template for incident YAML files
│   └── postmortem.md.j2            # Jinja2 post-mortem template
├── incidents/                      # YAML incident records (version-controllable)
│   ├── INC-20260410-db01.yaml      # Sample P0 - database outage
│   ├── INC-20260418-auth01.yaml    # Sample P1 - auth degradation
│   └── INC-20260423-api01.yaml     # Sample P2 - API latency
├── postmortems/                    # Generated and completed post-mortem markdown
│   ├── INC-20260410-db01.md
│   └── INC-20260418-auth01.md
├── pyproject.toml
├── requirements.txt
└── README.md
```

## Post-Mortem Template

The generated post-mortem template (`templates/postmortem.md.j2`) includes every section that matters:

- **Summary** - A 2-4 sentence executive overview
- **Timeline** - A timestamped event log, pre-populated from the incident data
- **Root Cause** - The primary cause and contributing factors
- **Impact** - A structured table covering duration, affected users, and SLA breach status
- **What Went Well / What Went Poorly** - Candid retrospective sections
- **Action Items** - A prioritized table with owners and due dates
- **Lessons Learned** - Forward-looking, systemic improvements

See the sample post-mortems in the `postmortems/` directory for what a completed document looks like.

## Incident YAML Schema

Each incident is stored as a plain YAML file:

```
id: INC-20260418-auth01
title: "Auth service elevated error rate - OAuth token validation failures"
severity: P1
status: resolved
owner: marcus.obi
created_at: "2026-04-18T14:03:00"
resolved_at: "2026-04-18T15:38:00"
postmortem: "postmortems/INC-20260418-auth01.md"
tags:
  - auth
  - oauth
  - redis
timeline:
  - time: "2026-04-18T14:03:00"
    event: "Alert fired: auth-service error rate exceeded 5% threshold."
  - time: "2026-04-18T14:35:00"
    event: "Mitigation applied: Redis maxmemory increased to 8GB."
```

Because incidents are plain YAML files, they can be committed to your repository, diffed in code review, and searched with standard tools.

## Extending the Tool

`incidentctl` is intentionally minimal. Practical extensions for production use:

- **PagerDuty integration** - Create incidents automatically from PD webhooks
- **Slack notifications** - Post to the `#incidents` channel on `new` and `resolve`
- **GitHub Issues sync** - Link each incident to a tracking issue
- **CSV/JSON export** - Feed MTTR data into your reliability dashboard
- **Custom severity levels** - Adapt the P0-P3 scale to your organization's definitions

The Jinja2 templates and YAML schema are designed to be customized without modifying the CLI code.

## License

MIT. See [LICENSE](LICENSE) for details.

*Built by [Nick Seeber](https://github.com/njseeber1). Feedback and PRs welcome.*

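To make the MTTR arithmetic concrete, here is a minimal sketch of how per-severity MTTR could be derived from records shaped like the incident YAML schema. It assumes TTR is simply `resolved_at` minus `created_at` and that open incidents are excluded from the average; it is not taken from `cli.py`. The function names and the two `INC-EXAMPLE-*` records are hypothetical, and plain dicts stand in for parsed YAML so the example needs only the standard library.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Records mirror the fields of the incident YAML schema.
# Only the auth01 timestamps come from the README; the other two are made up.
INCIDENTS = [
    {"id": "INC-20260418-auth01", "severity": "P1", "status": "resolved",
     "created_at": "2026-04-18T14:03:00", "resolved_at": "2026-04-18T15:38:00"},
    {"id": "INC-EXAMPLE-0001", "severity": "P1", "status": "resolved",
     "created_at": "2026-05-02T09:00:00", "resolved_at": "2026-05-02T09:45:00"},
    {"id": "INC-EXAMPLE-0002", "severity": "P2", "status": "open",
     "created_at": "2026-05-03T10:00:00", "resolved_at": None},
]


def time_to_resolution(incident):
    """Return resolved_at - created_at, or None while the incident is open."""
    if not incident.get("resolved_at"):
        return None
    created = datetime.fromisoformat(incident["created_at"])
    resolved = datetime.fromisoformat(incident["resolved_at"])
    return resolved - created


def format_duration(delta):
    """Format a timedelta as 'Xh Ym', dropping any leftover seconds."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60}m"


def mttr_by_severity(incidents):
    """Average the TTR of resolved incidents, grouped by severity."""
    buckets = defaultdict(list)
    for incident in incidents:
        ttr = time_to_resolution(incident)
        if ttr is not None:
            buckets[incident["severity"]].append(ttr)
    return {
        severity: sum(ttrs, timedelta()) / len(ttrs)
        for severity, ttrs in buckets.items()
    }


if __name__ == "__main__":
    for severity, mttr in sorted(mttr_by_severity(INCIDENTS).items()):
        print(severity, format_duration(mttr))
```

With these records, the auth01 incident has a TTR of 1h 35m, and the two resolved P1 incidents average 1h 10m. The open P2 incident contributes nothing until it has a `resolved_at` value. Loading the real files from `incidents/` would additionally need a YAML parser such as PyYAML, which this sketch leaves out.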