rohitsalesforce132/ai-incident-response

GitHub: rohitsalesforce132/ai-incident-response

一个面向生产环境的AI事件响应框架，以流程化处置实现快速遏制、根因定位与修复。

Stars: 0 | Forks: 0

# AI 事件响应 — 生产故障管理 ## 架构 ``` AI Production Failure Detected │ ├─► Immediate Containment │ ├─ Kill switch: disable affected feature │ ├─ Fallback routing: human review / static answers │ ├─ Stakeholder alerting (product, legal, PR, leadership) │ └─ Evidence preservation (prompts, traces, model version) │ ├─► Assessment & Triage │ ├─ Impact scope (users affected, data exposed) │ ├─ Severity classification (critical/high/medium) │ └─ Root cause category (model/data/orchestration/tool) │ ├─► Root Cause Analysis │ ├─ Reproduce with similar inputs │ ├─ Inspect full decision chain traces │ ├─ Review recent changes (model/prompt/data) │ └─ Check for behavioral drift │ ├─► Remediation │ ├─ Short-term: guardrails, filter patterns, prompt hardening │ ├─ Rollback: revert suspected changes │ ├─ Data fix: remove/correct bad retrieval data │ └─ Model fix: fine-tune or switch models │ ├─► Prevention │ ├─ Add regression evals │ ├─ Strengthen monitoring alerts │ ├─ Harden guardrails │ └─ Update runbooks │ └─► Communication ├─ Internal postmortem ├─ External communication (with legal) └─ User notification ``` ## 测试套件 ``` pip install -r requirements.txt python -m pytest tests/ -q # 55+ tests ``` ## 关键指标 - 遏制时间：<5 分钟 - 根因定位时间：<2 小时 - 修复时间：<24 小时 - 复发率：0% ## 许可证 MIT

标签：AIOps, AI事件响应, AI生产故障, AI运维, MLOps, pytest, Python, 严重性分级, 人工审核, 关键指标, 即时遏制, 回归测试, 安全规则引擎, 影响范围评估, 护栏加固, 提示词工程, 提示词硬化, 故障分类, 故障恢复, 数据修复, 数据清洗, 无后门, 时间到修复, 时间到包含, 时间到根因, 根因分析, 模型回滚, 模型版本追踪, 沟通与通报, 法律公关, 流量回退, 生产环境故障管理, 监控告警, 策略决策点, 管道编排, 行为漂移检测, 证据保存, 逆向工具, 零复发