rohitsalesforce132/ai-incident-response

GitHub: rohitsalesforce132/ai-incident-response

一个面向生产环境的AI事件响应框架,以流程化处置实现快速遏制、根因定位与修复。

Stars: 0 | Forks: 0

# AI 事件响应 — 生产故障管理 ## 架构 ``` AI Production Failure Detected │ ├─► Immediate Containment │ ├─ Kill switch: disable affected feature │ ├─ Fallback routing: human review / static answers │ ├─ Stakeholder alerting (product, legal, PR, leadership) │ └─ Evidence preservation (prompts, traces, model version) │ ├─► Assessment & Triage │ ├─ Impact scope (users affected, data exposed) │ ├─ Severity classification (critical/high/medium) │ └─ Root cause category (model/data/orchestration/tool) │ ├─► Root Cause Analysis │ ├─ Reproduce with similar inputs │ ├─ Inspect full decision chain traces │ ├─ Review recent changes (model/prompt/data) │ └─ Check for behavioral drift │ ├─► Remediation │ ├─ Short-term: guardrails, filter patterns, prompt hardening │ ├─ Rollback: revert suspected changes │ ├─ Data fix: remove/correct bad retrieval data │ └─ Model fix: fine-tune or switch models │ ├─► Prevention │ ├─ Add regression evals │ ├─ Strengthen monitoring alerts │ ├─ Harden guardrails │ └─ Update runbooks │ └─► Communication ├─ Internal postmortem ├─ External communication (with legal) └─ User notification ``` ## 测试套件 ``` pip install -r requirements.txt python -m pytest tests/ -q # 55+ tests ``` ## 关键指标 - 遏制时间:<5 分钟 - 根因定位时间:<2 小时 - 修复时间:<24 小时 - 复发率:0% ## 许可证 MIT
标签:AIOps, AI事件响应, AI生产故障, AI运维, MLOps, pytest, Python, 严重性分级, 人工审核, 关键指标, 即时遏制, 回归测试, 安全规则引擎, 影响范围评估, 护栏加固, 提示词工程, 提示词硬化, 故障分类, 故障恢复, 数据修复, 数据清洗, 无后门, 时间到修复, 时间到包含, 时间到根因, 根因分析, 模型回滚, 模型版本追踪, 沟通与通报, 法律公关, 流量回退, 生产环境故障管理, 监控告警, 策略决策点, 管道编排, 行为漂移检测, 证据保存, 逆向工具, 零复发