nanthansr/mlops-fraud-pipeline

GitHub: nanthansr/mlops-fraud-pipeline

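An MLOps pipeline built around a fraud detection scenario. It addresses data drift detection, performance monitoring, and alert response after an ML model goes live.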


# MLOps Fraud Detection Pipeline

![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/5ad36bd5a1233423.svg) ![Python](https://img.shields.io/badge/python-3.11-blue) ![FastAPI](https://img.shields.io/badge/FastAPI-0.135-009688) ![License](https://img.shields.io/badge/license-MIT-green) [![Live Demo](https://img.shields.io/badge/Live_Demo-Try_it-6366f1)](https://yieldai-n8n.duckdns.org/fraud-api/demo/)

ML models degrade silently in production. Feature distributions shift, prediction confidence drifts, and by the time anyone notices, the model may have been wrong for weeks.

This project is the detection layer. It is a complete MLOps pipeline, but the part that matters is what happens *after* deployment: Prometheus metrics recorded on every prediction, scheduled Evidently drift checks, Grafana alert rules that fire when model behavior changes, and incident simulations that prove the alerting actually works.

The fraud detection model is just the vehicle. The observability stack is the point.

## How it works

```mermaid
graph LR
    A[Kaggle Dataset] --> B[Feature Engineering]
    B --> C[XGBoost Model]
    C --> D[MLflow Registry]
    D --> E[FastAPI Service]
    E --> F[GitHub Actions CI/CD]
    E --> G[Prometheus]
    F --> H[Docker Deploy]
    G --> I[Grafana Dashboard]
    I --> J[Drift Detection]
    J --> K[Email Alert]
```

**Inference path**: transaction → FastAPI validates the schema → XGBoost predicts → Prometheus counters increment → result returned in about 2 ms.

**Monitoring path**: Prometheus scrapes metrics → Grafana evaluates rules → batch Evidently drift checks push to Pushgateway → alerts fire on a fraud spike or feature drift.

## Project structure

```
mlops-fraud-pipeline/
├── src/
│   ├── api/
│   │   ├── main.py               # FastAPI app (predict, health, metrics, demo)
│   │   └── monitoring.py         # Prometheus custom metrics
│   └── model/
│       └── train.py              # XGBoost training with MLflow tracking
├── demo/
│   └── index.html                # Interactive demo page
├── scripts/
│   ├── download_data.py          # Kaggle dataset download
│   ├── drift_check.py            # Evidently drift detection → Pushgateway
│   ├── fetch_model.py            # Pull champion model from MLflow registry
│   ├── generate_reference.py     # Create drift baseline from training data
│   └── simulate_incident.py      # Fraud spike & distribution shift simulation
├── tests/
│   └── test_api.py               # API endpoint tests (7 tests)
├── docker/
│   ├── prometheus.yml            # Prometheus scrape config
│   └── grafana/provisioning/     # Grafana datasources, dashboards, alerts
├── docs/
│   ├── architecture.md           # Architecture narrative
│   ├── what-i-learned.md         # Personal reflection on building this
│   └── incident-simulation.md    # Incident response documentation
├── data/processed/
│   └── reference.csv             # 5000-row drift baseline
├── Dockerfile                    # Production image
├── docker-compose.yml            # Full local stack
├── requirements.txt
└── .github/workflows/ci-cd.yml   # Test → Build → Deploy pipeline
```

## Technology choices (and why)

| Layer | Technology | Why not the alternative |
|-------|------------|-------------------------|
| Model | XGBoost with `scale_pos_weight=577` | At 577:1 class imbalance, logistic regression loses a lot of recall. Nonlinear feature interactions in the PCA space matter. |
| Primary metric | PR-AUC (not ROC-AUC) | At this imbalance the false-positive rate is computed over the huge legitimate class, so ROC-AUC scores around 0.97 can coexist with poor fraud precision. PR-AUC exposes that. |
| API | FastAPI + Uvicorn | Strongly typed request schemas and auto-generated OpenAPI docs are essential for an MLOps handoff. |
| Drift detection | Evidently DataDriftPreset | Hand-written statistical tests mean under-tested code and inconsistent thresholds. |
| Experiment tracking | MLflow 3.x with the `@champion` alias | Local-only artifacts cannot support model promotion, rollback, or CI integration. |
| Monitoring | Prometheus + Grafana + Pushgateway | CloudWatch cannot bring request metrics, model metrics, and batch drift metrics together in one dashboard. |
| Deployment | Docker + Oracle Cloud Always Free | $0/month, always on (no cold starts), 4 ARM CPUs, 24 GB RAM. |

## The part that actually matters

The Grafana alert rules are provisioned through `docker/grafana/provisioning/alerting/`:

| Rule | Condition | Why this threshold |
|------|-----------|--------------------|
| Fraud rate spike | `rate(fraud[5m]) / rate(total[5m]) > 0.005` for 1 minute | The baseline is 0.17%. 0.5% is roughly 3x the baseline: high enough to avoid noise from test traffic, low enough to catch real model degradation. |
| Data drift warning | `drift_share_of_drifted_columns > 0.5`, fires immediately | If more than half of the features drift at the same time, that points to an upstream data change rather than random variance in a single feature. |

Both rules route to an email contact point configured over SMTP. The incident simulation (`scripts/simulate_incident.py`) proves that these alerts actually fire. This is tested, not theoretical. See [docs/incident-simulation.md](docs/incident-simulation.md) for the full walkthrough.

## Run locally

```
# Clone and enter the repository
git clone https://github.com/nanthansr/mlops-fraud-pipeline
cd mlops-fraud-pipeline

# Download the data (requires a Kaggle API key)
python scripts/download_data.py

# Train the model
python src/model/train.py

# Start the full stack (API + Prometheus + Grafana)
docker compose up --build

# API docs
open http://localhost:8000/docs

# Demo page
open http://localhost:8000/demo

# Grafana dashboard
open http://localhost:3000   # admin / admin

# Run the tests
pytest tests/ -v
```

### Simulate an incident

```
python scripts/simulate_incident.py --incident fraud_spike
python scripts/simulate_incident.py --incident distribution_shift
```

See [docs/incident-simulation.md](docs/incident-simulation.md) for the detailed incident response documentation.

## What I'd explain in an interview

| Decision | Why | What I gave up |
|----------|-----|----------------|
| PR-AUC as the primary metric | The positive class is only 0.17%; ROC-AUC hides poor fraud recall | Accuracy, ROC-AUC alone |
| Pull the champion model from MLflow in CI | Deploys a reviewed artifact, not an ad hoc retrain | Training on every push |
| One S3 object per prediction | S3 has no safe append operation; partitioned objects support replay and drift detection | Appending to a file in place |
| Pushgateway for batch drift metrics | Drift jobs are short-lived; Prometheus cannot scrape them after they exit | A long-running fake exporter |
| Fraud alert must hold for 1 minute | Roughly 3x the baseline cuts noise; the hold window avoids false alarms from test traffic | A very low instantaneous threshold |
| OIDC for CI-to-AWS authentication | Each job uses a short-lived STS token; no stored access keys | IAM keys in GitHub secrets |
| Tag images with `github.sha` | Full traceability: every image maps to one exact commit | The `latest` tag |

For a deeper reflection on the biggest challenges, what I would change, and what this project deliberately avoids doing, see [docs/what-i-learned.md](docs/what-i-learned.md).

## Dataset

[Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) on Kaggle:

- 284,807 transactions, including 492 fraud cases (0.17%)
- Features V1–V28 (PCA-transformed) + Amount + Time
- Target: 0 = legitimate, 1 = fraud

## Author

Nanthan Srikumar · [LinkedIn](https://www.linkedin.com/in/nanthan-sr/) · [GitHub](https://github.com/nanthansr)

## License

[MIT](LICENSE)
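
## Appendix: the alert thresholds as arithmetic

The two Grafana alert conditions described in this README (fraud rate above 0.005, share of drifted columns above 0.5) can be checked by hand. The following is a minimal sketch of that arithmetic in plain Python, not the project's actual code. The function names are hypothetical, the 30-column count is an assumption taken from the dataset description (V1–V28 + Amount + Time), and the "for 1 minute" hold period that Grafana applies is not modeled. In the real pipeline the rates come from Prometheus counters and the drift share from Evidently.

```python
# Thresholds quoted in the alert-rule table of this README.
FRAUD_RATE_THRESHOLD = 0.005
DRIFT_SHARE_THRESHOLD = 0.5

# Baseline fraud rate from the dataset section: 492 frauds in 284,807 rows.
BASELINE_FRAUD_RATE = 492 / 284_807


def fraud_rate(fraud_count: int, total_count: int) -> float:
    """Ratio of fraud predictions to all predictions in one window.

    Dividing two rate() values over the same window is equivalent to
    dividing the two counter increases, so plain counts are enough here.
    """
    if total_count == 0:
        return 0.0  # no traffic in the window, nothing to alert on
    return fraud_count / total_count


def fraud_spike_fires(fraud_count: int, total_count: int) -> bool:
    """True when the windowed fraud rate exceeds the 0.5% threshold."""
    return fraud_rate(fraud_count, total_count) > FRAUD_RATE_THRESHOLD


def drift_alert_fires(drifted_columns: int, total_columns: int = 30) -> bool:
    """True when strictly more than half of the columns have drifted.

    total_columns=30 is an assumption: V1-V28 plus Amount and Time.
    """
    return drifted_columns / total_columns > DRIFT_SHARE_THRESHOLD


if __name__ == "__main__":
    ratio = FRAUD_RATE_THRESHOLD / BASELINE_FRAUD_RATE
    print(f"baseline fraud rate: {BASELINE_FRAUD_RATE:.4%}")
    print(f"threshold / baseline: {ratio:.2f}x")
    print("6 frauds in 1000 predictions fires:", fraud_spike_fires(6, 1000))
    print("15 of 30 columns drifted fires:", drift_alert_fires(15))
```

Both comparisons are strict, matching the `>` in the rule conditions, so exactly half of the columns drifting does not trigger the drift warning.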

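Tags: Apex, API integration, AV bypass, FastAPI, MLOps, XGBoost, observability, data drift detection, machine learning, custom request headers, request interception, reverse-engineering tools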