Duks31/fraud-detection-platform

GitHub: Duks31/fraud-detection-platform

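A production-grade, real-time fraud detection MLOps platform. It orchestrates an end-to-end machine learning pipeline and serves low-latency predictions through automated feature engineering, model training, and online serving infrastructure.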

Stars: 0 | Forks: 0

# Sentinel: Real-Time Fraud Detection MLOps Platform

[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![Docker](https://img.shields.io/badge/docker-compose-blue.svg)](https://docs.docker.com/compose/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT/)

![Airflow UI](https://raw.githubusercontent.com/Duks31/fraud-detection-platform/master/schematics/airflow.png)

![Streamlit Dashboard](https://raw.githubusercontent.com/Duks31/fraud-detection-platform/master/schematics/dashboard.png)

## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [Tech Stack](#tech-stack)
- [Project Status](#project-status)
- [Quick Start](#quick-start)
- [Project Structure](#project-structure)
- [Features](#features)
- [API Endpoints](#api-endpoints)
- [Development](#development)
- [Roadmap](#roadmap)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)

## Overview

Sentinel is a complete MLOps platform that demonstrates production-grade machine learning infrastructure for fraud detection. It showcases:

- **Automated ML Pipeline**: Daily model retraining orchestrated by Apache Airflow
- **Feature Store**: Feast-based feature management with an offline store (Parquet) and an online store
- **Experiment Tracking**: Model versioning and metrics tracking with MLflow
- **Real-Time Serving**: FastAPI endpoints that return sub-second predictions
- **Scalable Infrastructure**: Microservices architecture built with Docker Compose

**Use case**: Credit card transaction fraud detection on 50,000+ transactions.

Read the full technical write-up on [Medium](https://medium.com/@chidubemndukwe/beyond-the-notebook-architecting-a-real-time-mlops-platform-for-fraud-detection-38dbf523aec4).

## Architecture

### System Architecture

![Architecture](https://static.pigsec.cn/wp-content/uploads/repos/cas/3a/3ad9a40d75e2343245d1f2b33e8de65024b5b224b676bb6e6553804c0bd829fe.png)

#### Legend

| Color | Layer | Components |
|-------|-------|------------|
| Blue | Data Ingestion & Presentation | Raw data, Streamlit Dashboard |
| Green | Feature Store & Serving | Feast, Redis, FastAPI |
| Amber | Orchestration | Apache Airflow |
| Pink/Rose | Model Management | MLflow, MinIO |
| Gray | Persistence | PostgreSQL |

## Tech Stack

| Component | Technology | Purpose |
|-----------|-----------|---------|
| **Orchestration** | Apache Airflow 2.7.1 | Workflow automation and scheduling |
| **Feature Store** | Feast 0.31.1 | Feature engineering and serving |
| **Experiment Tracking** | MLflow | Model versioning and metrics |
| **Object Storage** | MinIO | S3-compatible artifact storage |
| **Online Store** | Redis | Low-latency feature serving |
| **Database** | PostgreSQL 15 | Metadata persistence |
| **API** | FastAPI | Real-time prediction serving |
| **Dashboard** | Streamlit | Visualization and monitoring |
| **Infrastructure** | Docker Compose | Container orchestration |
| **ML Framework** | scikit-learn | Model training |

### Database Architecture: The Shared PostgreSQL Container (`sentinel_db`)

A common question when inspecting the running containers is why the `sentinel_db` container is needed when no logical database named "sentinel_db" is explicitly in use. To save resources, a single PostgreSQL container (`sentinel_db`) serves as the central metadata backbone for the entire MLOps pipeline. During startup, the `init-db.sql` script provisions isolated logical databases inside this container for the core tools:

* **`airflow_db`**: Stores Apache Airflow's orchestration metadata (DAG definitions, task states, RBAC credentials).
* **`feast_registry`**: Acts as the central SQL registry for the Feast Feature Store, keeping the offline and online stores in sync.
* **MLflow Tracking**: Uses the main Postgres database to track experiment runs, hyperparameters, and model registry state.

## Project Status

### What's Working

- [x] **Infrastructure**: 9 Docker containers running on an orchestrated network
- [x] **Feature Store**: Feast with a PostgreSQL registry and a Redis online store
- [x] **ML Pipeline**: Automated 3-stage Airflow DAG
  - Task 1: Apply feature definitions
  - Task 2: Smart materialization (full/incremental)
  - Task 3: Model training and MLflow logging
- [x] **Model Storage**: Artifacts persisted in MinIO
- [x] **API Serving**: FastAPI endpoints with health checks
- [x] **Feature Serving**: Real-time feature retrieval from Redis (~50K features)

### In Progress

- [ ] **Model Performance**: Baseline RandomForest (97.7% accuracy, needs tuning)
- [ ] **Monitoring**: Prometheus + Grafana integration
- [ ] **Testing**: Unit and integration test coverage
- [ ] **Documentation**: Comprehensive setup guide

### Known Limitations

- Model recall is 35% (needs hyperparameter tuning and feature engineering)
- No data drift detection yet
- Single-model serving (no A/B testing)
- The storage bucket must be created manually (not automated in the setup)

## Quick Start

### Prerequisites

- Docker Desktop (20.10+)
- Docker Compose (2.0+)
- Python 3.10+ (for local development)
- At least 8GB RAM
- 20GB of disk space

### Installation

**Automated setup (recommended)**

```
# Clone the repository
git clone https://github.com/Duks31/fraud-detection-platform.git
cd fraud-detection-platform

# Run the automated setup
./setup.sh

# Wait for completion (~5 minutes)
# Follow the on-screen instructions

# Tear down the infrastructure when finished
chmod +x teardown.sh
./teardown.sh
```

**Manual setup**

1. **Clone the repository**

   ```
   git clone https://github.com/Duks31/fraud-detection-platform.git
   cd fraud-detection-platform
   ```

2. **Configure environment variables**

   ```
   cd infrastructure
   cp .env.example .env
   # Edit .env with your credentials (or use defaults for local dev)
   ```

3. **Start the infrastructure**

   ```
   docker compose up -d
   ```

   Wait about 60 seconds for all services to start, then verify:

   ```
   docker ps
   # Should show 9 running containers
   ```

4. **Create the MinIO bucket** (one-time setup)

   ```
   cd ..
   conda activate fdp  # or your virtual environment
   python create_bucket.py
   ```

5. **Trigger the ML pipeline**

   - Open the Airflow UI: http://localhost:8080
   - Log in: `admin` / `admin`
   - Enable and trigger the DAG: `sentinel_mlops_pipeline`
   - Wait about 5-10 minutes for completion (all 3 tasks should turn green)

6. **Test the API**

   ```
   # Health check
   curl http://localhost:8000/health

   # Prediction
   curl http://localhost:8000/predict/2987000

   # Expected output:
   # {"transaction_id":2987000,"is_fraud":true,"fraud_probability":0.71,"status":"Success"}
   ```

## Project Structure

```
fraud-detection-platform/
├── airflow_dags/                          # Airflow DAG definitions
│   └── sentinal_retraining_dag.py
│
├── infrastructure/                        # Docker & orchestration configs
│   ├── docker-compose.yaml                # Service definitions (8 containers)
│   ├── airflow.Dockerfile                 # Custom Airflow image with Feast
│   ├── Dockerfile                         # MLflow server image
│   ├── init-db.sql                        # PostgreSQL initialization script
│   ├── .env.example                       # Environment variables template
│   └── .env                               # Actual credentials (gitignored)
│
├── feature_store/                         # Feast feature definitions
│   ├── feature_store.yaml                 # Feast configuration (PostgreSQL + Redis)
│   ├── definitions.py                     # Feature view definitions
│   └── preprocess_data.py                 # Data cleaning script
│
├── serving_api/                           # FastAPI serving application
│   ├── main.py                            # API endpoints & model loading
│   ├── requirements.txt                   # API dependencies
│   └── Dockerfile                         # API container image
│
├── dashboard/                             # Streamlit visualization
│   ├── dashboard.py                       # Dashboard implementation
│   └── Dockerfile                         # Dashboard container image
│
├── data/                                  # Training datasets
│   ├── train_transaction_clean.parquet    # Preprocessed training data (50K rows)
│   ├── train_transaction.csv              # Original Kaggle dataset
│   ├── train_transaction.parquet          # Intermediate format
│   ├── test_transaction.csv               # Test set
│   ├── train_identity.csv                 # Identity features
│   ├── test_identity.csv                  # Test identity features
│   ├── sample_submission.csv              # Kaggle submission format
│   └── scratch.ipynb                      # Exploratory analysis
│
├── tests/                                 # Test suite
│   └── test_main.py                       # API unit tests
│
├── .github/                               # CI/CD workflows
│   └── workflows/
│       └── main.yml                       # GitHub Actions pipeline
│
├── monitoring/                            # Monitoring configs (TODO)
│
├── train_model.py                         # Model training script (executed by Airflow)
├── create_bucket.py                       # MinIO bucket initialization
├── convert_data.py                        # Data format conversion utilities
├── verify_setup.sh                        # System health check script
├── mlflow.db                              # Local MLflow metadata (for development)
├── README.md                              # This file
├── .gitignore                             # Git ignore patterns
└── .vscode/                               # VS Code workspace settings
```

### Key Files Explained

| File | Purpose |
|------|---------|
| `train_model.py` | Main training script invoked by Airflow. Loads features from Feast, trains a RandomForest, and logs to MLflow |
| `sentinal_retraining_dag.py` | Airflow DAG with 3 tasks: apply features, materialize to Redis, train the model |
| `definitions.py` | Feast feature definitions (TransactionAmt, card1, card2, addr1) |
| `feature_store.yaml` | Feast configuration pointing to the PostgreSQL registry and the Redis online store |
| `serving_api/main.py` | FastAPI application that loads the model from MinIO and features from Redis |
| `docker-compose.yaml` | Orchestrates 9 services: Airflow, MLflow, Feast, Redis, PostgreSQL, MinIO, API, Dashboard |
| `verify_setup.sh` | Health check script that validates all services and connections |

## Features

### Automated ML Pipeline

The pipeline runs daily and consists of:

1. **Feature definition**: Applies the Feast feature views to the PostgreSQL registry
2. **Smart materialization**:
   - First run: full materialization (50K transactions → Redis)
   - Subsequent runs: incremental updates only
3. **Model training**:
   - Fetches historical features from the Feast offline store
   - Trains a RandomForestClassifier (n_estimators=100, max_depth=10)
   - Logs metrics and the model to MLflow
   - Saves artifacts to the MinIO S3 bucket

**Current performance:**

- Accuracy: 97.74%
- Precision: 65.07%
- Recall: 35.19% (needs improvement)

### Feature Store

**Features:**

- `TransactionAmt`: Transaction amount (Float32)
- `card1`: Primary card identifier (Int64)
- `card2`: Secondary card identifier (Int64)
- `addr1`: Billing address code (Float32)

**Architecture:**

- **Offline store**: Parquet files for batch training
- **Online store**: Redis for real-time serving (<10ms latency)
- **Registry**: PostgreSQL for metadata and feature definitions

### Real-Time Serving API

**Endpoints:**

- `GET /`: Service information
- `GET /health`: Health check (returns model and Feature Store status)
- `GET /predict/{transaction_id}`: Get the fraud prediction for a transaction
- `GET /docs`: Interactive API documentation (Swagger UI)

**Response format:**

```
{
  "transaction_id": 2987000,
  "is_fraud": true,
  "fraud_probability": 0.71,
  "status": "Success"
}
```

## Service Endpoints

| Service | URL | Credentials |
|---------|-----|-------------|
| **Airflow UI** | http://localhost:8080 | admin / admin |
| **MLflow UI** | http://localhost:5000 | - |
| **API Docs** | http://localhost:8000/docs | - |
| **API Prediction** | http://localhost:8000/predict/{id} | - |
| **Dashboard** | http://localhost:8501 | - |
| **MinIO Console** | http://localhost:9001 | minio_admin / minio_secure_pass |
| **PostgreSQL** | localhost:5432 | sentinel_user / sentinel_secure_pass |
| **Redis** | localhost:6379 | - |

## Development

### Local Setup (Without Docker)

For local development and testing:

```
# Create a virtual environment
conda create -n fdp python=3.10
conda activate fdp

# Install dependencies
pip install -r serving_api/requirements.txt
pip install feast apache-airflow mlflow scikit-learn

# Set environment variables
export MLFLOW_TRACKING_URI=http://localhost:5000
export FEAST_REPO_PATH=./feature_store

# Run the API locally
cd serving_api
uvicorn main:app --reload
```

### Running Tests

```
# Run all tests
pytest tests/

# Run with coverage
pytest --cov=serving_api tests/

# Run a specific test
pytest tests/test_main.py::test_health_endpoint
```

### Modifying Features

1. Edit `feature_store/definitions.py` to add or modify features
2. Apply the changes:

   ```
   docker exec -it sentinel_scheduler bash
   cd /app/feature_store
   feast apply
   ```

3. Re-materialize:

   ```
   feast materialize 2026-01-08T00:00:00 2026-01-14T23:59:59
   ```

### Retraining the Model

**Automatic**: The DAG runs every day at midnight UTC.

**Manual**:

1. Go to the Airflow UI (http://localhost:8080)
2. Click `sentinel_mlops_pipeline`
3. Click "Trigger DAG" (the play button)
4. Monitor task progress (it should finish in about 5 minutes)

### Viewing Logs

```
# Airflow scheduler logs
docker logs sentinel_scheduler -f

# API logs
docker logs sentinel_api -f

# MLflow logs
docker logs sentinel_mlflow -f

# All services
docker compose logs -f
```

## To-Do (_maybe open a PR_)

- [ ] Model evaluation metrics
- [ ] Hyperparameter tuning
- [ ] Feature importance tracking
- [ ] Add proper unit tests
- [ ] Airflow email alerts
- [ ] Model registry integration
- [ ] Model versioning in the API
- [ ] Add monitoring
- [ ] Data drift detection
- [ ] Add a CI/CD pipeline with GitHub Actions
- [ ] Deploy to the cloud
- [ ] A/B testing framework
- [ ] Online learning
- [ ] Explainability
- [ ] Multi-model serving

## Troubleshooting

### Verification Script

Run the automated health check:

```
./verify_setup.sh
```

It checks that:

- All Docker containers are running
- The database connections work
- The MinIO bucket exists
- The Feast registry is accessible
- Redis is populated with features

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

### Development Guidelines

- Follow the PEP 8 style guide
- Add tests for new features
- Update the documentation
- Make sure all tests pass before submitting a PR

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- **Dataset**: [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection) (Kaggle)
- **Inspired by**: Production MLOps best practices from Netflix, Uber, and Airbnb
- **Built with**: Feast, MLflow, Airflow, FastAPI, and the amazing open-source machine learning community

## Contact

**Chidubem** - [@Duks31](https://github.com/Duks31)

**Project link**: [https://github.com/Duks31/fraud-detection-platform](https://github.com/Duks31/fraud-detection-platform)
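
## Appendix: Python Client Sketch

The `curl` calls in the Quick Start can also be made from Python. The block below is a minimal sketch, not part of the repository: it assumes the serving API is reachable at `http://localhost:8000` and that `/predict/{transaction_id}` returns the four JSON fields shown in the response format. The names `FraudPrediction`, `parse_prediction`, and `fetch_prediction` are hypothetical helpers introduced only for illustration.

```python
import json
from dataclasses import dataclass
from urllib.request import urlopen

# Assumption: the serving API from this README is running locally on port 8000.
BASE_URL = "http://localhost:8000"


@dataclass(frozen=True)
class FraudPrediction:
    """Typed view of the JSON body documented in the response format (hypothetical helper)."""

    transaction_id: int
    is_fraud: bool
    fraud_probability: float
    status: str


def parse_prediction(body: str) -> FraudPrediction:
    """Parse a /predict response body; raises KeyError if a documented field is missing."""
    data = json.loads(body)
    return FraudPrediction(
        transaction_id=int(data["transaction_id"]),
        is_fraud=bool(data["is_fraud"]),
        fraud_probability=float(data["fraud_probability"]),
        status=str(data["status"]),
    )


def fetch_prediction(
    transaction_id: int, base_url: str = BASE_URL, timeout: float = 5.0
) -> FraudPrediction:
    """Call GET /predict/{transaction_id} on a running stack and parse the result."""
    with urlopen(f"{base_url}/predict/{transaction_id}", timeout=timeout) as resp:
        return parse_prediction(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Offline demo: parse the sample response from this README, so no running stack is needed.
    sample = (
        '{"transaction_id":2987000,"is_fraud":true,'
        '"fraud_probability":0.71,"status":"Success"}'
    )
    print(parse_prediction(sample))
```

With the stack running, `fetch_prediction(2987000)` would issue the same request as the `curl` prediction example. Parsing is kept separate from the HTTP call so it can be exercised without any containers.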

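Tags: Apache Airflow, Apex, Kubernetes, MLOps, search engine queries, machine learning, model inference serving, fraud detection, test cases, copyright protection, feature engineering, reverse-engineering tools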