ibikigosu/DataLens

GitHub: ibikigosu/DataLens

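A federal procurement data quality assistant that combines deterministic rules with statistical anomaly detection to help analysts find, explain, and prioritize suspicious issues in vendor and transaction records.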


# DataLens

*Human-in-the-loop data quality analysis for procurement records*

[![Python](https://img.shields.io/badge/Python->=3.11-3776ab?style=flat-square&logo=python&logoColor=white)](https://www.python.org)
[![FastAPI](https://img.shields.io/badge/FastAPI-009688?style=flat-square&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
[![Streamlit](https://img.shields.io/badge/Streamlit-ff4b4b?style=flat-square&logo=streamlit&logoColor=white)](https://streamlit.io)
[![scikit-learn](https://img.shields.io/badge/scikit--learn-f7931e?style=flat-square&logo=scikit-learn&logoColor=white)](https://scikit-learn.org)
[![Ruff](https://img.shields.io/badge/Ruff-261230?style=flat-square&logo=ruff&logoColor=white)](https://docs.astral.sh/ruff/)

[Overview](#overview) • [Features](#features) • [Getting started](#getting-started) • [API](#api) • [Project structure](#project-structure)

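## Overview

DataLens is a human-in-the-loop data quality assistant for structured procurement data. It combines deterministic checks with statistical anomaly detection to help analysts find, explain, and prioritize issues in vendor and transaction records.

## Features

- **Deterministic quality rules** - Detects known data quality issues using rules that carry a type and a severity level
- **Statistical anomaly detection** - Compares Isolation Forest and Local Outlier Factor models to provide supplementary evidence
- **Guarded review queue** - Protects critical deterministic findings when ranking records for review
- **Bounded explanations** - Every finding includes traceable evidence and at most three feature deviations
- **Versioned configuration** - Schema contracts, scoring weights, and model parameters are all under version control
- **Experiment tracking** - MLflow records parameters, metrics, and artifacts for every model comparison
- **FastAPI scoring service** - Provides a versioned REST API for single-record validation, batch scoring, and feedback
- **Reproducible notebooks** - Jupyter notebooks act as thin analysis drivers over tested Python modules

## How it works

```
USAspending data
  |
  v
Acquisition and preparation
  |
  +--> Deterministic quality rules
  |
  +--> Vendor and transaction feature pipelines
  |
  v
Anomaly models
  |
  v
Guarded review queue
```

FY2024 is used for development and model selection. FY2025 stays sealed until temporal evaluation, so its distribution cannot influence training or preprocessing.

The deterministic baseline is currently the primary issue detector. Statistical anomalies are supplementary evidence, because the evaluated models have not yet met the promotion criteria.

## Getting started

### Prerequisites

- [Python](https://www.python.org) 3.11 or later
- [uv](https://docs.astral.sh/uv/) 0.11 or later
- [Git](https://git-scm.com) 2.50 or later

### Installation

```
git clone https://github.com/ibikigosu/DataLens.git
Set-Location DataLens
uv sync --all-groups
```

### Acquire and prepare the data

```
uv run python -m datalens.data.usaspending acquire
uv run python -m datalens.data.prepare
```

Downloaded data stays local and is excluded from Git. Versioned manifests keep provenance and integrity metadata so that runs are reproducible.

### Run the analysis

Run the deterministic baseline:

```
uv run python -m datalens.baseline.run
```

Train, compare, and evaluate the anomaly models:

```
uv run python -m datalens.modeling.run
```

### Run the API

Start FastAPI locally:

```
uv run uvicorn datalens.api.app:app --host 0.0.0.0 --port 8000
```

The OpenAPI documentation is available at `http://localhost:8000/docs`. All public application routes are versioned under `/api/v1`.

## Configuration

Version-controlled defaults live in the `config/` directory. The procurement schema defines the required columns, relationships, and quality scoring weights. The model configuration holds the feature and model versions, model parameters, review ratio, and promotion gates.

Application paths and service endpoints come from `config/application/default.json`. When the local environment needs different service values, copy `.env.example` to `.env`. Every application setting can also be overridden with a `DATALENS_` environment variable, without modifying source code. For example:

```
$env:DATALENS_ARTIFACT_DIR = "C:\datalens-artifacts"
$env:DATALENS_DATABASE_URL = "sqlite:///C:/datalens-artifacts/datalens.db"
uv run python -m datalens.modeling.run
```

Invalid JSON, unknown fields, missing schema columns, invalid types, and undeclared scoring weights all fail before training or scoring begins. The model comparison artifact records the dataset, schema, feature, and model versions used for the run.

## API

### Health check

Check readiness:

```
curl.exe http://localhost:8000/api/v1/health/ready
```

### Batch scoring

Score a pair of CSV files:

```
curl.exe -X POST http://localhost:8000/api/v1/score/batch `
  -F "fiscal_year=2024" `
  -F "vendors=@vendors.csv;type=text/csv" `
  -F "transactions=@transactions.csv;type=text/csv"
```

Retrieve the findings as JSON or CSV:

```
curl.exe http://localhost:8000/api/v1/runs/RUN_ID/findings
curl.exe -OJ http://localhost:8000/api/v1/runs/RUN_ID/findings.csv
```

### Single-record validation

The following request demonstrates a validation failure that occurs before scoring:

```
curl.exe -X POST http://localhost:8000/api/v1/score/vendor `
  -H "Content-Type: application/json" `
  -d '{"fiscal_year":2024,"record":{"vendor_id":"V1"}}'
```

The response status is HTTP 422 because the record does not satisfy the approved vendor schema.

Scoring runs, findings, feedback, and retraining decisions are all persisted through SQLAlchemy. Local runs default to SQLite, while the container stack uses PostgreSQL.

### MLflow UI

Open the local MLflow experiment viewer:

```
uv run mlflow ui --backend-store-uri sqlite:///artifacts/mlflow.db
```

## Notebooks

The notebooks are reproducible analysis drivers. Reusable behavior lives in the `datalens` package, not in notebook state.

Run the FY2024 exploratory analysis:

```
uv run jupyter nbconvert --to notebook --execute notebooks/01_usaspending_pbs_eda.ipynb --inplace
```

Run the model comparison analysis:

```
uv run jupyter nbconvert --to notebook --execute notebooks/02_model_comparison.ipynb --inplace
```

## Evaluation method

DataLens evaluates the deterministic rules and the statistical models against controlled defects injected into realistically shaped procurement data. This yields reproducible labels without claiming that naturally occurring unusual public records are incorrect.

The evaluation covers:

- Precision, recall, and macro F1
- Top-50 precision
- False positives per 1,000 records
- Recall on high-severity and critical issues
- Temporal performance on FY2025

Vendor and transaction records are modeled separately because the two tables have different schemas and quality signals. Model thresholds and preprocessing parameters are learned from FY2024 only.

## Project structure

```
config/         Data acquisition and evaluation settings
data/           Local datasets and versioned provenance manifests
docs/           Architecture decisions, analysis results, and planning
notebooks/      Reproducible analysis drivers
scripts/        Repository verification tools
src/datalens/   Acquisition, features, rules, models, and evaluation
tests/          Automated unit and workflow tests
artifacts/      Generated reports, models, and MLflow state
```

## Verification

Run the formatting, lint, test, and coverage checks together:

```
uv run python scripts/verify.py
```

Or run each check separately:

```
uv run ruff format --check .
uv run ruff check .
uv run pytest --cov=datalens --cov-report=term-missing
```

## Current status

Data acquisition, exploratory analysis, the deterministic baseline, the feature pipelines, anomaly model comparison, MLflow tracking, the notebook-to-module cleanup, the configuration system, and the FastAPI scoring service are complete. Upcoming work includes containerization, a model registry and model cards, and a Streamlit interface that calls the API.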
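
## Example: guarded queue and evaluation metrics

The guarded review queue and two of the evaluation metrics listed in this README (Top-k precision and false positives per 1,000 records) can be sketched in standard-library Python. This is a minimal illustration under stated assumptions, not the code in `src/datalens/`: the `Finding` class, the severity ordering, the `review_fraction` parameter, and the three function names are all hypothetical, and the project may combine rule severity and anomaly scores differently.

```python
import math
from dataclasses import dataclass

# Hypothetical severity ordering; the project's real severities live in its config.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3, "none": 4}


@dataclass(frozen=True)
class Finding:
    record_id: str
    severity: str         # worst deterministic rule severity, or "none"
    anomaly_score: float  # assumed convention: larger means more unusual
    injected: bool        # True if a controlled defect was injected into this record


def guarded_queue(findings, review_fraction):
    """Rank records for review without ever dropping a critical finding."""
    ranked = sorted(
        findings,
        key=lambda f: (SEVERITY_RANK[f.severity], -f.anomaly_score),
    )
    n_critical = sum(f.severity == "critical" for f in findings)
    # The guard: the review budget can grow to fit every critical finding.
    budget = max(n_critical, math.ceil(review_fraction * len(findings)))
    return ranked[:budget]


def precision_at_k(queue, k):
    """Fraction of the first k queued records that carry an injected defect."""
    top = queue[:k]
    return sum(f.injected for f in top) / len(top) if top else 0.0


def false_positives_per_1000(queue, total_records):
    """Queued records without an injected defect, scaled to 1,000 records."""
    false_positives = sum(not f.injected for f in queue)
    return 1000 * false_positives / total_records


findings = [
    Finding("T1", "none", 0.91, True),
    Finding("T2", "critical", 0.10, True),
    Finding("T3", "low", 0.40, False),
    Finding("T4", "critical", 0.05, True),
    Finding("T5", "none", 0.20, False),
    Finding("T6", "high", 0.30, True),
]

queue = guarded_queue(findings, review_fraction=0.5)
print([f.record_id for f in queue])                   # ['T2', 'T4', 'T6']
print(precision_at_k(queue, k=3))                     # 1.0
print(false_positives_per_1000(queue, len(findings)))  # 0.0
```

In this sketch, severity dominates the ordering and the anomaly score only breaks ties within a severity level, which mirrors the README's statement that statistical anomalies are supplementary evidence. With `review_fraction=0.1` the budget would be a single record, but the guard raises it to two so that both critical findings stay in the queue.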
