sunilgentyala/model-provenance-guard

GitHub: sunilgentyala/model-provenance-guard

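This project provides cryptographic provenance verification and binary security checks for CI/CD pipelines, guarding against ML supply chain risks introduced by public model repositories.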

Stars: 0 | Forks: 0

# model-provenance-guard

**Cryptographic provenance verification and binary inspection for ML model artifacts in CI/CD pipelines.**

Maintainer: [Sunil Gentyala](https://linkedin.com/in/sunil-gentyala) | [github.com/sunilgentyala](https://github.com/sunilgentyala)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) [![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)

## Why This Project Exists

GGUF, Safetensors, and PyTorch checkpoint files downloaded from public model hubs are an active and largely unmonitored supply chain risk in enterprise MLOps. Traditional SAST, DAST, and antivirus tools cannot inspect binary weight formats. This repository provides a production-ready GitHub Actions workflow and a Python inspection script that enforce cryptographic hash verification, Sigstore/Cosign signature checks, pickle opcode scanning, and Safetensors header anomaly detection before any model artifact is admitted to a downstream pipeline.

This toolkit is the operational companion to the Help Net Security column "Weaponized Weights: The Impending Supply Chain Crisis in GGUF and Safetensors" by Sunil Gentyala.

## Repository Structure

```
model-provenance-guard/
├── .github/
│   └── workflows/
│       └── model_scan.yml       # Full CI/CD provenance gate workflow
├── scripts/
│   └── verify_weights.py        # Safetensors header inspector and anomaly detector
├── registry/
│   └── trusted_models.json      # Internal trusted model hash registry (template)
├── docs/
│   └── threat-model.md          # Format-level threat model for GGUF, ST, PT
├── requirements.txt             # Python dependencies
├── .gitignore
└── README.md
```

## Checks Enforced by the Pipeline

| Control | Format | Tool |
|---|---|---|
| SHA-256 hash verification | All | Python hashlib + trusted_models.json |
| Cosign signature verification | All | sigstore/cosign-installer |
| Pickle opcode scanning | .pt / .pth | picklescan |
| Safetensors header inspection | .safetensors | verify_weights.py |
| GGUF magic byte verification | .gguf | verify_weights.py |
| Artifact quarantine on failure | All | Workflow failure + annotated summary |

## Quick Start: Deploying the Workflow

### Prerequisites

- A GitHub Actions runner with Python 3.9 or later
- `cosign` available in the runner environment (installed by the workflow via `sigstore/cosign-installer`)
- A `trusted_models.json` registry populated with approved model hashes (see the template in `registry/`)

### Step 1: Populate Your Trusted Registry

Edit `registry/trusted_models.json` to include the SHA-256 hash of every model artifact your organization has reviewed and approved:

```
{
  "models": [
    {
      "name": "mistral-7b-instruct-v0.3",
      "filename": "model.safetensors",
      "sha256": "a1b2c3d4e5f6...",
      "source_url": "https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3",
      "reviewed_by": "engineer@yourorg.com",
      "reviewed_at": "2025-04-15T10:00:00Z",
      "cosign_bundle": "model.safetensors.bundle"
    }
  ]
}
```

The hash must be computed from the artifact itself by a human reviewer at the time of the initial review, not copied from the model card on Hugging Face. The registry itself should be protected by branch protection rules that require at least one reviewer approval on any pull request that modifies it.

### Step 2: Add the Workflow

Copy `.github/workflows/model_scan.yml` into your repository. The workflow triggers on pull requests that modify any path containing model artifacts, and it also runs nightly on a schedule against your staging model cache.

Set the following repository secrets:

| Secret | Purpose |
|---|---|
| `COSIGN_PUBLIC_KEY` | PEM-encoded public key used to verify model signatures |
| `TRUSTED_REGISTRY_HASH` | SHA-256 of `trusted_models.json`, used by the registry integrity job to detect tampering |

### Step 3: Run the Header Inspector Locally

```
pip install -r requirements.txt
python scripts/verify_weights.py --file /path/to/model.safetensors --registry registry/trusted_models.json
```

For GGUF files:

```
python scripts/verify_weights.py --file /path/to/model.gguf --registry registry/trusted_models.json
```

For PyTorch checkpoints, the workflow runs picklescan automatically. You can also invoke it directly:

```
pip install picklescan
picklescan -p /path/to/model.pt
```

## Threat Model

See [docs/threat-model.md](docs/threat-model.md) for the full format-level attack surface analysis, covering pickle deserialization in .pt files, GGUF metadata injection, Safetensors header tampering, and neural backdoor scenarios.

## Integration with Existing MLOps Stacks

The workflow is designed to sit at the model ingestion boundary: between the artifact download step and the step that registers the model in your internal model registry (MLflow, Weights and Biases, SageMaker Model Registry, or equivalent). Any failure at the provenance gate should block promotion. Configure your orchestration layer (Kubeflow Pipelines, Metaflow, Airflow) to treat a non-zero exit from this workflow as a hard stop, not a warning.

## License

MIT License. See [LICENSE](LICENSE).

## Author

**Sunil Gentyala**
Lead Cybersecurity and AI Security Consultant, HCLTech
IEEE Senior Member | Cloud Security Alliance Representative | ISACA Professional Member
Creator: ContextGuard (zero-trust MCP middleware) | GSH Framework (agentic AI threat hunting)

[linkedin.com/in/sunil-gentyala](https://linkedin.com/in/sunil-gentyala) | [github.com/sunilgentyala](https://github.com/sunilgentyala)
Email: sunil.gentyala@ieee.org
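
For readers who want to see what three of the controls listed above amount to in practice (the SHA-256 registry lookup, the Safetensors header inspection, and the GGUF magic byte check), here is a minimal sketch. It is not the code in `scripts/verify_weights.py`; the function names and the 100 MB header cap are assumptions made for illustration. It relies only on the registry layout shown in Step 1 and on the published file layouts: a Safetensors file begins with an 8-byte little-endian header length followed by a JSON header, and a GGUF file begins with the ASCII bytes `GGUF`.

```python
import hashlib
import json
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"
MAX_HEADER_BYTES = 100 * 1024 * 1024  # assumed sanity cap, not taken from the project


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large weights never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted(path, registry_path):
    """Return True if the file name and hash match an entry in the registry."""
    registry = json.loads(Path(registry_path).read_text(encoding="utf-8"))
    actual = sha256_of(path)
    name = Path(path).name
    return any(
        entry.get("filename") == name and entry.get("sha256", "").lower() == actual
        for entry in registry.get("models", [])
    )


def read_safetensors_header(path):
    """Parse the JSON header, rejecting length prefixes that cannot be valid."""
    size = Path(path).stat().st_size
    with open(path, "rb") as fh:
        prefix = fh.read(8)
        if len(prefix) != 8:
            raise ValueError("file too short for a Safetensors length prefix")
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > MAX_HEADER_BYTES or 8 + header_len > size:
            raise ValueError("declared header length is implausible")
        header = json.loads(fh.read(header_len))
    if not isinstance(header, dict):
        raise ValueError("Safetensors header must be a JSON object")
    return header


def has_gguf_magic(path):
    """Check that the first four bytes are the GGUF magic."""
    with open(path, "rb") as fh:
        return fh.read(4) == GGUF_MAGIC
```

In a gate like the one this README describes, a `False` from `is_trusted` or a `ValueError` from `read_safetensors_header` would map to a non-zero exit so that the orchestration layer treats it as a hard stop. A real inspector would likely go further, for example by validating tensor offsets against the file size.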

Tags: Python, secret leak prevention, backdoor-free, machine learning security, model verification, reverse engineering tools