vulnerability-lookup/VulnTrain

A toolchain for generating AI datasets and training domain-specific models from the multi-source vulnerability data in Vulnerability-Lookup

Stars: 29 | Forks: 4

# VulnTrain

[![Latest release](https://img.shields.io/github/release/vulnerability-lookup/VulnTrain.svg?style=flat-square)](https://github.com/vulnerability-lookup/VulnTrain/releases/latest)
[![License](https://img.shields.io/github/license/vulnerability-lookup/VulnTrain.svg?style=flat-square)](https://www.gnu.org/licenses/gpl-3.0.html)
[![PyPi version](https://img.shields.io/pypi/v/VulnTrain.svg?style=flat-square)](https://pypi.org/project/VulnTrain)

VulnTrain provides a suite of commands for generating diverse AI datasets and training models from the comprehensive vulnerability data in [Vulnerability-Lookup](https://github.com/vulnerability-lookup/vulnerability-lookup). It draws on more than one million JSON records from all supported advisory sources (CVE, GitHub advisories, CSAF, PySecDB, CNVD) to build high-quality, domain-specific models.

Data from the ``vulnerability-lookup:meta`` container, including enrichment sources such as vulnrichment and Fraunhofer FKIE, is also incorporated to improve model quality.

Check out the datasets and models on Hugging Face:

[![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-xl-dark.svg)](https://huggingface.co/CIRCL)

For more information about the use of AI in Vulnerability-Lookup, see the [user manual](https://www.vulnerability-lookup.org/user-manual/ai/).

## Installation

```
pipx install VulnTrain
```

For development:

```
git clone https://github.com/vulnerability-lookup/VulnTrain.git
cd VulnTrain/
poetry install
```

## Usage

Three types of commands are available:

- **Dataset generation**: create and prepare datasets from vulnerability sources.
- **Model training**: train models on the prepared datasets.
- **Model validation**: evaluate the performance of trained models (validation, benchmarking, etc.).

### CLI commands

| Command | Purpose |
|---------|---------|
| `vulntrain-dataset-generation` | Generate datasets from vulnerability sources |
| `vulntrain-train-severity-classification` | Train a severity classifier (RoBERTa/DistilBERT) |
| `vulntrain-train-severity-cnvd-classification` | Train a severity classifier for CNVD data |
| `vulntrain-train-description-generation` | Train a GPT-2 vulnerability description generator |
| `vulntrain-train-cwe-classification` | Train a CWE classifier from patches |
| `vulntrain-validate-severity-classification` | Validate a severity model |
| `vulntrain-validate-text-generation` | Validate a text generation model |

### Models

- Severity classification: [![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/CIRCL/vulnerability-severity-classification-roberta-base)
- Description generation: [![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/CIRCL/vulnerability-description-generation-gpt2#how-to-get-started-with-the-model)

## Distributed training on HPC clusters

VulnTrain supports distributed multi-GPU training via SLURM and is suitable for EuroHPC-style GPU clusters. See the [HPC documentation](docs/hpc.md) for Conda environment setup, single-node and multi-node SLURM job scripts, and NCCL configuration.

## Documentation

See the full [documentation](docs/) for detailed usage instructions, dataset generation examples, and training recipes.

## How to cite

Bonhomme, C., & Dulaunoy, A. (2025). VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification (Version 1.4.0) [Computer software]. https://doi.org/10.48550/arXiv.2507.03607

```
@misc{bonhomme2025vlai,
    title={VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification},
    author={Cédric Bonhomme and Alexandre Dulaunoy},
    year={2025},
    eprint={2507.03607},
    archivePrefix={arXiv},
    primaryClass={cs.CR}
}
```

## License

[VulnTrain](https://github.com/vulnerability-lookup/VulnTrain) is licensed under the [GNU General Public License version 3](https://www.gnu.org/licenses/gpl-3.0.html).

```
Copyright (c) 2025-2026 Computer Incident Response Center Luxembourg (CIRCL)
Copyright (C) 2025-2026 Cédric Bonhomme - https://github.com/cedricbonhomme
Copyright (C) 2025 Léa Ulusan - https://github.com/3LS3-1F
```
