S0UGATA/security-kg

GitHub: S0UGATA/security-kg

Unifies 15 authoritative security data sources into SPO knowledge-graph triples, addressing the fragmentation of security data.


# security-kg

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/d35c0c23d2031625.svg)](https://github.com/S0UGATA/security-kg/actions/workflows/ci.yml) [![Dataset Update](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/1705e82d45031626.svg)](https://github.com/S0UGATA/security-kg/actions/workflows/update-dataset.yml) [![HuggingFace](https://img.shields.io/badge/dataset-HuggingFace-yellow)](https://huggingface.co/datasets/s0u9ata/security-kg) [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue)](https://www.python.org/downloads/) [![License](https://img.shields.io/badge/license-Apache%202.0-green)](LICENSE)

Converts security data from 15 sources into **Subject-Predicate-Object (SPO) knowledge-graph triples** in Parquet format.

Sources: [ATT&CK](https://attack.mitre.org/) · [CAPEC](https://capec.mitre.org/) · [CWE](https://cwe.mitre.org/) · [CVE](https://www.cve.org/) · [CPE](https://nvd.nist.gov/products/cpe) · [D3FEND](https://d3fend.mitre.org/) · [ATLAS](https://atlas.mitre.org/) · [CAR](https://car.mitre.org/) · [ENGAGE](https://engage.mitre.org/) · [EPSS](https://www.first.org/epss/) · [KEV](https://www.cisa.gov/known-exploited-vulnerabilities-catalog) · [Vulnrichment](https://github.com/cisagov/vulnrichment) · [GHSA](https://github.com/github/advisory-database) · [Sigma](https://github.com/SigmaHQ/sigma) · [ExploitDB](https://gitlab.com/exploit-database/exploitdb)

## Data Flow

```mermaid
---
config:
  layout: dagre
  theme: neo
---
flowchart LR
    STIX["ATT&CK STIX JSON"]:::src --> CONV["convert.py"]:::conv
    CXML["CAPEC XML"]:::src --> CONV
    WXML["CWE XML"]:::src --> CONV
    CVEJ["CVE JSON 5.x"]:::src --> CONV
    CPEJ["CPE JSON"]:::src --> CONV
    D3FJ["D3FEND JSON-LD"]:::src --> CONV
    ATLY["ATLAS YAML"]:::src --> CONV
    CARY["CAR YAML"]:::src --> CONV
    ENGJ["ENGAGE JSON"]:::src --> CONV
    EPSC["EPSS CSV"]:::src --> CONV
    KEVJ["KEV JSON"]:::src --> CONV
    VULJ["Vulnrichment JSON"]:::src --> CONV
    GHSJ["GHSA JSON"]:::src --> CONV
    SIGY["Sigma YAML"]:::src --> CONV
    EDBC["ExploitDB CSV"]:::src --> CONV

    CONV --> ATK["enterprise / mobile / ics / attack-all"]:::out --> CMB["combined.parquet"]:::conv
    CONV --> CAP["capec"]:::out --> CMB
    CONV --> CW["cwe"]:::out --> CMB
    CONV --> CVE["cve"]:::out --> CMB
    CONV --> CPE["cpe"]:::out --> CMB
    CONV --> D3F["d3fend"]:::out --> CMB
    CONV --> ATL["atlas"]:::out --> CMB
    CONV --> CAR["car"]:::out --> CMB
    CONV --> ENG["engage"]:::out --> CMB
    CONV --> EPS["epss"]:::out --> CMB
    CONV --> KEV["kev"]:::out --> CMB
    CONV --> VUL["vulnrichment"]:::out --> CMB
    CONV --> GHS["ghsa"]:::out --> CMB
    CONV --> SIG["sigma"]:::out --> CMB
    CONV --> EDB["exploitdb"]:::out --> CMB
    CMB --> HF["HuggingFace Hub"]:::hf

    classDef src fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef conv fill:#f3f4f6,stroke:#6b7280,color:#374151
    classDef out fill:#fef3c7,stroke:#f59e0b,color:#78350f
    classDef hf fill:#d1fae5,stroke:#10b981,color:#064e3b
```

## Knowledge Graph Structure

```mermaid
---
config:
  layout: dagre
  theme: neo
---
graph LR
    %% ATT&CK core
    C[Campaign]:::attack -->|attributed-to| G[Group]:::attack
    C -->|uses| T[Technique]:::attack
    G -->|uses| T
    G -->|uses| SW[Malware / Tool]:::attack
    SW -->|uses| T
    ST[Sub-technique]:::attack -->|subtechnique-of| T
    T -->|belongs-to-tactic| TAC[Tactic]:::attack
    MIT[Mitigation]:::attack -->|mitigates| T
    DC[DataComponent]:::attack -->|detects| T

    %% Defense & detection → Technique
    DT[DefensiveTechnique]:::d3fend -->|counters| T
    AN[Analytic]:::car -->|detects-technique| T
    AN -->|maps-to-d3fend| DT
    EA[EngagementActivity]:::engage -->|engages-technique| T
    AT[ATLAS Technique]:::atlas -->|related-attack-technique| T

    %% CAPEC ↔ CWE bridge
    AP[Attack Pattern]:::capec -->|maps-to-technique| T
    AP -->|related-weakness| W[Weakness]:::cwe
    W -->|related-attack-pattern| AP

    %% Vulnerability chain
    V[Vulnerability]:::cve -->|related-weakness| W
    V -->|affects-cpe| P[Platform]:::cpe
    V -.->|epss-score| ES((EPSS)):::epss
    V -.->|kev| KE((KEV)):::kev

    classDef attack fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef capec fill:#fef3c7,stroke:#f59e0b,color:#78350f
    classDef cwe fill:#fce7f3,stroke:#ec4899,color:#831843
    classDef cve fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef cpe fill:#e0e7ff,stroke:#6366f1,color:#312e81
    classDef d3fend fill:#d1fae5,stroke:#10b981,color:#064e3b
    classDef car fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef engage fill:#ede9fe,stroke:#8b5cf6,color:#4c1d95
    classDef atlas fill:#cffafe,stroke:#06b6d4,color:#164e63
    classDef epss fill:#f3f4f6,stroke:#6b7280,color:#374151
    classDef kev fill:#f3f4f6,stroke:#6b7280,color:#374151
```

## Usage

```bash
# Install dependencies
pip install -r requirements.txt

# Convert everything (all 15 sources) and produce combined.parquet
python src/convert.py

# Convert only ATT&CK
python src/convert.py --sources attack

# Convert a single ATT&CK domain
python src/convert.py --sources attack --domains enterprise

# Convert only CAPEC and CWE (skip others)
python src/convert.py --sources capec cwe

# Convert CVE, EPSS, and KEV together
python src/convert.py --sources cve epss kev

# Skip combined.parquet generation
python src/convert.py --no-combined

# Run individual converters standalone
python src/convert_attack.py
python src/convert_capec.py
python src/convert_cve.py
python src/convert_kev.py

# Use Parquet v1 format for backward compatibility (default is v2)
python src/convert.py --parquet-format v1
```

Source files are cached in `source/` by default. Files are versioned using the `Last-Modified` or `ETag` header and are re-downloaded only when the upstream source has changed. Sources that provide neither header are always re-downloaded.

Output is written to `output/`:

| File | Source | Estimated triples |
|------|--------|-------------------|
| `enterprise.parquet` | ATT&CK Enterprise | ~42K |
| `mobile.parquet` | ATT&CK Mobile | ~5K |
| `ics.parquet` | ATT&CK ICS | ~4K |
| `attack-all.parquet` | ATT&CK combined (deduplicated) | ~50K |
| `capec.parquet` | CAPEC attack patterns | ~8K |
| `cwe.parquet` | CWE weaknesses | ~15K |
| `cve.parquet` | CVE vulnerabilities | ~1.5-3M |
| `cpe.parquet` | CPE platform enumeration | ~2-4M |
| `d3fend.parquet` | D3FEND defensive techniques | ~3K |
| `atlas.parquet` | ATLAS AI/ML techniques | ~3K |
| `car.parquet` | CAR analytics | ~2K |
| `engage.parquet` | ENGAGE adversary engagement | ~2K |
| `epss.parquet` | EPSS exploit prediction scores | ~650K |
| `kev.parquet` | KEV known exploited vulnerabilities | ~9K |
| `vulnrichment.parquet` | CISA Vulnrichment (SSVC, CVSS, CWE) | ~200-400K |
| `ghsa.parquet` | GitHub Security Advisories | ~20-40K |
| `sigma.parquet` | Sigma detection rules | ~20-40K |
| `exploitdb.parquet` | ExploitDB public exploits | ~300-500K |
| `combined.parquet` | All sources combined (deduplicated) | ~5-10M |

## Cross-Source Links

```
ATT&CK <──> CAPEC <──> CWE <──> CVE <──> CPE
  ^                              ^
  ├── D3FEND (counters)          ├── EPSS (scores)
  ├── ATLAS (AI parallel)        ├── KEV (exploited)
  ├── CAR (detects)              ├── Vulnrichment (SSVC/CVSS)
  ├── ENGAGE (engages)           ├── GHSA (advisories)
  └── Sigma (detects)            ├── Sigma (related CVE)
                                 └── ExploitDB (exploits)
```

## Testing

```bash
# Unit tests (no network access required)
python -m pytest tests/ -v --ignore=tests/test_integration.py

# Integration tests (downloads real ATT&CK data)
python -m pytest tests/test_integration.py -v

# All tests
python -m pytest tests/ -v
```

## HuggingFace Dataset

The dataset is published on HuggingFace Hub at [s0u9ata/security-kg](https://huggingface.co/datasets/s0u9ata/security-kg) and is updated automatically every week via GitHub Actions. For schema details, example queries, and usage with the `datasets` library, see the [dataset card](hf_dataset/README.md).

## Future Data Sources

The following sources were researched and evaluated, then deferred. They may be added in a future release.

### High-Value Deferred Sources

| Source | Format | Reason deferred |
|--------|--------|-----------------|
| [MISP Galaxies](https://github.com/MISP/misp-galaxy) | JSON | Excellent structure with ATT&CK mappings; more than 100 galaxy clusters covering threat actors, tools, and sectors. Deferred to keep the initial scope manageable. |
| [EUVD](https://euvd.enisa.europa.eu/) | JSON | EU vulnerability database; structured and linked to CVE. Relatively new (launched in 2025), and the API is not yet mature. |
| [OSV](https://osv.dev/) | JSON | Google's open-source vulnerability database with bulk download support. Focused on packages rather than CVE-level vulnerabilities. |

### International Sources Investigated

| Source | Country/Region | Status |
|--------|----------------|--------|
| [JVN iPedia](https://jvndb.jvn.jp/) | Japan | RSS feeds available, linked to CVE, bilingual (Japanese/English). Bulk structured data access is limited. |
| [ThaiCERT](https://apt.thaicert.or.th/) | Thailand | Structured threat cards for 504 APT groups. Niche coverage, limited API. |
| [CNNVD](http://www.cnnvd.org.cn/) / [CNVD](https://www.cnvd.org.cn/) | China | Access restricted for IPs outside mainland China, questionable data quality, significant lag behind NVD. |
| [KrCERT](https://www.krcert.or.kr/) / KNVD | South Korea | Limited public API, Korean only. |
| [BSI](https://www.bsi.bund.de/) | Germany | Publishes advisories in German; no bulk structured feed. |
| [ANSSI](https://www.cert.ssi.gouv.fr/) | France | Advisories and IOC reports in French; limited machine-readable data. |
| [CERT-In](https://www.cert-in.org.in/) | India | A CVE CNA; publishes advisories but offers no bulk structured data download. |
| [AusCERT](https://auscert.org.au/) | Australia | RSS feeds available, in English. Limited structured data beyond advisories. |
| [CERT-EU](https://cert.europa.eu/) | EU | Threat landscape reports; limited machine-readable data. |
| [BDU (FSTEC)](https://bdu.fstec.ru/) | Russia | Poor data quality, slow updates, access restrictions. |

### Specialized / Niche Sources

| Source | Reason not included |
|--------|---------------------|
| [MAEC](https://maecproject.github.io/) | Malware attribute enumeration. Low community adoption and limited structured data available. |
| [OVAL](https://oval.mitre.org/) | Compliance-focused XML definitions. Very large, and oriented toward system configuration rather than threat context. |
| [CCE](https://ncp.nist.gov/cce) | Configuration enumeration (Excel format). Narrow scope, limited cross-linking potential. |

## License

Apache 2.0
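
## Example: Following Cross-Source Links

The cross-source links in this README (CVE to CWE to CAPEC to ATT&CK) can be walked as a chain of predicates once the triples are in memory. The following is a minimal sketch, not the project's own code. It assumes triples are plain `(subject, predicate, object)` tuples; the IDs are made-up placeholders, and the helper names `build_index` and `follow` are hypothetical. Only the predicate names are taken from the knowledge graph diagram. The actual Parquet column names are documented in the dataset card and are not assumed here.

```python
from collections import defaultdict

# Hypothetical sample triples. The IDs are placeholders, not real records;
# the predicate names are the ones shown in the knowledge graph diagram.
TRIPLES = [
    ("CVE-0000-0001", "related-weakness", "CWE-0001"),
    ("CVE-0000-0001", "affects-cpe", "cpe:2.3:a:example:app:1.0"),
    ("CWE-0001", "related-attack-pattern", "CAPEC-0001"),
    ("CAPEC-0001", "maps-to-technique", "T0001"),
    ("M0001", "mitigates", "T0001"),
]


def build_index(triples):
    """Map each (subject, predicate) pair to the list of its objects."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].append(obj)
    return index


def follow(index, start, predicates):
    """Walk from `start` along the given predicates, one hop per predicate."""
    frontier = {start}
    for predicate in predicates:
        frontier = {
            obj
            for node in frontier
            for obj in index.get((node, predicate), [])
        }
    return sorted(frontier)


INDEX = build_index(TRIPLES)

# Vulnerability -> Weakness -> Attack Pattern -> Technique
CVE_TO_TECHNIQUE = ["related-weakness", "related-attack-pattern", "maps-to-technique"]

print(follow(INDEX, "CVE-0000-0001", CVE_TO_TECHNIQUE))
```

Indexing on `(subject, predicate)` keeps each hop a dictionary lookup. For the full `combined.parquet` (several million triples), a columnar join in a DataFrame or SQL engine would likely be a better fit than an in-memory dictionary.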

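Tags: ATLAS, CAPEC, CVE, D3FEND, EPSS, ETL, ExploitDB, GPT, Hugging Face, JavaCC, KEV, MITRE, Parquet, Python, RAG knowledge base, SPO, STIX, triples, LLM fine-tuning data, threat intelligence, security datasets, developer tools, digital signatures, data cleaning, backdoor-free, vulnerability management, knowledge extraction, cybersecurity, reverse-engineering tools, privacy protection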