xiexie-qiuligao/cve-

GitHub: xiexie-qiuligao/cve-

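A tool for building and querying a high-signal, structured CVE knowledge base for automated penetration-testing agents and RAG, with offline operation and multiple retrieval modes.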

Stars: 6 | Forks: 0

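# CVE KB

A project for building and querying a high-signal CVE knowledge base for automated penetration-testing agents and RAG. Rather than dumping whole PoCs, blog posts, and vulnerability write-ups into a vector store, it organizes public PoC signals, NVD metadata, detection templates, and attack-surface clues into structured CVE cards suited to LLM retrieval and reasoning.

The defaults now point to the full large database:

- Year range: `2018-2026`
- Default knowledge base directory: `out-prod-2018plus`
- Default vector store directory: `vectordb-2018plus`
- Offline mode is preferred by default

## Suitable scenarios

- A CVE retrieval foundation for automated penetration-testing agents
- Fast vulnerability lookup in competitions, CTFs, and real engagements
- Vulnerability knowledge recall for cloud security, Kubernetes, AI infra, identity systems, and CI/CD
- Reverse lookup of candidate CVEs from a target's ports, paths, headers, versions, and product traits
- Following up on CVE search results to obtain exploit, detection, and chain clues

## Current capabilities

- Builds structured CVE records from `PoC-in-GitHub`
- Generates three chunk types: `overview / exploit / detection`
- Generates a BM25 retrieval index
- Generates an attack chain graph
- Generates an attack-surface fingerprint library
- Supports `bm25 / vector / hybrid`
- Supports an offline local vector store
- Supports an offline `NVD feed`
- Supports offline `nuclei-templates`
- Supports fine-grained tags:
  - `cloud`
  - `k8s`
  - `ai_infra`
  - `identity`
  - `cicd`

## Why not an ordinary CVE vector store

The ordinary approach usually has these problems:

- Raw text is noisy; README and article content pollutes the embeddings
- Searching by product name easily drifts semantically to unrelated vulnerabilities
- Structured clues such as versions, ports, paths, headers, and payloads are hard to preserve
- It is awkward for an agent to decide afterwards whether a target is exploitable, how to probe it, and how to chain vulnerabilities

The goal of this project is to turn every CVE into a high-signal retrieval card. The core fields include:

- `products / vendors`
- `affected_versions`
- `vuln_types`
- `asset_tags / domain_tags`
- `cloud_tags / k8s_tags / ai_infra_tags / identity_tags / cicd_tags`
- `attack_surface`
- `exploit_recipe`
- `detection_query`
- `search_terms`
- `retrieval_text`

## Repository structure

### Unified entry point

- `kb.py`
  - The recommended entry point
  - Provides `build / query / ask / target / exploit / detect / chain / fingerprint / ingest`
  - Points to the `2018-2026` large database by default

### Build entry point

- `build_cve_kb.py`
  - A lower-level build CLI
  - Suitable when you need to pass finer-grained parameters

### Retrieval entry point

- `retriever.py`
  - The local retriever
  - Supports:
    - `bm25`
    - `vector`
    - `hybrid`
    - Plain-language `ask`
    - Structured `target`
    - Fingerprint matching
    - exploit / detection / chain

### Vector ingestion

- `ingest.py`
  - Embeds `chunks.jsonl` and writes the result to FAISS
  - Supports local embedding

### Bridge

- `bridge.py`
  - Converts `records.jsonl` into an entry format better suited to agents and playgrounds

### Manual curation template

- `curated_overrides.example.json`
  - Manual overrides for high-value CVEs, applied before a competition

### Core code directory

- `cve_kb_builder/builder.py`
  - The main pipeline
  - Merges PoCs, NVD data, READMEs, references, nuclei templates, and so on into the final knowledge base
- `cve_kb_builder/taxonomy.py`
  - Inference of tags, rules, attack surface, and exploit recipes
- `cve_kb_builder/sources.py`
  - Data source reading and enrichment
  - Currently supports:
    - `PoC-in-GitHub`
    - Online NVD / EPSS / KEV
    - Offline `NVD feed`
    - GitHub README
    - Deep crawling of references
    - Offline `nuclei-templates`
- `cve_kb_builder/indexing.py`
  - BM25, chain graph, and fingerprint generation
- `cve_kb_builder/models.py`
  - Data structure definitions

## Output files

A build produces the following files:

- `records.jsonl`
  - Main records
- `chunks.jsonl`
  - Chunks for embedding / RAG
- `bm25_docs.jsonl`
  - BM25 documents
- `bm25_index.json`
  - BM25 index
- `chain_graph.json`
  - Attack chain relationships
- `fingerprints.jsonl`
  - Attack surface / fingerprint library
- `summary.json`
  - Build summary

## Default database

The current default is the large database:

- `out-prod-2018plus`
- `vectordb-2018plus`

It covers:

- `2018`
- `2019`
- `2020`
- `2021`
- `2022`
- `2023`
- `2024`
- `2025`
- `2026`

The current full local build contains:

- `7533` CVEs
- `22599` chunks

## Quick start

### 1. Prepare the data sources

First clone `PoC-in-GitHub` locally:

    git clone https://github.com/nomi-sec/PoC-in-GitHub.git source_repo

For offline enrichment, it is also recommended to prepare:

    git clone https://github.com/projectdiscovery/nuclei-templates offline_feeds/nuclei-templates

Place the offline NVD feed in:

- `offline_feeds/nvd/`

File names follow this format, for example:

- `nvdcve-2.0-2024.json.gz`
- `nvdcve-2.0-2025.json.gz`

### 2. Build the full local database

    python kb.py build --nvd-feed-dir offline_feeds/nvd --nuclei-templates-dir offline_feeds/nuclei-templates

By default this builds:

- `2018-2026`
- Output to `out-prod-2018plus`

### 3. Build the local vector store

    python kb.py ingest --provider local

By default this writes:

- `out-prod-2018plus/chunks.jsonl`

into:

- `vectordb-2018plus`

## Most common commands

### Plain query

    python kb.py query "Kubernetes ingress admission controller rce" --mode hybrid --provider local

### Plain-language question

    python kb.py ask "目标是个 443 的管理入口，像防火墙或者网关，登录页面带远程接入入口，版本大概 10.2.4，想找未授权命令执行"

The question above reads: "The target is a management entry point on port 443 that looks like a firewall or gateway; the login page has a remote access entry; the version is roughly 10.2.4; looking for unauthenticated command execution."

### Structured target retrieval

    python kb.py target --product keycloak --version 24.0.3 --path /realms/ --text "oidc token authorization bypass"

### exploit chunk

    python kb.py exploit CVE-2024-3400

### detection chunk

    python kb.py detect CVE-2024-3400

### chain

    python kb.py chain CVE-2024-3400

### fingerprint

    python kb.py fingerprint "port 443 open, remote access login portal, firewall gateway"

## Retrieval modes

- `bm25`
  - The fastest
  - Suited to queries with an explicit product name, path, port, or header
- `vector`
  - Suited to more natural-language, semantic questions
- `hybrid`
  - The recommended mode
  - BM25 handles the hard signals; vectors add semantic coverage

## Offline enrichment sources

So that CVEs from earlier years also keep as many version, path, port, and detection clues as possible, the project currently integrates:

- Offline `NVD feed`
  - Fills in:
    - `products`
    - `vendors`
    - `affected_versions`
    - `CPE`
- Offline `nuclei-templates`
  - Fills in:
    - `paths`
    - `headers`
    - `request_examples`
    - Some ports
    - `detection_query`

## Content that should not be committed to GitHub

These items generally should not be pushed to the source repository:

- `source_repo/`
- `offline_feeds/`
- `out-*`
- `vectordb*`
- `*.log`
- `__pycache__/`
- `.pytest_cache/`
- `.claude/`

## Installing vector dependencies

To run the local vector store, it is recommended to install:

    pip install langchain-core langchain-community faiss-cpu fastembed pyyaml

To use online embedding, additionally install:

    pip install langchain-openai

## One-line summary

This is a CVE retrieval foundation for automated penetration-testing agents and RAG, not an ordinary reference archive. It emphasizes:

- High-signal structured records
- Offline availability
- Version / port / path / header / exploit clues
- Suitability for continued reasoning and action by an LLM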
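
The README describes `hybrid` mode as BM25 for hard signals plus vectors for semantic coverage, but it does not say how the two result lists are merged. As one possible illustration only, the sketch below uses reciprocal rank fusion (RRF), a common technique for combining ranked lists. It is an assumption, not the project's documented method: the function name `reciprocal_rank_fusion`, the constant `k=60`, and the placeholder IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID earns 1 / (k + rank) from every list it appears in
    (rank starts at 1), so IDs ranked well by several retrievers
    rise to the top. Ties are broken alphabetically for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical placeholder IDs standing in for CVE identifiers.
bm25_hits = ["cve-a", "cve-b", "cve-c"]    # e.g. exact product/path matches
vector_hits = ["cve-d", "cve-a", "cve-b"]  # e.g. semantic neighbours

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

IDs returned by both retrievers outrank IDs returned by only one, which is the behaviour one would want if BM25 hits are treated as hard evidence and vector hits as supporting context.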

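Tags: AI infrastructure security, BM25, Chrome Headless, CI/CD security, CISA project, CTF security, CVE-2024, CVE-2025, CVE knowledge base, DevSecOps, Exploit-DB, Google, Kubernetes security, Llama, Nuclei, NVD data, PoC management, web report viewer, XSS, upstream proxy, vector retrieval, domain collection, LLM RAG, threat intelligence, password management, developer tools, attack chain analysis, attack surface fingerprinting, hybrid retrieval, vulnerability intelligence, offline knowledge base, structured data, network security, automated penetration-testing agent, semantic search, identity security, reverse engineering tools, privacy protection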