enma-69/vxug-scraper

GitHub: enma-69/vxug-scraper

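A bulk downloader for the VX-Underground malware archive, aimed at security researchers. It drives a real browser to get past Cloudflare, then classifies the samples and generates reports automatically.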

Stars: 0 | Forks: 0

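# vxug-scraper

[![CI](https://github.com/YOUR_USER/vxug-scraper/actions/workflows/ci.yml/badge.svg)](https://github.com/YOUR_USER/vxug-scraper/actions/workflows/ci.yml) [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) [![Platform](https://img.shields.io/badge/platform-Windows%20%7C%20Linux%20%7C%20macOS-lightgrey)](#requirements)

## About

[VX-Underground](https://vx-underground.org) hosts the largest publicly available malware research archive in the world: builders, research papers, and more than one million malware samples organized by family, platform, and year. The site sits behind Cloudflare and uses **Phoenix LiveView** (Elixir) for its file browser, which means the entire folder tree is navigated over a single WebSocket, with no plain `<a href>` links to scrape.

`vxug-scraper` solves both problems, along with the operational issues that follow from them:

| Problem | Solution |
|---|---|
| **Cloudflare Bot Fight** | Launches a real `msedge.exe` binary via [pydoll-python](https://github.com/thalissonvs/pydoll). Genuine TLS/JA3 fingerprint, real V8, real `navigator.*` APIs: CF sees an ordinary human-driven browser. |
| **Phoenix LiveView navigation** | The page loads once and the session stays open. Folders are traversed by injecting clicks into `[phx-click]` elements, so there are no page reloads and no new CF challenges. |
| **Presigned S3 URLs expire (~1 hour)** | Crawling and downloading run as a pipelined producer/consumer: each file is fetched the moment it is discovered, never after its URL has expired. |
| **Interrupted runs** | Every discovered URL is written to SQLite before its download starts. Re-running the same command immediately skips completed files. |
| **Rate limiting** | Exponential backoff plus `Retry-After` header handling, configurable worker concurrency, and an optional watchdog that restarts the run automatically on a crash or hang. |

### What it collects

| Section | Contents | Typical size |
|---|---|---|
| `Builders` | RAT builders, crypters, binders, stealers, botnets | ~5 GB |
| `Papers` | Malware analysis reports, research papers, PoCs | ~2 GB |
| `Samples/Argus Collection` | Malware samples tagged Argus | ~10 GB |
| `Samples/Virusshare Collection` | Samples tagged VirusShare | ~50 GB |
| `Samples/Bazaar Collection` | Samples tagged MalwareBazaar | ~50 GB+ |

### Classification pipeline

Every downloaded file is automatically tagged along three dimensions:

- **Platform**: `Windows` `Linux` `Android` `macOS` `Script` `Java` `Document` `Archive`
- **Malware class**: `Ransomware` `RAT` `Stealer` `Backdoor` `Botnet` `Loader` `Banker` `Worm` `Rootkit` `Exploit` `Cryptominer` `Spyware` `Trojan`
- **Impact**: `Critical` → `High` → `Medium` → `Low` → `Info`

Results are stored in SQLite and in the generated Markdown report, so they can be queried without re-downloading anything.

## How the Cloudflare bypass works

```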
┌─────────────────────────────────────────────────────────────────┐
│  vxdl.py                                                        │
│                                                                 │
│  1. Launch real Edge via pydoll-python (genuine TLS/JA3,        │
│     real V8, real navigator.* APIs - CF sees a real browser)    │
│                                                                 │
│  2. Navigate ONCE to the target section                         │
│     └─ wait for CF challenge to clear (~5-90 s)                 │
│                                                                 │
│  3. Crawl Phoenix LiveView tree by injecting clicks             │
│     into [phx-click] elements - stays in the same               │
│     WebSocket session (no page reloads → no new CF checks)      │
│                                                                 │
│  4. For each discovered file: push to asyncio download queue    │
│     └─ aiohttp workers download immediately (presigned S3       │
│        URLs expire in ~1 h, so crawl + download are pipelined)  │
│                                                                 │
│  5. After all files done: sanitation → classification → report  │
└─────────────────────────────────────────────────────────────────┘
```

| CF layer | Bypass mechanism |
|---|---|
| TLS fingerprint (JA3) | Real Edge binary: a genuine JA3, not that of Python requests |
| Browser JS checks | Real V8 engine: `navigator.webdriver`, `chrome.*`, and `permissions.*` are all genuine |
| Bot challenge page | A wait loop polls `document.title` until the words "checking"/"moment" disappear |
| Rate limiting | Exponential backoff plus `Retry-After` header handling, configurable concurrency |
| Session continuity | One navigation per section; folders are then visited through LiveView click injection (no page reloads) |

## Requirements

| Requirement | Notes |
|---|---|
| Python 3.10+ | 3.11+ recommended |
| Microsoft Edge (stable) | [Download here](https://www.microsoft.com/edge). It must be installed, not a portable build |
| Disk space | About 5 GB for Builders, about 10 GB for Papers, 500 GB+ for the full Samples section |
| Network | A stable connection; interrupted downloads resume automatically |

## Installation

```
git clone https://github.com/YOUR_USER/vxug-scraper.git
cd vxug-scraper
pip install -r requirements.txt
```

## Usage

### Basic usage

```
# Builders section only (default, ~5 GB)
python vxdl.py

# Multiple sections
python vxdl.py --sections Papers Builders

# Sub-collection (the name contains a space, so quote it)
python vxdl.py --sections "Samples/Argus Collection" --max-depth 6

# Test run: only 3 top-level folders, shows results in about 60 seconds
python vxdl.py --limit 3
```

### Custom output directory

```
# Via flag
python vxdl.py --out C:\Research\vxug

# Via environment variable (persists across runs in the same shell session)
# Windows
set VXUG_OUT=C:\Research\vxug
# Linux / macOS
export VXUG_OUT=/data/vxug
python vxdl.py
```

### Long-running / detached runs (survive terminal close)

```
# Start the watchdog, which keeps vxdl.py alive across crashes and hangs
python launch.py --sections Builders Papers --hours 72

# Full Samples collections (very large, takes days)
python launch.py --sections "Samples/Argus Collection" "Samples/Virusshare Collection" --hours 168 --concurrency 6
```

### Resuming an interrupted run

```
# Just re-run: URLs that were already downloaded are skipped (SQLite dedup)
python vxdl.py --sections Builders
```

### Report only (read-only, safe to run while the downloader is running)

```
python report.py                                  # auto-finds output/vxdl.db
python report.py --db output/vxdl.db --out output/
```

## All CLI flags

### `vxdl.py`: main pipeline

| Flag | Default | Description |
|---|---|---|
| `--sections` | `Builders` | One or more sections: `Builders` `Papers` `Samples` `"Samples/Argus Collection"` `"Samples/Virusshare Collection"` `"Samples/Bazaar Collection"` |
| `--out` | `./output` | Download root directory (the default can be overridden with the `VXUG_OUT` environment variable) |
| `--concurrency` | `4` | Number of parallel download workers |
| `--cf-timeout` | `90` | Seconds to wait for Cloudflare to clear on each navigation |
| `--max-depth` | `5` | Maximum folder recursion depth (Builders needs 5, Samples needs 6) |
| `--limit` | `0` (all) | Limit the run to the first N top-level folders; useful for testing |
| `--stage` | full pipeline | Run a single stage only: `download` `sanitize` `classify` `report` |
| `--force` | off | Skip the feasibility check |

### `watchdog.py`: auto-restart daemon

| Flag | Default | Description |
|---|---|---|
| `--sections` | `Builders` | Passed straight through to `vxdl.py` |
| `--hours` | `48` | Total runtime budget |
| `--hang` | `360` | Seconds without disk/log growth before the run is considered hung |
| `--concurrency` | `4` | Passed straight through to `vxdl.py` |

### `launch.py`: detached launcher

Takes the same flags as `watchdog.py`. It starts the watchdog in a fully detached process that survives even if the terminal is closed.

### `report.py`: standalone report

| Flag | Default | Description |
|---|---|---|
| `--db` | `./output/vxdl.db` | Path to the SQLite database |
| `--out` | `./output` | Directory where `report.md` and `report.txt` are written |

## Environment variables

| Variable | Default | Description |
|---|---|---|
| `VXUG_OUT` | `./output` | Output directory for downloads, the database, and logs |
| `VXUG_EDGE` | auto-detected | Full path to `msedge.exe` |
| `VXUG_CONCURRENCY` | `4` | Default number of download workers |

## Pipeline stages

```
Stage 0  Feasibility     Python ≥3.10, packages, Edge binary, disk space
Stage 1  Download        Crawl LiveView tree → discover URLs → download concurrently
                         Resume: already-done URLs skipped via SQLite
Stage 2  Sanitation      Check magic bytes (ZIP, 7z, RAR, PDF, MZ, ELF…) + SHA-256
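Stage 3  Classification  Tag platform / impact / malware class from path + extension
Stage 4  Report          Print + save report.txt to output directory
```

## Output structure

```
output/
├── vxdl.db          SQLite: all URLs, status, SHA-256, local path, timestamps
├── manifest.csv     Append-only download log (ts, section, folder, file, size, sha256, url)
├── vxdl.log         Full pipeline log
├── report.txt       Collection summary (generated by Stage 4)
├── Builders/
│   ├── NjRat/
│   │   ├── NjRat 0.7d.zip
│   │   └── ...
│   ├── DarkComet/
│   └── ...
├── Papers/
└── Samples/
    ├── Argus Collection/
    ├── Virusshare Collection/
    └── Bazaar Collection/
```

## File overview

| File | Purpose |
|---|---|
| [`vxdl.py`](vxdl.py) | Main pipeline: crawl + download + sanitize + classify + report |
| [`watchdog.py`](watchdog.py) | Restarts `vxdl.py` on a crash or hang, staying within the time budget |
| [`launch.py`](launch.py) | Launches the watchdog detached (survives terminal/session close) |
| [`report.py`](report.py) | Generates a standalone report from an existing database; safe to run during a download |
| [`requirements.txt`](requirements.txt) | Python dependencies |
| [`pyproject.toml`](pyproject.toml) | Project metadata |

## Troubleshooting

**Edge not found**

Set `VXUG_EDGE` to the full path of `msedge.exe`, or install Edge from [microsoft.com/edge](https://www.microsoft.com/edge).

**Cloudflare never clears**

Raise the timeout, for example `--cf-timeout 120`. If it still fails, the IP may be temporarily blocked; wait a few minutes and try again. The watchdog handles this automatically.

**`found 0` files**

Increase `--max-depth` (the default is 5; some Samples sub-collections need 6+).

**Download stuck / no progress**

The watchdog detects hangs (by default, 360 seconds without disk growth) and restarts automatically. Use `launch.py` for long sessions.

**Resuming after a crash**

Re-run the same command. The SQLite database tracks every URL, and completed downloads are skipped immediately.

**`[Errno 22]` on file names on Windows**

`_safe_name()` strips illegal characters (`< > : " | ? *`) as well as trailing dots and spaces. If you still hit this error, check `vxdl.log` for the offending path.

## Legal / ethics notice

This tool is intended for malware research, threat intelligence, and academic study. VX-Underground publishes these samples to serve the security research community. Do not execute samples outside an isolated analysis environment (a VM with no network access or a dedicated sandbox). You are solely responsible for how you use this tool and the content you download.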
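
To make the `[Errno 22]` troubleshooting entry more concrete, here is a minimal sketch of what a sanitizer along the lines of `_safe_name()` might look like. It is not the project's actual implementation: the function name `safe_name`, the `"unnamed"` fallback, and the choice to delete (rather than replace) illegal characters are assumptions made for illustration.

```python
import re

# Characters the troubleshooting note lists as illegal in Windows file names.
_ILLEGAL_CHARS = re.compile(r'[<>:"|?*]')


def safe_name(name: str, fallback: str = "unnamed") -> str:
    """Hypothetical stand-in for the project's _safe_name() helper."""
    # Drop every character that Windows refuses in a file name.
    cleaned = _ILLEGAL_CHARS.sub("", name)
    # Windows also rejects names that end in a dot or a space.
    cleaned = cleaned.rstrip(". ")
    # Avoid returning an empty name if everything was stripped.
    return cleaned or fallback
```

A name such as `NjRat 0.7d?.zip` would come out as `NjRat 0.7d.zip` under these assumptions. Path separators are deliberately left out of the character set because the note above does not list them.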

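The resume behavior described in this README (every discovered URL is recorded in SQLite before its download starts, and completed URLs are skipped on the next run) could be sketched roughly as follows. The table name `files`, its columns, and all function names here are hypothetical; the real schema of `vxdl.db` may differ.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a tracking database with one row per URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def record_discovered(conn: sqlite3.Connection, url: str) -> None:
    # INSERT OR IGNORE leaves an existing row (and its status) untouched,
    # so re-discovering a finished URL does not reset it to 'pending'.
    conn.execute("INSERT OR IGNORE INTO files (url) VALUES (?)", (url,))
    conn.commit()


def mark_done(conn: sqlite3.Connection, url: str) -> None:
    conn.execute("UPDATE files SET status = 'done' WHERE url = ?", (url,))
    conn.commit()


def needs_download(conn: sqlite3.Connection, url: str) -> bool:
    row = conn.execute(
        "SELECT status FROM files WHERE url = ?", (url,)
    ).fetchone()
    # Unknown URLs and URLs that never finished both need a download.
    return row is None or row[0] != "done"
```

Writing the row before the download begins is what makes a crash recoverable: after a restart, anything still marked `pending` is simply fetched again.
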
Tags: BeEF, Elixir, anti-scraping bypass, malware, web scraper, automated downloader, computer forensics, reverse engineering tools