DXN1-termux/darkwebscraper-pro

GitHub: DXN1-termux/darkwebscraper-pro

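A dark web `.onion` link collection and intelligence engine with ten-layer deduplication and multicore parallel processing, designed for zero-duplicate data collection across sessions.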

Stars: 0 | Forks: 0

Made with ❤️ by DXN1

🕸️ The Most Dangerous Onion Intelligence Engine Ever 🕸️

Born in the shadows. Runs in the dark. Sees everything.

[![Python](https://img.shields.io/badge/Python-3.8%2B-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org) [![License](https://img.shields.io/badge/License-MIT-22c55e?style=for-the-badge&logo=open-source-initiative&logoColor=white)](LICENSE) [![Platform](https://img.shields.io/badge/Termux%20%7C%20Linux%20%7C%20macOS-supported-f97316?style=for-the-badge&logo=linux&logoColor=white)](https://termux.dev) [![Tor](https://img.shields.io/badge/Tor-Compatible-7D4698?style=for-the-badge&logo=tor-project&logoColor=white)](https://www.torproject.org) [![Dedup Layers](https://img.shields.io/badge/Dedup%20Layers-10%20Layers%20Deep-dc2626?style=for-the-badge&logo=databricks&logoColor=white)]() [![Parallel](https://img.shields.io/badge/Processing-Multicore%20Parallel-8b5cf6?style=for-the-badge&logo=intel&logoColor=white)]() [![State](https://img.shields.io/badge/State-Persistent%20JSON-0ea5e9?style=for-the-badge&logo=json&logoColor=white)]() [![Fuzzy](https://img.shields.io/badge/Matching-Fuzzy%20%2B%20Phonetic-ec4899?style=for-the-badge&logo=soundcloud&logoColor=white)]() [![PRs Welcome](https://img.shields.io/badge/PRs-Welcome%20👊-ff69b4?style=for-the-badge)](https://github.com/DXN1-termux/darkwebscraper-pro/pulls) [![Stars](https://img.shields.io/github/stars/DXN1-termux/darkwebscraper-pro?style=for-the-badge&logo=github&color=fbbf24)](https://github.com/DXN1-termux/darkwebscraper-pro/stargazers) [![Forks](https://img.shields.io/github/forks/DXN1-termux/darkwebscraper-pro?style=for-the-badge&logo=github&color=34d399)](https://github.com/DXN1-termux/darkwebscraper-pro/network/members) [![Last Commit](https://img.shields.io/github/last-commit/DXN1-termux/darkwebscraper-pro?style=for-the-badge&logo=git&color=f59e0b)](https://github.com/DXN1-termux/darkwebscraper-pro/commits/main) [![Repo Size](https://img.shields.io/github/repo-size/DXN1-termux/darkwebscraper-pro?style=for-the-badge&logo=files&color=a78bfa)]()
[![Code Size](https://img.shields.io/github/languages/code-size/DXN1-termux/darkwebscraper-pro?style=for-the-badge&color=fb923c)]() [![Top Language](https://img.shields.io/github/languages/top/DXN1-termux/darkwebscraper-pro?style=for-the-badge&logo=python&color=3b82f6)]()

## 🩸 What Is DarkWeb Scraper Pro?

**DarkWeb Scraper Pro** is more than a scraper. It is a **10-layer onion intelligence system**, designed from the ground up to collect, validate, and deduplicate `.onion` links with high precision. It runs in Termux on your phone, on your laptop, anywhere. And it **never collects the same link twice**, whether across restarts, across sessions, or across dimensions.

Built for OSINT researchers, security engineers, red teamers, and anyone who needs **clean, validated, zero-noise dark web intelligence data**.

## 🧬 Full System Architecture

```
╔══════════════════════════════════════════════════════════════════════════╗
║                  DarkWeb Scraper Pro v2.0 — System Map                   ║
╠══════════════════════╦══════════════════════╦════════════════════════════╣
║  🔍 SEARCH ENGINE    ║  🧠 DEDUP ENGINE v2  ║  ⚡ PARALLEL PROCESSOR     ║
║  ──────────────────  ║  ──────────────────  ║  ────────────────────────  ║
║  torch.cx querying   ║  10 dedup layers     ║  ProcessPoolExecutor       ║
║  HTML + BS4 parse    ║  Hash + Domain       ║  Auto CPU core detect      ║
║  Redirect decoding   ║  Phonetic + Fuzzy    ║  min(max(4, cores), 6)     ║
║  URL extraction      ║  Canonical + Bigram  ║  Chunk splitting           ║
║  .onion validation   ║  Semantic fingerprint║  Cross-chunk merge pass    ║
╠══════════════════════╬══════════════════════╬════════════════════════════╣
║  💾 PERSISTENCE      ║  📊 STAGE PRINTER    ║  📂 FILE ENGINE            ║
║  ──────────────────  ║  ──────────────────  ║  ────────────────────────  ║
║  seen_data.json      ║  Emoji stage markers ║  Box-drawing output        ║
║  8 state vectors     ║  Timed stage blocks  ║  Append + full rewrite     ║
║  Load on boot        ║  Stats boxes         ║  UTF-8 enforced            ║
║  Save after each add ║  Dividers + banners  ║  Integrity checking        ║
╚══════════════════════╩══════════════════════╩════════════════════════════╝
```

## ⚡ Features at a Glance

| 🔥 Feature | 💬 What it does |
|:---|:---|
| 🔍 **Live search** | Queries `torch.cx` (a Tor search engine) in real time over HTTPS |
| 🧹 **10-layer dedup** | The most extreme deduplication system you have ever seen |
| ⚡ **Parallel cores** | Spawns up to 6 worker processes with `ProcessPoolExecutor` |
| 💾 **Persistent state** | `seen_data.json` keeps all dedup state between sessions |
| 🔗 **Integrity checks** | Flags links that do not properly end in `.onion` or `.onion/` |
| 📊 **Rich output** | Emoji stages, timing info, stats boxes, ASCII dividers |
| 📂 **Structured file** | `darkweb.txt` uses box-drawing characters, with a timestamp on every entry |
| 📥 **Android export** | Type `download` → auto-copies to your Downloads folder |
| 🔄 **Incremental updates** | Re-run anytime. Only brand-new, unique results are saved. |
| 🤖 **Fuzzy engine** | Jaccard similarity + SequenceMatcher + Soundex phonetic matching |
| 🧠 **Semantic matching** | Keyword fingerprint + bigram word hashes |
| 🌐 **Canonical URLs** | Strips index pages, version strings, id parameters |
| 🏎️ **CPU detection** | `min(max(4, os.cpu_count()), 6)`: always 4 to 6 workers |

## 🧠 The Dedup Engine: All 10 Layers Explained

### 🗺️ The Dedup Gauntlet: Full Flow

```
                   ┌─────────────────┐
                   │  NEW LINK INPUT │
                   │  (title + url)  │
                   └────────┬────────┘
                            │
               ┌────────────▼────────────┐
               │  PRE-COMPUTE ALL KEYS   │
               │  ─────────────────────  │
               │  • domain extract       │
               │  • normalize URL        │
               │  • content MD5 hash     │
               │  • path signature       │
               │  • canonical URL        │
               │  • phonetic hash        │
               │  • content fingerprint  │
               └────────────┬────────────┘
                            │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 1 — Exact Content Hash                      ║
  ║  MD5( title.lower() + "|" + normalized_url )       ║
  ║  Cheapest possible. O(1) set lookup.               ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS
     💀 DUPE                │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 2 — Domain Duplicate                        ║
  ║  Extracts the raw .onion base                      ║
  ║  domain (16 or 56 char hash).                      ║
  ║  Same domain? You're dead.                         ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS
     💀 DUPE                │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 3 — Norm URL                                ║
  ║  Strips: ref, utm_*, fbclid, gclid, session,       ║
  ║  source, id                                        ║
  ║  Sorts remaining query params.                     ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS
     💀 DUPE                │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 4 — Path Signature                          ║
  ║  Lowercases path, collapses //,                    ║
  ║  replaces all digits → {num}.                      ║
  ║  /shop/item/4829 == /shop/item/1                   ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS
     💀 DUPE                │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 5 — Content Fingerprint                     ║
  ║  domain + top 5 significant title keywords         ║
  ║  Strips stopwords, sorts, joins with |             ║
  ║  Stored as an MD5 in the seen_hashes set           ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS
     💀 DUPE                │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 6 — Phonetic Hash (Soundex-style)           ║
  ║  Strips non-alpha, maps consonant groups:          ║
  ║  bfpv→1  cgjkqsxz→2  dt→3  l→4  mn→5  r→6          ║
  ║  "Hacking" ≈ "Hakking" ≈ "Hcking"                  ║
  ║  Catches typo variants of the same site            ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS
     💀 DUPE                │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 7 — Canonical URL                           ║
  ║  Strips: index.php / home / default.asp            ║
  ║  Normalizes version strings → {ver}                ║
  ║  Normalizes /id/123 → /{id}                        ║
  ║  /store/v2.1/item/id/99 == /store/{ver}/item/{id}  ║
  ║  ← caught!                                         ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS
     💀 DUPE                │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 8 — Title Bigram Word Hash Overlap          ║
  ║  Tokenizes title, removes stopwords                ║
  ║  Computes MD5 of every consecutive word pair       ║
  ║  AND top 3 individual words                        ║
  ║  Checks overlap ratio against seen pool:           ║
  ║    overlap / union > 0.60 → DUPE                   ║
  ║  Catches reworded titles for same site             ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS
     💀 DUPE                │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 9 — Fuzzy Title Similarity                  ║
  ║  For every existing title in memory:               ║
  ║  Step A: strip punctuation, lowercase both         ║
  ║  Step B: Jaccard on word sets — if > 0.85 → ✗      ║
  ║  Step C: SequenceMatcher ratio — if > 0.90 → ✗     ║
  ║  "The Hidden Wiki 2024" ≈ "Hidden Wiki (2024)"     ║
  ║  ← caught by Jaccard                               ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS
     💀 DUPE                │
  ╔═════════════════════════▼══════════════════════════╗
  ║  LAYER 10 — Fuzzy URL Similarity                   ║
  ║  For every existing link in memory:                ║
  ║  • Must share same netloc (different host = OK)    ║
  ║  • SequenceMatcher on path → if > 0.90 → ✗         ║
  ║  • OR: same path + same query param keys → ✗       ║
  ║  http://abc.onion/shop?id=1 ≈                      ║
  ║  http://abc.onion/shop?id=2 ← caught!              ║
  ╚═════════╤═══════════════╤══════════════════════════╝
     FAIL ──┘               │ PASS ✅
     💀 DUPE                │
                    ┌───────▼──────┐
                    │ ACCEPTED ✅  │
                    │ ADD TO ALL   │
                    │ 8 STATE SETS │
                    │ SAVE TO JSON │
                    └──────────────┘
```

### 📐 Layer-by-Layer Technical Reference

| # | Layer | Algorithm | Data structure | Cost |
|:---:|:---|:---|:---|:---:|
| 1 | **Exact content hash** | MD5 of `title.lower() \| normalize(url)` | `set` | 🟢 O(1) |
| 2 | **Domain duplicate** | Regex extracts the 16/56-char `.onion` hash | `set` | 🟢 O(1) |
| 3 | **Normalized URL** | Strips tracking params + sorts the rest | `set` | 🟢 O(1) |
| 4 | **Path signature** | Lowercase + collapse `//` + `\d+→{num}` | `set` | 🟢 O(1) |
| 5 | **Content fingerprint** | Domain + top 5 non-stopword title tokens | `set` (MD5) | 🟡 O(n words) |
| 6 | **Phonetic hash** | Soundex-style consonant group mapping | `set` | 🟡 O(n chars) |
| 7 | **Canonical URL** | Strips index/home/default + `{ver}` + `{id}` | `set` | 🟡 O(path len) |
| 8 | **Bigram word hash** | MD5 of consecutive significant word pairs | `set` (overlap ratio) | 🟠 O(n²) |
| 9 | **Fuzzy title** | Jaccard on word sets + SequenceMatcher | `list` (linear scan) | 🔴 O(n·m) |
| 10 | **Fuzzy URL** | Same netloc + SequenceMatcher on path | `list` (linear scan) | 🔴 O(n·m) |

### 🧪 Real Dedup Examples

```
# Layer 2: Domain catches alternate paths pointing to the same .onion
"http://exampledomain56chars.onion/"           ← SEEN
"http://exampledomain56chars.onion/login"      → 💀 DUPLICATE (same domain)

# Layer 3: Normalized URL strips tracking junk
"http://site.onion/page?ref=telegram&id=99"
"http://site.onion/page?id=99"                 → 💀 DUPLICATE (normalized match)

# Layer 4: Path signature collapses digits
"http://shop.onion/item/4829"                  ← SEEN
"http://shop.onion/item/9173"                  → 💀 DUPLICATE (path sig: /item/{num})

# Layer 6: Phonetic match catches typo variants
"Empire Market"                                ← SEEN title
"Empyre Market"                                → 💀 DUPLICATE (same phonetic hash E516)

# Layer 7: Canonical strips version/index noise
"http://site.onion/v2.1/index.php"             ← SEEN
"http://site.onion/v3.0/"                      → 💀 DUPLICATE (canonical: site.onion)

# Layer 9: Fuzzy title catches rewording
"The Hidden Wiki – Updated 2026 Edition"       ← SEEN
"Hidden Wiki 2026 (Updated)"                   → 💀 DUPLICATE (Jaccard > 0.85)

# Layer 10: Fuzzy URL catches parameter variations
"http://market.onion/listing?id=1&cat=drugs"   ← SEEN
"http://market.onion/listing?id=2&cat=drugs"   → 💀 DUPLICATE (same keys)
```

## ⚡ Parallel Processing Architecture

```
                  ┌───────────────────────────┐
                  │    entries[] — N items    │
                  └─────────────┬─────────────┘
                                │  split into chunks
      ┌────────────┬────────────┼────────────┬────────────┐
      ▼            ▼            ▼            ▼            ▼
  [chunk 0]    [chunk 1]    [chunk 2]    [chunk 3]    [chunk 4]
      │            │            │            │            │
 ┌────▼────┐  ┌────▼────┐  ┌────▼────┐      ...          ...
 │Worker 0 │  │Worker 1 │  │Worker 2 │       │            │
 │DedupEng │  │DedupEng │  │DedupEng │       │            │
 │(no load)│  │(no load)│  │(no load)│       │            │
 └────┬────┘  └────┬────┘  └────┬────┘       │            │
      └────────────┴────────────┼────────────┴────────────┘
                                │
                 ┌──────────────▼──────────────┐
                 │  SORT results by chunk_idx  │
                 │  MERGE into final_engine    │
                 │  Cross-chunk dedup pass     │
                 │  (catches inter-chunk dups) │
                 └──────────────┬──────────────┘
                                │
                 ┌──────────────▼──────────────┐
                 │  ✅ Final unique entries[]  │
                 └─────────────────────────────┘

Threshold: ≥50 entries → parallel. <50 → single engine (faster for small sets)
Workers:   min(max(4, os.cpu_count()), 6) — always right-sized
```

## 💾 Persistent State: 8 State Vectors

Every time a new link is accepted, it is added to **8 separate state vectors** in `seen_data.json`:

```
{
  "hashes":       ["md5hex...", ...],       // Layer 1 + 5: exact hash + content fingerprint
  "domains":      ["abc123.onion", ...],    // Layer 2: base .onion domain
  "paths":        ["/item/{num}", ...],     // Layer 4: path signatures
  "titles":       ["Full Title...", ...],   // Layer 9: fuzzy title list
  "links":        ["http://...", ...],      // Layer 3 + 10: normalized URLs
  "phonetic":     ["H516", ...],            // Layer 6: Soundex-style hashes
  "canonical":    ["site.onion/...", ...],  // Layer 7: canonical URLs
  "title_hashes": ["a1b2c3d4", ...]         // Layer 8: bigram word hash pool
}
```

## 📦 Installation

### 🤖 Termux (Android): Recommended

```
# 1. Update packages + install deps
pkg update -y && pkg install python git -y

# 2. Clone the repo
git clone https://github.com/DXN1-termux/darkwebscraper-pro.git
cd darkwebscraper-pro

# 3. Install Python deps
pip install -r requirements.txt

# 4. Launch
python darkwebscraper-pro.py
```

### 🐧 Linux / macOS

```
# Clone
git clone https://github.com/DXN1-termux/darkwebscraper-pro.git
cd darkwebscraper-pro

# Virtual env (recommended)
python3 -m venv .venv && source .venv/bin/activate

# Deps
pip install -r requirements.txt

# Run
python3 darkwebscraper-pro.py
```

### 💉 One-Line Install and Run

```
git clone https://github.com/DXN1-termux/darkwebscraper-pro.git && cd darkwebscraper-pro && pip install -r requirements.txt && python3 darkwebscraper-pro.py
```

### 📦 pip (manual dependencies only)

```
pip install requests beautifulsoup4
python3 darkwebscraper-pro.py
```

## 🚀 Usage

```
python3 darkwebscraper-pro.py
```

You will see the startup banner and a system stats box, and then the search loop begins:

```
╔══════════════════════════════════════════════════════════╗
║  🔥 🕸 Dark Web Scraper · v2.0                           ║
╚══════════════════════════════════════════════════════════╝

┌──────────────────────────────────────────────────────┐
│  🔎 System                                           │
├──────────────────────────────────────────────────────┤
│  CPU cores detected:     8                           │
│  Cores being used:       6                           │
│  Deduplication layers:   10                          │
│  Parallel threshold:     50+ entries                 │
│  Output file:            darkweb.txt                 │
└──────────────────────────────────────────────────────┘
```

Then just type any query:

```
────────────────────────────────────────────────────────
🔎 Query (or 'quit'): darknet markets
────────────────────────────────────────────────────────
```

### 🎮 Special Commands

| Command | Action |
|:---|:---|
| `quit` / `q` / `exit` | Final dedup cleanup → clean exit |
| `download` | Dedup → copies `darkweb.txt` to Downloads |

## 📄 Output Format

Every accepted result is saved to `darkweb.txt` inside a box-drawing frame:

```
┌───────────────────────────────────────────────────────────────┐
│ 📌 Hidden Market — Full Access                                │
│ 🔗 http://exampleabcdefghij1234567890.onion/register          │
│ 🕐 2026-06-19 19:00:00                                        │
└───────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│ 📌 SecureDrop Portal — Whistleblower Hub                      │
│ 🔗 http://sdolvtfhatvsysc6l34d65ymdwxcujausv7k5jk4cy5ttzhjoi6fzvad.onion │
│ 🕐 2026-06-19 19:00:01                                        │
└───────────────────────────────────────────────────────────────┘
```

## 📁 Project Structure

```
darkwebscraper-pro/
├── darkwebscraper-pro.py             # 🧠 The entire engine (703 lines)
│   ├── StagePrinter                  # Rich emoji progress printer
│   ├── DedupEngine                   # 10-layer deduplication system
│   ├── parallel_dedup()              # ProcessPoolExecutor orchestrator
│   ├── load_existing_data()          # File parser with integrity checks
│   ├── save_deduplicated_file()      # Full rewrite with dedup pass
│   ├── get_onion_links_and_titles()  # Live torch.cx scraper
│   ├── save_results_to_file()        # Appends new results
│   └── copy_output_to_downloads()    # Android export helper
├── requirements.txt                  # requests, beautifulsoup4
├── .gitignore                        # Ignores runtime files
├── LICENSE                           # MIT
├── README.md                         # This glorious document
├── seen_data.json                    # [auto] Persistent dedup state
└── darkweb.txt                       # [auto] Collected .onion results
```

## 🛠️ Dependencies

| Package | Version | Purpose |
|:---|:---|:---|
| `requests` | `≥2.28.0` | HTTP client for torch.cx requests |
| `beautifulsoup4` | `≥4.11.0` | HTML parsing + result extraction |
| `re` *(stdlib)* | - | URL parsing, pattern matching |
| `hashlib` *(stdlib)* | - | MD5 fingerprinting for dedup |
| `difflib` *(stdlib)* | - | SequenceMatcher for fuzzy matching |
| `concurrent.futures` *(stdlib)* | - | ProcessPoolExecutor for parallelism |
| `multiprocessing` *(stdlib)* | - | CPU core count detection |
| `json` *(stdlib)* | - | Persistent state serialization |
| `collections` *(stdlib)* | - | `defaultdict` for stats tracking |
| `urllib.parse` *(stdlib)* | - | URL normalization + decoding |

## ⚠️ Legal Disclaimer
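
## 🧪 Sketch: URL Normalization and Path Signature

Layers 3 and 4 of the dedup flow are described only as diagram boxes, so a small illustration may help. What follows is a minimal sketch under stated assumptions, not the project's actual code: `normalize_url`, `path_signature`, and `TRACKING_KEYS` are hypothetical names, and the tracking-key list is copied from the Layer 3 box on the assumption that it is complete.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: keys listed in the Layer 3 box; the real engine's list may differ.
TRACKING_KEYS = {"ref", "fbclid", "gclid", "session", "source", "id"}


def normalize_url(url: str) -> str:
    """Layer 3 as described: drop tracking params, sort the remaining ones."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS and not key.lower().startswith("utm_")
    ]
    query = urlencode(sorted(kept))
    # The fragment is dropped; whether the real engine does this is an assumption.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))


def path_signature(url: str) -> str:
    """Layer 4 as described: lowercase path, collapse //, digit runs -> {num}."""
    path = urlsplit(url).path.lower()
    path = re.sub(r"/{2,}", "/", path)
    return re.sub(r"\d+", "{num}", path)
```

With these helpers, `http://site.onion/page?ref=telegram&id=99` and `http://site.onion/page?id=99` both reduce to `http://site.onion/page`, and `/item/4829` and `/item/9173` share the signature `/item/{num}`, which matches the Layer 3 and Layer 4 examples in this README.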

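
## 🧪 Sketch: Fuzzy Title Check

The Layer 9 box lists three steps (clean, Jaccard on word sets, `SequenceMatcher` ratio) with thresholds of 0.85 and 0.90. The sketch below follows those steps literally. The function names are hypothetical, and how the real engine tokenizes titles is an assumption.

```python
import re
from difflib import SequenceMatcher

JACCARD_THRESHOLD = 0.85   # from the Layer 9 box
SEQUENCE_THRESHOLD = 0.90  # from the Layer 9 box


def clean_title(title: str) -> str:
    """Step A: lowercase, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(no_punct.split())


def jaccard(a: str, b: str) -> float:
    """Step B: size of the word-set intersection over the size of the union."""
    words_a = set(clean_title(a).split())
    words_b = set(clean_title(b).split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def is_fuzzy_duplicate(title: str, seen_titles: list) -> bool:
    """Linear scan over every stored title, as the O(n·m) cost suggests."""
    cleaned = clean_title(title)
    for seen in seen_titles:
        if jaccard(title, seen) > JACCARD_THRESHOLD:
            return True
        # Step C: character-level similarity on the cleaned strings.
        if SequenceMatcher(None, cleaned, clean_title(seen)).ratio() > SEQUENCE_THRESHOLD:
            return True
    return False
```

Under these steps alone, `"Hidden Wiki (2024)"` and `"Hidden Wiki 2024"` match with a Jaccard score of 1.0, while `"The Hidden Wiki 2024"` against `"Hidden Wiki (2024)"` scores only 0.75. The real engine presumably drops stopwords such as "the" first or catches that pair in an earlier layer; that is a guess, not something this README confirms.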
```
· · · · · · · · · · · · · · · · · · · · ·
 Made with 🖤 in the dark by DXN1-termux
· · · · · · · · · · · · · · · · · · · · ·
```

[![GitHub](https://img.shields.io/badge/GitHub-DXN1--termux-181717?style=for-the-badge&logo=github)](https://github.com/DXN1-termux) [![Stars](https://img.shields.io/github/stars/DXN1-termux/darkwebscraper-pro?style=for-the-badge&logo=github&color=fbbf24&label=⭐%20Stars)](https://github.com/DXN1-termux/darkwebscraper-pro/stargazers)

Tags: Homebrew installation, Python, Tor, threat intelligence, developer tools, no backdoors, dark web intelligence, reverse engineering tools