DXN1-termux/darkwebscraper-pro
GitHub: DXN1-termux/darkwebscraper-pro
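A dark web `.onion` link collection and intelligence engine with ten-layer deduplication and multi-core parallel processing, designed for zero-duplicate data collection across sessions.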
Stars: 0 | Forks: 0
Made with ❤️ by DXN1
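## 🩸 What is DarkWeb Scraper Pro?
**DarkWeb Scraper Pro** is more than a scraper. It is a **10-layer onion intelligence system**, designed from the ground up to collect, validate, and deduplicate `.onion` links with high precision. It runs in Termux on your phone, on your laptop, anywhere. And it **never collects the same link twice**: not across restarts, not across sessions, not even across dimensions.
Built for OSINT researchers, security engineers, red teamers, and anyone who needs **clean, validated, zero-noise dark web intelligence data**.
## 🧬 Full System Architecture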
```
╔═══════════════════════════════════════════════════════════════════════════╗
║ DarkWeb Scraper Pro v2.0 — System Map ║
╠══════════════════════╦══════════════════════╦════════════════════════════╣
║ 🔍 SEARCH ENGINE ║ 🧠 DEDUP ENGINE v2 ║ ⚡ PARALLEL PROCESSOR ║
║ ────────────────── ║ ─────────────────── ║ ──────────────────────── ║
║ torch.cx querying ║ 10 dedup layers ║ ProcessPoolExecutor ║
║ HTML + BS4 parse ║ Hash + Domain ║ Auto CPU core detect ║
║ Redirect decoding ║ Phonetic + Fuzzy ║ Min(max(4, cores), 6) ║
║ URL extraction ║ Canonical + Bigram ║ Chunk splitting ║
║ .onion validation ║ Semantic fingerprint ║ Cross-chunk merge pass ║
╠══════════════════════╬══════════════════════╬════════════════════════════╣
║ 💾 PERSISTENCE ║ 📊 STAGE PRINTER ║ 📂 FILE ENGINE ║
║ ────────────────── ║ ─────────────────── ║ ──────────────────────── ║
║ seen_data.json ║ Emoji stage markers ║ Box-drawing output ║
║ 8 state vectors ║ Timed stage blocks ║ Append + full rewrite ║
║ Load on boot ║ Stats boxes ║ UTF-8 enforced ║
║ Save after each add ║ Dividers + banners ║ Integrity checking ║
╚══════════════════════╩══════════════════════╩════════════════════════════╝
```
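## ⚡ Features at a Glance
| 🔥 Feature | 💬 What it does |
|:---|:---|
| 🔍 **Live search** | Queries `torch.cx` (a Tor search engine) live over HTTPS |
| 🧹 **10-layer dedup** | Ten stacked checks, from exact hashes to fuzzy title and URL matching |
| ⚡ **Parallel cores** | Spawns up to 6 worker processes with `ProcessPoolExecutor` |
| 💾 **Persistent state** | `seen_data.json` keeps all dedup state between sessions |
| 🔗 **Integrity check** | Flags links that do not end cleanly in `.onion` or `.onion/` |
| 📊 **Rich output** | Emoji stages, timing info, stats boxes, ASCII dividers |
| 📂 **Structured file** | `darkweb.txt` uses box-drawing frames, with a timestamp on every record |
| 📥 **Android export** | Type `download` to copy the file to your Downloads folder |
| 🔄 **Incremental updates** | Re-run at any time. Only new, unique results are saved. |
| 🤖 **Fuzzy engine** | Jaccard similarity + SequenceMatcher + Soundex phonetic matching |
| 🧠 **Semantic matching** | Keyword fingerprint + bigram word hashes |
| 🌐 **Canonical URLs** | Strips index pages, version strings, and id parameters |
| 🏎️ **CPU detection** | `min(max(4, os.cpu_count()), 6)`: always between 4 and 6 workers |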
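The `.onion` validation and integrity check listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `onion_domain` and `ends_cleanly` are hypothetical helper names, and the pattern assumes the base32 alphabet (`a-z`, `2-7`) with 16-character (v2) or 56-character (v3) hostnames.

```python
import re
from urllib.parse import urlsplit

# Assumption: a valid base domain is 16 (v2) or 56 (v3) base32 characters.
ONION_HOST = re.compile(r"^(?:[a-z2-7]{16}|[a-z2-7]{56})\.onion$")


def onion_domain(url):
    """Return the base .onion domain of a URL, or None if it is not one."""
    host = (urlsplit(url).hostname or "").lower()
    # Keep only the last two labels so subdomains map to the same base.
    base = ".".join(host.split(".")[-2:])
    return base if ONION_HOST.match(base) else None


def ends_cleanly(url):
    """Integrity check: True when the link ends in .onion or .onion/."""
    return url.endswith(".onion") or url.endswith(".onion/")
```

A link such as `http://<56 chars>.onion/login` would yield a base domain from `onion_domain` but would still be flagged by `ends_cleanly`, which matches the table's description of the integrity check as a flag rather than a filter.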
## 🧠 Dedup Engine: All 10 Layers Explained
### 🗺️ The Dedup Gauntlet: Full Flow
```
┌─────────────────┐
│ NEW LINK INPUT │
│ (title + url) │
└────────┬────────┘
│
┌────────────▼────────────┐
│ PRE-COMPUTE ALL KEYS │
│ ───────────────────── │
│ • domain extract │
│ • normalize URL │
│ • content MD5 hash │
│ • path signature │
│ • canonical URL │
│ • phonetic hash │
│ • content fingerprint │
└────────────┬────────────┘
│
╔══════════════════════╪══════════════════════╗
║ LAYER 1 — Exact Content Hash ║
║ MD5( title.lower() + "|" + normalized_url )║
║ Cheapest possible. O(1) set lookup. ║
╚══════════════════╤═══════════════════════╤══╝
FAIL ──┘ │ PASS
💀 DUPE │
╔════════════════════════════════╗ │
║ LAYER 2 — Domain Duplicate ║◄────────┘
║ Extracts the raw .onion base ║
║ domain (16 or 56 char hash). ║
║ Same domain? You're dead. ║
╚═══════════════╤════════════════╝
FAIL ──┘ │ PASS
💀 DUPE │
╔══════════════════════╗ │
║ LAYER 3 — Norm URL ║◄─┘
║ Strips: ref, utm_*, ║
║ fbclid, gclid, ║
║ session, source, id ║
║ Sorts remaining ║
║ query params. ║
╚════════╤═════════════╝
FAIL ──┘ │ PASS
💀 DUPE │
╔═══════════════▼══════════════════╗
║ LAYER 4 — Path Signature ║
║ Lowercases path, collapses //, ║
║ replaces all digits → {num}. ║
║ /shop/item/4829 == /shop/item/1 ║
╚════════════════╤═════════════════╝
FAIL ───┘ │ PASS
💀 DUPE │
╔════════════════════════▼═══════════════════╗
║ LAYER 5 — Content Fingerprint ║
║ domain + top 5 significant title keywords ║
║ Strips stopwords, sorts, joins with | ║
║ Stored as an MD5 in the seen_hashes set ║
╚════════════════════════╤═══════════════════╝
FAIL ───┘ │ PASS
💀 DUPE │
╔════════════════════════════════▼════════════╗
║ LAYER 6 — Phonetic Hash (Soundex-style) ║
║ Strips non-alpha, maps consonant groups: ║
║ bfpv→1 cgjkqsxz→2 dt→3 l→4 mn→5 r→6║
║ "Hacking" ≈ "Hakking" ≈ "Hcking" ║
║ Catches typo variants of the same site ║
╚═══════════════════════╤════════════════════╝
FAIL ───┘ │ PASS
💀 DUPE │
╔═════════════════════════════════▼══════════╗
║ LAYER 7 — Canonical URL ║
║ Strips: index.php / home / default.asp ║
║ Normalizes version strings → {ver} ║
║ Normalizes /id/123 → /{id} ║
║ /store/v2.1/item/id/99 == /store/{ver}/ ║
║ item/{id} ← caught! ║
╚═══════════════════╤════════════════════════╝
FAIL ───┘ │ PASS
💀 DUPE │
╔═══════════════════════════▼═══════════════════╗
║ LAYER 8 — Title Bigram Word Hash Overlap ║
║ Tokenizes title, removes stopwords ║
║ Computes MD5 of every consecutive word pair ║
║ AND top 3 individual words ║
║ Checks overlap ratio against seen pool: ║
║ overlap / union > 0.60 → DUPE ║
║ Catches reworded titles for same site ║
╚════════════════════╤══════════════════════════╝
FAIL ───┘ │ PASS
💀 DUPE │
╔════════════════════════════▼════════════════════╗
║ LAYER 9 — Fuzzy Title Similarity ║
║ For every existing title in memory: ║
║ Step A: strip punctuation, lowercase both ║
║ Step B: Jaccard on word sets — if > 0.85 → ✗ ║
║ Step C: SequenceMatcher ratio — if > 0.90 → ✗ ║
║ "Hidden Wiki 2024" ≈ "Hidden Wiki (2024)" ║
║ ← caught by Jaccard ║
╚══════════════════╤══════════════════════════════╝
FAIL ───┘ │ PASS
💀 DUPE │
╔══════════════════════════▼══════════════════════╗
║ LAYER 10 — Fuzzy URL Similarity ║
║ For every existing link in memory: ║
║ • Must share same netloc (different host = OK)║
║ • SequenceMatcher on path → if > 0.90 → ✗ ║
║ • OR: same path + same query param keys → ✗ ║
║ http://abc.onion/shop?id=1 ≈ ║
║ http://abc.onion/shop?id=2 ← caught! ║
╚══════════════════╤══════════════════════════════╝
FAIL ───┘ │ PASS ✅
💀 DUPE │
┌───────▼──────┐
│ ACCEPTED ✅ │
│ ADD TO ALL │
│ 8 STATE SETS │
│ SAVE TO JSON │
└──────────────┘
```
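Layers 1 and 3 from the flow above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and the stripped keys are taken from the Layer 3 box (`ref`, `utm_*`, `fbclid`, `gclid`, `session`, `source`, `id`).

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the tracking keys listed in the Layer 3 box of the diagram.
TRACKING_KEYS = {"ref", "fbclid", "gclid", "session", "source", "id"}


def normalize_url(url):
    """Layer 3 sketch: drop tracking params, sort the remaining ones."""
    parts = urlsplit(url.strip())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS and not key.lower().startswith("utm_")
    ]
    query = urlencode(sorted(kept))
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, "")
    )


def content_hash(title, url):
    """Layer 1 key: MD5 of lowercased title + '|' + normalized URL."""
    raw = f"{title.lower()}|{normalize_url(url)}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def is_exact_duplicate(title, url, seen_hashes):
    """O(1) set lookup; records the key when the entry is new."""
    key = content_hash(title, url)
    if key in seen_hashes:
        return True
    seen_hashes.add(key)
    return False
```

Because the hash is computed over the normalized URL, two entries that differ only in title casing or tracking parameters collapse to the same key before any of the more expensive layers run.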
### 📐 Layer-by-Layer Technical Reference
| # | Layer | Algorithm | Data structure | Cost |
|:---:|:---|:---|:---|:---:|
| 1 | **Exact content hash** | MD5 of `title.lower() \| normalize(url)` | `set` | 🟢 O(1) |
| 2 | **Domain duplicate** | Regex extraction of the 16/56-character `.onion` hash | `set` | 🟢 O(1) |
| 3 | **Normalized URL** | Strip tracking params + sort the rest | `set` | 🟢 O(1) |
| 4 | **Path signature** | Lowercase + collapse `//` + `\d+→{num}` | `set` | 🟢 O(1) |
| 5 | **Content fingerprint** | Domain + top 5 non-stopword title tokens | `set` (MD5) | 🟡 O(n words) |
| 6 | **Phonetic hash** | Soundex-style consonant group mapping | `set` | 🟡 O(n chars) |
| 7 | **Canonical URL** | Strip index/home/default + `{ver}` + `{id}` | `set` | 🟡 O(path len) |
| 8 | **Bigram word hash** | MD5 of consecutive significant word pairs | `set` (overlap ratio) | 🟠 O(n²) |
| 9 | **Fuzzy title** | Word-set Jaccard + SequenceMatcher | `list` (linear scan) | 🔴 O(n·m) |
| 10 | **Fuzzy URL** | Same netloc + SequenceMatcher on path | `list` (linear scan) | 🔴 O(n·m) |
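The two linear-scan layers are the expensive ones because every new entry is compared against every stored one. A rough sketch of the Layer 9 check, assuming the thresholds from the flow diagram (Jaccard above 0.85, SequenceMatcher ratio above 0.90), might look like this. The helper names are hypothetical and the project's real code may differ.

```python
import re
from difflib import SequenceMatcher


def clean_title(title):
    """Step A: lowercase, turn punctuation into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())


def jaccard(a, b):
    """Step B: word-set Jaccard similarity of two cleaned titles."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def is_fuzzy_title_dupe(title, seen_titles, jaccard_min=0.85, ratio_min=0.90):
    """Linear scan over stored titles: O(n) comparisons per new entry."""
    new = clean_title(title)
    for old in seen_titles:
        old_clean = clean_title(old)
        if jaccard(new, old_clean) > jaccard_min:
            return True
        # Step C: character-level similarity catches small spelling edits.
        if SequenceMatcher(None, new, old_clean).ratio() > ratio_min:
            return True
    return False
```

Running the cheap set-based layers first means this scan only happens for entries that already passed eight O(1) or near-O(1) checks.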
### 🧪 Real Dedup Examples
```
# Layer 2 (domain): catches alternate paths on the same .onion
"http://exampledomain56chars.onion/" ← SEEN
"http://exampledomain56chars.onion/login" → 💀 DUPLICATE (same domain)
# Layer 3 (normalized URL): strips tracking junk
"http://site.onion/page?ref=telegram&id=99" ← SEEN
"http://site.onion/page?id=99" → 💀 DUPLICATE (normalized match)
# Layer 4 (path signature): collapses numbers
"http://shop.onion/item/4829" ← SEEN
"http://shop.onion/item/9173" → 💀 DUPLICATE (path sig: /item/{num})
# Layer 6 (phonetic): catches misspelled variants
"Empire Market" ← SEEN title
"Empyre Market" → 💀 DUPLICATE (same phonetic hash E516)
# Layer 7 (canonical): strips version/index noise
"http://site.onion/v2.1/index.php" ← SEEN
"http://site.onion/v3.0/" → 💀 DUPLICATE (canonical: site.onion)
# Layer 9 (fuzzy title): catches rewording
"Hidden Wiki – Updated 2026" ← SEEN
"Hidden Wiki 2026 (Updated)" → 💀 DUPLICATE (Jaccard > 0.85)
# Layer 10 (fuzzy URL): catches parameter variations
"http://market.onion/listing?id=1&cat=drugs" ← SEEN
"http://market.onion/listing?id=2&cat=drugs" → 💀 DUPLICATE (same keys)
```
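The Layer 4 and Layer 6 examples above can be reproduced with a short sketch. The consonant groups come from the Layer 6 box in the flow diagram; everything else (function names, four-character output padded with zeros, vowels resetting the repeat check) is an assumption modeled on classic Soundex rather than the project's confirmed behavior.

```python
import re
from urllib.parse import urlsplit

# Consonant groups as listed in the Layer 6 box.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def path_signature(url):
    """Layer 4 sketch: lowercase path, collapse //, digit runs become {num}."""
    path = urlsplit(url).path.lower()
    path = re.sub(r"/{2,}", "/", path)
    return re.sub(r"\d+", "{num}", path)


def phonetic_hash(title):
    """Layer 6 sketch: first letter plus up to three consonant codes."""
    letters = re.sub(r"[^a-z]", "", title.lower())
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for letter in letters[1:]:
        code = CODES.get(letter, "")
        # Skip unmapped letters and immediate repeats of the same code.
        if code and code != prev:
            out += code
            if len(out) == 4:
                break
        prev = code
    return out.ljust(4, "0")


print(path_signature("http://shop.onion/item/4829"))  # /item/{num}
print(phonetic_hash("Empire Market"), phonetic_hash("Empyre Market"))  # E516 E516
```

Since only the first four characters survive, long titles that share an opening word hash identically, which is why a phonetic layer like this is aggressive about rejecting near-matches.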
## ⚡ Parallel Processing Architecture
```
┌───────────────────────────┐
│ entries[] — N items │
└────────────┬──────────────┘
│ split into chunks
┌────────┬──────────┼──────────┬────────┐
▼ ▼ ▼ ▼ ▼
[chunk 0] [chunk 1] [chunk 2] [chunk 3] [chunk 4]
│ │ │ │ │
┌──────▼──┐ ┌───▼─────┐ ┌─▼──────┐ ... ... │
│Worker 0 │ │Worker 1 │ │Worker 2 │ │
│DedupEng │ │DedupEng │ │DedupEng │ │
│(no load)│ │(no load)│ │(no load)│ │
└──────┬──┘ └───┬─────┘ └─┬──────┘ │
└────────┴──────────┴─────────────────┘
│
┌────────────▼───────────────┐
│ SORT results by chunk_idx │
│ MERGE into final_engine │
│ Cross-chunk dedup pass │
│ (catches inter-chunk dups) │
└────────────┬───────────────┘
│
┌────────────▼───────────────┐
│ ✅ Final unique entries[] │
└───────────────────────────┘
Threshold: ≥50 entries → parallel. <50 → single engine (faster for small sets)
Workers: min(max(4, os.cpu_count()), 6), so always between 4 and 6
```
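The split, per-chunk dedup, and cross-chunk merge shown above can be sketched as follows. This is a simplified stand-in: `dedup_chunk` only checks exact URLs instead of running all ten layers, and every name here (`worker_count`, `split_chunks`, `dedup_chunk`, `parallel_dedup_sketch`) is hypothetical. One detail worth noting is that `os.cpu_count()` can return `None`, so the sketch falls back to 1 before clamping.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def worker_count():
    """Clamp the worker count to the 4..6 range from the diagram."""
    return min(max(4, os.cpu_count() or 1), 6)


def split_chunks(entries, n_chunks):
    """Split entries into at most n_chunks contiguous slices."""
    size = max(1, -(-len(entries) // n_chunks))  # ceiling division
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def dedup_chunk(indexed_chunk):
    """Stand-in for the 10-layer engine: keeps the first entry per URL."""
    index, chunk = indexed_chunk
    seen, unique = set(), []
    for title, url in chunk:
        if url not in seen:
            seen.add(url)
            unique.append((title, url))
    return index, unique


def parallel_dedup_sketch(entries, threshold=50):
    """Below the threshold use one engine; otherwise fan out and merge."""
    if len(entries) < threshold:
        return dedup_chunk((0, entries))[1]
    workers = worker_count()
    chunks = list(enumerate(split_chunks(entries, workers)))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(dedup_chunk, chunks))
    results.sort(key=lambda item: item[0])  # restore chunk order
    merged = [entry for _, part in results for entry in part]
    # Cross-chunk pass: catches duplicates that landed in different chunks.
    return dedup_chunk((0, merged))[1]
```

When calling `parallel_dedup_sketch` from a script with 50 or more entries, wrap the call in an `if __name__ == "__main__":` guard so worker processes can import the module safely on platforms that spawn rather than fork.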
## 💾 Persistent State: 8 State Vectors
Each time a new link is accepted, it is added to **8 separate state vectors** in `seen_data.json`:
```
{
"hashes": ["md5hex...", ...], // Layer 1 + 5: exact hash + content fingerprint
"domains": ["abc123.onion", ...],// Layer 2: base .onion domain
"paths": ["/item/{num}", ...], // Layer 4: path signatures
"titles": ["Full Title...", ...],// Layer 9: fuzzy title list
"links": ["http://...", ...], // Layer 3 + 10: normalized URLs
"phonetic": ["H516", ...], // Layer 6: Soundex-style hashes
"canonical": ["site.onion/...", ...],// Layer 7: canonical URLs
"title_hashes":["a1b2c3d4", ...] // Layer 8: bigram word hash pool
}
```
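Loading and saving a state file shaped like the one above could look roughly like this. The key names are taken from the listing; the helper names, the tolerant handling of a missing or corrupt file, and the write-then-rename step are assumptions for illustration. Note that real JSON has no `//` comments, so the annotations in the listing are explanatory only.

```python
import json
from pathlib import Path

# The eight keys shown in the seen_data.json listing.
STATE_KEYS = (
    "hashes", "domains", "paths", "titles",
    "links", "phonetic", "canonical", "title_hashes",
)


def load_state(path):
    """Load on boot; a missing or unreadable file yields empty vectors."""
    state = {key: [] for key in STATE_KEYS}
    file = Path(path)
    if not file.exists():
        return state
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, OSError):
        return state
    if isinstance(data, dict):
        for key in STATE_KEYS:
            if isinstance(data.get(key), list):
                state[key] = data[key]
    return state


def remember(state, key, value):
    """Add a value to one vector, skipping values already stored."""
    if value not in state[key]:
        state[key].append(value)


def save_state(path, state):
    """Write to a temp file, then rename, so a crash never truncates state."""
    target = Path(path)
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(target)
```

Lists are used on disk because JSON has no set type; an engine would typically rebuild in-memory sets from these lists at startup for O(1) lookups.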
## 📦 Installation
### 🤖 Termux (Android), recommended
```
# 1. Update packages + install deps
pkg update -y && pkg install python git -y
# 2. Clone repo
git clone https://github.com/DXN1-termux/darkwebscraper-pro.git
cd darkwebscraper-pro
# 3. Install Python deps
pip install -r requirements.txt
# 4. Launch
python darkwebscraper-pro.py
```
### 🐧 Linux / macOS
```
# Clone
git clone https://github.com/DXN1-termux/darkwebscraper-pro.git
cd darkwebscraper-pro
# Virtual env (recommended)
python3 -m venv .venv && source .venv/bin/activate
# Deps
pip install -r requirements.txt
# Run
python3 darkwebscraper-pro.py
```
### 💉 One-Line Install and Run
```
git clone https://github.com/DXN1-termux/darkwebscraper-pro.git && cd darkwebscraper-pro && pip install -r requirements.txt && python3 darkwebscraper-pro.py
```
### 📦 pip (manual dependencies only)
```
pip install requests beautifulsoup4
python3 darkwebscraper-pro.py
```
## 🚀 Usage
```
python3 darkwebscraper-pro.py
```
You will see the startup banner and the system stats box, and then the search loop begins:
```
╔══════════════════════════════════════════════════════════╗
║ 🔥 🕸 Dark Web Scraper · v2.0 ║
╚══════════════════════════════════════════════════════════╝
┌──────────────────────────────────────────────────────┐
│ 🔎 System │
├──────────────────────────────────────────────────────┤
│ CPU cores detected: 8 │
│ Cores being used: 6 │
│ Deduplication layers: 10 │
│ Parallel threshold: 50+ entries │
│ Output file: darkweb.txt │
└──────────────────────────────────────────────────────┘
```
Then just type any query:
```
────────────────────────────────────────────────────────
🔎 Query (or 'quit'): darknet markets
────────────────────────────────────────────────────────
```
### 🎮 Special Commands
| Command | Action |
|:---|:---|
| `quit` / `q` / `exit` | Runs a final dedup cleanup, then exits cleanly |
| `download` | Deduplicates, then copies `darkweb.txt` to Downloads |
## 📄 Output Format
Each accepted result is saved to `darkweb.txt` inside a box-drawing frame:
```
┌───────────────────────────────────────────────────────────────┐
│ 📌 Hidden Market — Full Access │
│ 🔗 http://exampleabcdefghij1234567890.onion/register │
│ 🕐 2026-06-19 19:00:00 │
└───────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│ 📌 SecureDrop Portal — Whistleblower Hub │
│ 🔗 http://sdolvtfhatvsysc6l34d65ymdwxcujausv7k5jk4cy5ttzhjoi6fzvad.onion │
│ 🕐 2026-06-19 19:00:01 │
└───────────────────────────────────────────────────────────────┘
```
## 📁 Project Structure
```
darkwebscraper-pro/
├── darkwebscraper-pro.py # 🧠 The entire engine (703 lines)
│ ├── StagePrinter # Rich emoji progress printer
│ ├── DedupEngine # 10-layer deduplication system
│ ├── parallel_dedup() # ProcessPoolExecutor orchestrator
│ ├── load_existing_data()# File parser with integrity checks
│ ├── save_deduplicated_file() # Full rewrite with dedup pass
│ ├── get_onion_links_and_titles() # Live torch.cx scraper
│ ├── save_results_to_file() # Appends new results
│ └── copy_output_to_downloads() # Android export helper
├── requirements.txt # requests, beautifulsoup4
├── .gitignore # Ignores runtime files
├── LICENSE # MIT
├── README.md # This glorious document
├── seen_data.json # [auto] Persistent dedup state
└── darkweb.txt # [auto] Collected .onion results
```
## 🛠️ Dependencies
| Package | Version | Purpose |
|:---|:---|:---|
| `requests` | `≥2.28.0` | HTTP client for torch.cx requests |
| `beautifulsoup4` | `≥4.11.0` | HTML parsing + result extraction |
| `re` *(stdlib)* | n/a | URL parsing, pattern matching |
| `hashlib` *(stdlib)* | n/a | MD5 fingerprinting for dedup |
| `difflib` *(stdlib)* | n/a | SequenceMatcher for fuzzy matching |
| `concurrent.futures` *(stdlib)* | n/a | ProcessPoolExecutor for parallelism |
| `multiprocessing` *(stdlib)* | n/a | CPU core count detection |
| `json` *(stdlib)* | n/a | Persistent state serialization |
| `collections` *(stdlib)* | n/a | `defaultdict` for stats tracking |
| `urllib.parse` *(stdlib)* | n/a | URL normalization + decoding |
## ⚠️ Legal Notice
🕸️ The Most Dangerous Onion Intelligence Engine Ever 🕸️
Born in shadow. Runs in the dark. Sees everything.
```
· · · · · · · · · · · · · · · · · · · · ·
Made with 🖤 in the dark by DXN1-termux
· · · · · · · · · · · · · · · · · · · · ·
```