ishara-sampath/megaweb-dns-map-118m

GitHub: ishara-sampath/megaweb-dns-map-118m

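This dataset maps 118 million active domains to their resolved IPs, providing an offline foundation for threat intelligence, infrastructure analysis, and large-scale DNS research.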

Stars: 0 | Forks: 0

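# 🌐 MegaWeb DNS Map: 118 Million Active Domains

## 📦 Dataset Overview

| File | Description | Size |
|------|-------------|------|
| `internet_domains_118m_active.txt` | Master list of 118 million active domains | ~3–5 GB |
| `domain_to_ip_resolved_118m.txt` | Full domain → IP resolution map | ~6–10 GB |

## 🚀 Quick Start

```
git clone https://github.com/ishara-sampath/megaweb-dns-map-118m.git
cd megaweb-dns-map-118m

# Download the Release Assets to get started
# See: https://github.com/ishara-sampath/megaweb-dns-map-118m/releases
```

## 🔬 What Can You Do with 118 Million Domains?

This dataset is a snapshot of the **live, resolvable internet**. Below are advanced use cases spanning security research, infrastructure analysis, data science, and threat intelligence.

### 1. 🔍 Passive DNS Intelligence & Threat Hunting

Map domains to IPs without sending a single packet to the target.

```
# Assumes one "domain ip" pair per line, the layout the other examples in this README rely on

# Find all domains resolving to a suspicious IP
awk '$2 == "203.0.113.45" {print $1}' data/domain_to_ip_resolved_118m.txt

# Find all domains on a /24 subnet (e.g. a bulletproof-hosting range)
awk '$2 ~ /^185\.220\.101\./ {print $1}' data/domain_to_ip_resolved_118m.txt | sort -u

# Cross-reference against known-malicious IPs (from TI feeds)
comm -12 \
  <(awk '{print $2}' data/domain_to_ip_resolved_118m.txt | sort -u) \
  <(sort -u known_malicious_ips.txt)
```

**Use case:** Identify shared infrastructure used by threat actors. A single C2 IP often hosts dozens of lookalike phishing domains.

### 2. 🗺️ Internet Infrastructure Mapping

Understand how the internet is structured at scale.

```
# Count unique IPs (how many distinct hosts serve the 118M domains?)
awk '{print $2}' data/domain_to_ip_resolved_118m.txt | sort -u | wc -l

# Find the IPs hosting the most domains (shared hosting / CDN nodes)
awk '{print $2}' data/domain_to_ip_resolved_118m.txt \
  | sort | uniq -c | sort -rn | head -50

# Map domains per ASN (requires an IP-to-ASN lookup, e.g. ip2asn)
awk '{print $2}' data/domain_to_ip_resolved_118m.txt \
  | sort -u > unique_ips.txt
# Then pipe the list through ip-api / MaxMind / a BGP table lookup
```

**Use case:** Measure how large a share of domains sits behind a handful of CDNs (Cloudflare, Fastly, Akamai).

### 3. 🏢 ASN & Hosting Provider Analysis

```
# Install bgptools or use a local MaxMind GeoIP DB
# Map each resolved IP to its ASN
# (for large lists, prefer Team Cymru's bulk interface over one query per IP)
while read -r ip; do
  whois -h whois.cymru.com " -v $ip"
done < unique_ips.txt > asn_map.txt

# Rank hosting providers by number of unique IPs
# Verbose output is pipe-delimited: AS | IP | BGP Prefix | CC | Registry | Allocated | AS Name
grep -v '^AS ' asn_map.txt | awk -F'|' '{gsub(/^ +| +$/, "", $7); print $7}' \
  | sort | uniq -c | sort -rn | head -20
```

**Example insight:** Determine what percentage of the 118 million domains each cloud provider (AWS, GCP, Azure, Hetzner, OVH) hosts.

### 4. 🕵️ Subdomain Discovery & Enumeration

Use this dataset as a passive subdomain wordlist source.

```
# Extract all subdomains of a target apex domain
grep "\.target\.com$" data/internet_domains_118m_active.txt

# Build a real-world subdomain wordlist from observed patterns
# (matches any label followed by at least two more labels, including the leftmost one)
grep -oP '(?:^|(?<=\.))[a-z0-9-]+(?=\.[a-z0-9-]+\.[a-z]{2,})' \
  data/internet_domains_118m_active.txt \
  | sort | uniq -c | sort -rn | head -1000 \
  | awk '{print $2}' > real_world_subdomain_wordlist.txt
```

**Use case:** Better than generic wordlists, because these subdomains actually exist in the wild.

### 5. 🎯 Phishing & Typosquatting Detection

Detect domains impersonating legitimate brands.

```
brands = ["paypal", "amazon", "google", "microsoft", "apple"]

with open("data/internet_domains_118m_active.txt") as f:
    for line in f:
        domain = line.strip().lower()
        for brand in brands:
            # Flag domains that contain a brand name but are not the brand's
            # own .com domain or one of its subdomains. This is a plain
            # substring check, so expect false positives (e.g. "pineapple").
            official = f"{brand}.com"
            is_official = domain == official or domain.endswith(f".{official}")
            if brand in domain and not is_official:
                print(f"SUSPECT: {domain}")
```

```
# Regex-based: catch common typosquatting patterns
grep -E "(paypa1|paypai|amaz0n|g00gle|micros0ft)" \
  data/internet_domains_118m_active.txt
```

**Use case:** Feed results into [dnstwist](https://github.com/elceef/dnstwist) for brand-protection monitoring.

### 6. 🧠 Machine Learning & NLP on Domain Names

Domain names carry rich signals about purpose, geography, and legitimacy.

```
import math
from collections import Counter

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Shannon entropy over characters. The original snippet calls this helper
# without defining it, so this definition is an assumption.
def calculate_entropy(s):
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

# Feature extraction from raw domain strings
def extract_features(domain):
    return {
        "length": len(domain),
        "num_digits": sum(c.isdigit() for c in domain),
        "num_hyphens": domain.count("-"),
        "num_dots": domain.count("."),
        "entropy": calculate_entropy(domain),  # high entropy = DGA candidate
        "tld": domain.rsplit(".", 1)[-1],
        "has_brand_keyword": any(b in domain for b in ["login", "secure", "bank"]),
    }

# Train a DGA (Domain Generation Algorithm) detector
# Labels: 1 = DGA-generated, 0 = legitimate (from this dataset)
```

**Use cases:**

- DGA botnet domain detection
- Malicious domain classification
- TLD distribution analysis
- Domain age / registration pattern modeling

### 7. 📊 TLD & Domain Trend Analysis

```
# Distribution of top-level domains
awk -F. '{print $NF}' data/internet_domains_118m_active.txt \
  | sort | uniq -c | sort -rn | head -30

# Second-level domain length distribution ($(NF-1) is the label just before the TLD)
awk -F. 'NF >= 2 {print length($(NF-1))}' data/internet_domains_118m_active.txt \
  | sort | uniq -c | sort -rn

# Country-code TLD breakdown (ccTLD mapping)
grep -E "\.(cn|ru|de|uk|br|in|fr|jp)$" data/internet_domains_118m_active.txt \
  | awk -F. '{print $NF}' | sort | uniq -c | sort -rn
```

### 8. 🔗 IP Clustering & Shared Hosting Detection

```
from collections import defaultdict

# Note: this keeps every domain in memory; run it on a chunk if RAM is limited
ip_to_domains = defaultdict(list)

with open("data/domain_to_ip_resolved_118m.txt") as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) == 2:
            domain, ip = parts
            ip_to_domains[ip].append(domain)

# IPs hosting 500+ domains = shared hosting or CDN
bulk_hosters = {
    ip: domains
    for ip, domains in ip_to_domains.items()
    if len(domains) >= 500
}

print(f"Found {len(bulk_hosters)} bulk-hosting IPs")
```

### 9. 🌍 Geolocation Heatmap

Visualize where the internet is physically located.

```
import time
from collections import Counter
from itertools import islice

import requests

# Load a sample of unique IPs (size it to the API's rate limits)
with open("unique_ips.txt") as f:
    ips = [line.strip() for line in islice(f, 10000)]

country_counts = Counter()
for ip in ips:
    r = requests.get(f"http://ip-api.com/json/{ip}?fields=country", timeout=10)
    country_counts[r.json().get("country", "Unknown")] += 1
    time.sleep(1.5)  # the free endpoint is rate limited; adjust to its documented limit

# Plot with matplotlib / folium / plotly
```

### 10. 🔒 Certificate Transparency & HTTPS Coverage

```
# Extract apex domains for CT log correlation
# (naive: keeps the last two labels, so suffixes such as co.uk need a Public Suffix List-aware tool)
awk -F. '{
  n = split($0, a, ".")
  if (n >= 2) print a[n-1] "." a[n]
}' data/internet_domains_118m_active.txt \
  | sort -u > apex_domains.txt

# Check HTTPS adoption with httpx
cat data/internet_domains_118m_active.txt \
  | httpx -silent -status-code -title -tech-detect \
      -o http_results.txt
```

### 11. 🛡️ Attack Surface Mapping (Red Team / Bug Bounty)

```
# Find all domains in scope for a bug bounty program
grep -E "(\.tesla\.com|\.apple\.com|\.google\.com)$" \
  data/internet_domains_118m_active.txt > in_scope_domains.txt

# Pipe into recon tools
cat in_scope_domains.txt | httpx -silent | nuclei -t ~/nuclei-templates/
```

### 12. 📡 DNS Infrastructure Research

```
# Find domains with unusual resolution patterns
# (e.g. domains resolving to 0.0.0.0, 127.x.x.x, or RFC 1918 addresses)
grep -E "\s(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.|127\.|0\.0\.0\.0)" \
  data/domain_to_ip_resolved_118m.txt > sinkholed_or_internal.txt

# Sinkholed domains (known sinkhole IPs)
grep -E "\s(52\.15\.113\.20|146\.185\.254\.21)$" \
  data/domain_to_ip_resolved_118m.txt
```

## 🛠️ Reassembling the Master Files

The master files are too large to store directly in Git, so they are distributed as **118 split chunks** via GitHub Release Assets. After downloading all chunks, merge them back into the full dataset with the commands below.

### Merge Domain Chunks → Master Domain List

```
# Merge all domains_part_*.txt files, in numeric order, into the master domain file.
# A plain glob sorts lexicographically (part_100 before part_11), hence sort -V.
ls data/split_chunks_domains/domains_part_*.txt | sort -V | xargs cat \
  > data/internet_domains_118m_active.txt
```

### Merge Resolved Chunks → Master IP Map

```
# Merge all resolved_part_*.txt files, in numeric order, into the master IP map file
ls data/split_chunks_resolved/resolved_part_*.txt | sort -V | xargs cat \
  > data/domain_to_ip_resolved_118m.txt
```

### Re-resolve with dnsx (Optional: Build from Scratch)

If you want to rerun DNS resolution yourself, follow these steps.

#### 1. Install `dnsx`

`dnsx` is a fast, multi-purpose DNS toolkit. You can install it on Debian-based Linux distributions (Ubuntu, Debian, Kali, etc.) with the commands below:

```
# Make sure Go is installed
sudo apt update && sudo apt install golang -y

# Install dnsx via Go
go install -v github.com/projectdiscovery/dnsx/cmd/dnsx@latest

# Move it onto your system path
sudo mv ~/go/bin/dnsx /usr/local/bin/
```

#### 2. Resolve a Single Chunk

Use the following recommended parameters for each chunk:

```
# Flags used below (inline comments after a trailing backslash would break the command):
#   -a          resolve A records
#   -resp       include the resolved IP in the output
#   -silent     clean output
#   -r          custom resolver list (8.8.8.8, 1.1.1.1, etc.)
#   -t 100      100 concurrent threads
#   -rl 500     rate limit of 500 requests/second
#   -retry 2    up to 2 DNS attempts per lookup (dnsx names this flag -retry)
#   -o          output file
dnsx -l data/split_chunks_domains/domains_part_XX.txt \
  -a -resp -silent \
  -r resolvers.txt \
  -t 100 -rl 500 -retry 2 \
  -o data/split_chunks_resolved/resolved_part_XX.txt
```

## 📁 Repository Structure

```
megaweb-dns-map-118m/
├── .gitignore
├── LICENSE
├── README.md
│
└── data/
    ├── internet_domains_118m_active.txt   ← Release Asset (reassembled)
    ├── domain_to_ip_resolved_118m.txt     ← Release Asset (reassembled)
    │
    ├── split_chunks_domains/              ← Release Assets (download these)
    │   ├── domains_part_00.txt
    │   └── ... (up to part_117.txt)
    │
    └── split_chunks_resolved/             ← Release Assets (download these)
        ├── resolved_part_00.txt
        └── ... (up to part_117.txt)
```

## ⚙️ Performance Tips for Large Files

```
# Fast line count (streams the file; nothing is held in memory)
wc -l data/internet_domains_118m_active.txt

# Search with ripgrep (rg), which is often much faster than grep
rg "amazon" data/internet_domains_118m_active.txt

# Parallel grep across all chunks (uses all CPU cores)
ls data/split_chunks_domains/ | parallel "grep '\.io$' data/split_chunks_domains/{}"

# Stream processing without loading the file into RAM (sort spills to disk beyond its buffer)
awk '{print $2}' data/domain_to_ip_resolved_118m.txt | sort -u --buffer-size=4G > unique_ips.txt

# Use SQLite for relational queries (import once, query many times)
sqlite3 dns.db <
```

Made by ishara-sampath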
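The SQLite tip in the performance section (import once, query many times) could be sketched with Python's standard `sqlite3` module as follows. This is a minimal illustration, not the repository's own import script: the table name `dns`, the index name, and the helper names `import_mapping` and `domains_on_ip` are hypothetical, and the input is assumed to hold one whitespace-separated `domain ip` pair per line, as the other examples assume.

```python
import sqlite3


def import_mapping(txt_path, db_path, batch_size=100_000):
    """Load `domain ip` lines into a SQLite table and index it by IP.

    Hypothetical helper: table and index names are illustrative only.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dns (domain TEXT NOT NULL, ip TEXT NOT NULL)"
    )
    batch = []
    with open(txt_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            batch.append((parts[0], parts[1]))
            if len(batch) >= batch_size:
                # Insert in batches so the whole file never sits in memory
                conn.executemany("INSERT INTO dns VALUES (?, ?)", batch)
                batch.clear()
    if batch:
        conn.executemany("INSERT INTO dns VALUES (?, ?)", batch)
    # Build the index after the bulk load; it makes per-IP lookups fast
    conn.execute("CREATE INDEX IF NOT EXISTS idx_dns_ip ON dns (ip)")
    conn.commit()
    return conn


def domains_on_ip(conn, ip):
    """Return all domains that resolved to the given IP, sorted by name."""
    rows = conn.execute(
        "SELECT domain FROM dns WHERE ip = ? ORDER BY domain", (ip,)
    )
    return [row[0] for row in rows]
```

A call such as `conn = import_mapping("data/domain_to_ip_resolved_118m.txt", "dns.db")` would run the import once; later sessions can reopen `dns.db` with `sqlite3.connect` and query it directly, for example with `domains_on_ip(conn, "203.0.113.45")`. Creating the index after the bulk insert rather than before is a deliberate choice, since maintaining an index during a large load slows the inserts.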

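Tags: DNS, bulk IP address processing, infrastructure mapping, threat intelligence, developer tools, cybersecurity, passive DNS, reverse engineering tools, privacy protection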