Cyber-Threat-Hunting-Playground/generate_brand_domain_impersonations_with_punycode
# Generate Brand Domain Impersonations with Punycode
A production-ready utility that generates **Punycode (`xn--`) impersonation variants** for a list of brand domains by applying Unicode TR39 visual-confusable substitutions. The output is a CSV/TSV/JSON list of `(original, punycode_variant)` pairs.
Perfect for:
- 🎯 **Threat hunting** — finding domain impersonations in DNS logs, WHOIS records, SSL certificates
- 🔍 **Brand protection** — detecting and monitoring spoofed domains
- 📊 **Research** — analyzing IDN (Internationalized Domain Name) attack patterns
- 🛡️ **Security testing** — validating DNS/email filtering defenses
## Quick Start
### 1. Provide your brand domains
# Edit the template to add your brand FQDNs (one per line)
cp brand_domains.txt.example brand_domains.txt
# Add domains like:
# example.com
# mybrand.io
# your-domain.org
### 2. Run the generator
python generate_brand_domain_impersonations.py
**Output** is written to `brand_domains_impersonation.txt`:
example.com,xn--exmple-cua.com
example.com,xn--exampl-jua.com
mybrand.io,xn--mybrand-ewa.io
...
### 3. Advanced options
| Flag | Default | Description |
|---|---|---|
| `--input PATH` | `brand_domains.txt` | Path to the domain list |
| `--output PATH` | `brand_domains_impersonation.txt` | Destination file |
| `--confusables PATH` | `unicode_TR39_confusables.txt` (bundled) | Path to the TR39 confusables file |
| `--max-substitutions N` | `1` | Replace up to N positions at once with confusables |
| `--max-variants M` | `500,000` | Stop collecting variants per domain after M unique Punycode outputs |
| `--format {csv,tsv,json}` | `csv` | Output format |
| `--deduplicate` | — | Remove duplicate Punycode variants across all input domains |
| `--workers N` | `1` | Number of worker processes for parallel processing |
| `--stats` | — | Print generation statistics |
| `--verbose` | — | Enable detailed logging |
#### Examples
Generate variants with up to two simultaneous substitutions, capped at 200,000 variants per domain:
python generate_brand_domain_impersonations.py \
--max-substitutions 2 \
--max-variants 200000
Use 4 parallel workers and get JSON output with statistics:
python generate_brand_domain_impersonations.py \
--workers 4 \
--format json \
--stats
Deduplicate and export as TSV:
python generate_brand_domain_impersonations.py \
--format tsv \
--deduplicate \
--verbose
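To see why `--max-variants` matters once `--max-substitutions` goes above 1, the sketch below counts how many candidate strings exist for a label. It is an illustration only: the function name `count_candidates` and the per-character lookalike counts are hypothetical and are not taken from the script or the TR39 data.

```python
from itertools import combinations
from math import prod


def count_candidates(label, confusable_counts, max_substitutions):
    """Count candidate strings with 1..max_substitutions positions replaced.

    confusable_counts maps a character to how many lookalikes it has
    (hypothetical numbers, not read from the TR39 file).
    """
    # Number of lookalikes available at each substitutable position.
    options = [confusable_counts[c] for c in label if c in confusable_counts]
    total = 0
    for k in range(1, max_substitutions + 1):
        # Each choice of k positions contributes the product of its options.
        for chosen in combinations(options, k):
            total += prod(chosen)
    return total


counts = {"e": 3, "x": 2, "a": 2}  # illustrative only
print(count_candidates("example", counts, 1))  # 10
print(count_candidates("example", counts, 2))  # 47
```

With the real confusables data each character can have far more lookalikes, so the count grows quickly with every extra substitution.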
## How It Works
Input: "example.com"
↓
1. Lookup visual confusables for each character:
- 'e' → ['ε', 'е', 'ℯ', ...] (Greek epsilon, Cyrillic ie, etc.)
- 'x' → ['×', 'х', ...] (multiplication sign, Cyrillic ha, etc.)
- 'a' → ['α', 'а', ...] (Greek alpha, Cyrillic a, etc.)
- 'm' → ['m', 'ᴍ', ...] (Latin m, Latin small cap m, etc.)
- 'p' → ['р', ...] (Cyrillic er, etc.)
- 'l' → ['l', 'ⅼ', '1', ...] (Latin l, small Roman numeral fifty, digit 1, etc.)
- 'o' → ['ο', 'о', '0', ...] (Greek omicron, Cyrillic o, digit 0, etc.)
↓
2. Generate combinations (k=1 means single substitutions):
- exαmple.com (a → α)
- exаmple.com (a → а, Cyrillic)
- еxample.com (e → е, Cyrillic)
↓
3. Encode to IDNA/Punycode:
- exαmple.com → xn--exmple-qxe.com ✓ (has xn--)
- example.com → example.com ✗ (no xn--, plain ASCII)
↓
Output: "example.com,xn--exmple-qxe.com"
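The three steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the script's actual code: `CONFUSABLES` is a tiny hand-picked map standing in for the bundled TR39 data, the function name is hypothetical, and only single substitutions in the first label are generated.

```python
# Tiny hand-picked map for illustration, not the bundled TR39 data.
CONFUSABLES = {
    "a": ["\u03b1", "\u0430"],  # Greek alpha, Cyrillic a
    "e": ["\u0435"],            # Cyrillic ie
    "o": ["\u03bf", "\u043e"],  # Greek omicron, Cyrillic o
}


def single_substitution_variants(domain):
    """Yield (original, punycode) pairs for one-character substitutions."""
    label, _, rest = domain.partition(".")  # only the first label is varied here
    seen = set()
    for pos, char in enumerate(label):
        for lookalike in CONFUSABLES.get(char, []):
            candidate = label[:pos] + lookalike + label[pos + 1:]
            full = candidate + "." + rest
            try:
                encoded = full.encode("idna").decode("ascii")
            except UnicodeError:
                continue  # not encodable with the stdlib IDNA codec
            # Keep only results that actually became Punycode.
            if "xn--" in encoded and encoded not in seen:
                seen.add(encoded)
                yield domain, encoded


for original, punycode in single_substitution_variants("example.com"):
    print(f"{original},{punycode}")
```

Note that Python's built-in `idna` codec implements IDNA 2003, so some characters may be normalized or rejected differently than under IDNA 2008.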
## Threat Context (MITRE ATT&CK)
This tool supports detection of activity tied to **[T1583.001 - Acquire Infrastructure: Domains](https://attack.mitre.org/techniques/T1583/001/)** and **[T1587.001 - Develop Capabilities: Malware](https://attack.mitre.org/techniques/T1587/001/)**, where adversaries:
1. **Register lookalike domains** using confusable Unicode characters
2. **Target email users** via visually identical domain names (homograph attacks)
3. **Bypass security filters** that only check ASCII domains
4. **Host phishing, credential theft, or malware distribution** on these variants
### Real-World Examples
| Brand | Spoofed Variant | Punycode | Detection Method |
|-------|-----------------|----------|------------------|
| `apple.com` | `αpple.com` (α = Greek alpha) | `xn--pple-zld.com` | DNS resolution monitoring |
| `amazon.com` | `amаzon.com` (а = Cyrillic a) | `xn--amzon-5ve.com` | Certificate transparency logs |
| `github.com` | `gіthub.com` (і = Cyrillic i) | `xn--gthub-n2e.com` | Email header analysis |
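When triaging a suspicious `xn--` domain found in logs, it can help to decode it and name the non-ASCII characters it contains. The sketch below is one possible approach using only the standard library; the helper name `describe_punycode_domain` is hypothetical, and the sample input is built in the code by encoding a known lookalike rather than taken from real traffic.

```python
import unicodedata


def describe_punycode_domain(domain):
    """Decode an xn-- domain and name each non-ASCII character in it."""
    decoded = domain.encode("ascii").decode("idna")
    findings = [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in decoded
        if ord(ch) > 127
    ]
    return decoded, findings


# Build a sample input by encoding "mybrand.io" with a Cyrillic "a" swapped in.
sample = "mybr\u0430nd.io".encode("idna").decode("ascii")
decoded, findings = describe_punycode_domain(sample)
print(sample, "->", decoded)
for ch, name in findings:
    print(f"  {ch!r}: {name}")
```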
## Performance & Caching
### IDNA Encoding Cache
The script caches `encode("idna")` results to avoid redundant Unicode→Punycode conversions:
Without cache: 1000 domains × 500 variants = 500,000 encodings
With cache: each distinct candidate is encoded once; the hit rate depends on how many duplicates the substitutions produce
View cache performance with `--stats`:
IDNA cache hits/misses: 47,382/52,618 (47.4% hit rate)
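One simple way to get this kind of cache is `functools.lru_cache`. The sketch below assumes that approach; the script's real cache may be implemented differently, and `encode_idna` is a hypothetical name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def encode_idna(domain):
    """Return the ASCII (Punycode) form of a domain, or None if not encodable."""
    try:
        return domain.encode("idna").decode("ascii")
    except UnicodeError:
        return None


# The third candidate repeats the first, so it is served from the cache.
candidates = ["ex\u03b1mple.com", "ex\u0430mple.com", "ex\u03b1mple.com"]
for name in candidates:
    encode_idna(name)

info = encode_idna.cache_info()
rate = info.hits / (info.hits + info.misses)
print(f"IDNA cache hits/misses: {info.hits:,}/{info.misses:,} ({rate:.1%} hit rate)")
```

Keep in mind that with `--workers` greater than 1 a per-process cache like this would not be shared between workers.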
### Parallel Processing
Use `--workers N` to leverage multi-core CPUs:
# 4 workers, up to 10,000 variants per domain
time python generate_brand_domain_impersonations.py \
--workers 4 \
--max-variants 10000
# Single process: ~45 seconds
# 4 workers: ~15 seconds (3× speedup)
**Note:** Parallel processing uses `multiprocessing.Pool`. Each worker gets a copy of the inverse confusables map.
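The pattern described in the note can be sketched with a pool initializer, which hands each worker its own copy of the map once instead of sending it with every task. All names here (`_init_worker`, `variants_for`, `generate`) are hypothetical and the per-domain task is simplified to single substitutions in the first label.

```python
from multiprocessing import Pool

_INVERSE_MAP = None  # per-process copy, set by the initializer


def _init_worker(inverse_map):
    """Runs once in each worker so the map is not re-sent with every task."""
    global _INVERSE_MAP
    _INVERSE_MAP = inverse_map


def variants_for(domain):
    """Simplified per-domain task: single substitutions in the first label."""
    label, _, rest = domain.partition(".")
    out = []
    for pos, ch in enumerate(label):
        for lookalike in _INVERSE_MAP.get(ch, []):
            candidate = f"{label[:pos]}{lookalike}{label[pos + 1:]}.{rest}"
            out.append((domain, candidate.encode("idna").decode("ascii")))
    return out


def generate(domains, inverse_map, workers=1):
    """Return one list of (original, punycode) pairs per input domain."""
    if workers <= 1:
        _init_worker(inverse_map)  # same code path, no extra processes
        return [variants_for(d) for d in domains]
    with Pool(workers, initializer=_init_worker, initargs=(inverse_map,)) as pool:
        return pool.map(variants_for, domains)
```

When using more than one worker, call `generate(...)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.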
## Output Formats
### CSV (default)
example.com,xn--exmple-cua.com
example.com,xn--exampl-jua.com
### TSV
example.com	xn--exmple-cua.com
example.com	xn--exampl-jua.com
### JSON
{
"version": "1.0",
"generated_at": "2026-05-27T15:37:13Z",
"variants": [
{
"original": "example.com",
"punycode": "xn--exmple-cua.com"
},
{
"original": "example.com",
"punycode": "xn--exampl-jua.com"
}
]
}
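For readers who want to produce or post-process these formats themselves, the sketch below serializes `(original, punycode)` pairs into the three shapes shown above. It mirrors the documented layouts only; `render` is a hypothetical helper, not a function exported by the script.

```python
import csv
import io
import json
from datetime import datetime, timezone


def render(pairs, fmt="csv"):
    """Serialize (original, punycode) pairs as CSV, TSV, or JSON text."""
    if fmt == "json":
        payload = {
            "version": "1.0",
            "generated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "variants": [{"original": o, "punycode": p} for o, p in pairs],
        }
        return json.dumps(payload, indent=2)
    delimiter = "\t" if fmt == "tsv" else ","
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerows(pairs)
    return buffer.getvalue()


pairs = [("example.com", "xn--exmple-cua.com")]
print(render(pairs, "csv"), end="")
print(render(pairs, "json"))
```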
## Input File Format (`brand_domains.txt`)
- One fully-qualified domain name per line (ASCII recommended)
- Lines starting with `#` and blank lines are ignored
- Trailing/leading whitespace is stripped
# My brand domains (comment line ignored)
example.com
mybrand.io
# Production domains
api.production.company.com
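The parsing rules above are simple enough to sketch directly. This assumes whitespace is stripped before the comment check, which the rules imply but do not state; `parse_domains` is a hypothetical name.

```python
def parse_domains(lines):
    """Return domains from an iterable of lines, skipping comments and blanks."""
    domains = []
    for raw in lines:
        line = raw.strip()  # drop leading/trailing whitespace and the newline
        if not line or line.startswith("#"):
            continue
        domains.append(line)
    return domains


sample = """\
# My brand domains (comment line ignored)
example.com
  mybrand.io

# Production domains
api.production.company.com
"""
print(parse_domains(sample.splitlines()))
```

To read a real file, pass an open file object instead, for example `parse_domains(open("brand_domains.txt", encoding="utf-8"))`.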
## Dependencies
- **Python 3.9+**
- **No third-party packages** — standard library only
- Bundled `unicode_TR39_confusables.txt` (Unicode Technical Standard #39)
## Files
| File | Description |
|---|---|
| `generate_brand_domain_impersonations.py` | Main script (enhanced with logging, caching, parallelization) |
| `test_generate_brand_domain_impersonations.py` | Unit tests (20+ test cases) |
| `unicode_TR39_confusables.txt` | Bundled Unicode TR39 confusables data (UTS #39) |
| `brand_domains.txt.example` | Template for the input domain list |
| `brand_domains.txt` | *(gitignored)* Your actual domain list |
| `brand_domains_impersonation.txt` | *(gitignored)* Generated Punycode variants (CSV by default) |
## Testing
Run the included unit test suite:
# With pytest (if installed)
pytest test_generate_brand_domain_impersonations.py -v
# Or with unittest (Python standard library)
python test_generate_brand_domain_impersonations.py
**Test coverage:**
- ✅ Confusables file loading and error handling
- ✅ Inverse mapping generation
- ✅ Substitution spot detection
- ✅ Character replacement logic
- ✅ IDNA encoding with caching
- ✅ Variant generation pipeline
- ✅ Domain file parsing
- ✅ Output formatting (CSV, TSV, JSON)
- ✅ Integration workflow
## Security Considerations
### ⚠️ Scope Limitations
This tool generates **visually similar variants** based on Unicode confusables, but does **NOT**:
- Register domains on your behalf (it only generates variant names)
- Perform DNS lookups or WHOIS queries
- Validate if variants are actually registered/operational
- Check HTTPS certificates for domain variants
- Simulate phishing attacks or user interaction testing
### Recommended Use Cases
✅ **Detection:**
- Monitor DNS query logs for Punycode domains in your variants list
- Check Certificate Transparency logs for issuances
- Hunt in email logs for domain-based phishing attempts
- Analyze passive DNS databases
✅ **Defense:**
- Generate baseline of expected variants for your brand
- Set up alerts in email gateways for Punycode/lookalike domains
- Implement homograph attack detection in browsers/clients
- Register defensive variants before attackers do
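As one way to act on the DNS monitoring idea above, the sketch below checks queried names against the generated CSV and also catches subdomains of a variant. It assumes the default two-column CSV output and a plain list of query names; real DNS logs will need their own parsing, and both helper names are hypothetical.

```python
import csv
import io


def load_variants(csv_text):
    """Map each Punycode variant to the brand domain it imitates."""
    rows = csv.reader(io.StringIO(csv_text))
    return {row[1].lower(): row[0] for row in rows if len(row) == 2}


def find_hits(query_names, variants):
    """Yield (queried_name, brand) for queries that equal or fall under a variant."""
    for name in query_names:
        host = name.strip().rstrip(".").lower()
        labels = host.split(".")
        # Check the name itself, then each parent domain (e.g. login.<variant>).
        for i in range(len(labels) - 1):
            parent = ".".join(labels[i:])
            if parent in variants:
                yield host, variants[parent]
                break


variants = load_variants("example.com,xn--exmple-cua.com\n")
queries = ["login.xn--exmple-cua.com.", "example.com", "XN--EXMPLE-CUA.COM"]
for host, brand in find_hits(queries, variants):
    print(f"{host} imitates {brand}")
```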
### Responsible Disclosure
If you discover active attacks using variants from this tool:
1. **Document the threat** (domain, registration details, hosting IP)
2. **Report to law enforcement** (IC3, FBI InfraGard, INTERPOL)
3. **Notify the brand owner** if not your organization
4. **Alert the domain registrar or registry** for takedown assistance
5. **Share with MISP/threat feeds** (with permission)
## Examples & Real-World Scenarios
### Scenario 1: Email Security Team Monitoring
# Generate variants for protected brands
python generate_brand_domain_impersonations.py \
--input important_brands.txt \
--output threat_variants.csv \
--max-variants 50000
# Feed into email gateway's homograph detection:
# - Block inbound/outbound email involving domains in threat_variants.csv
# - Alert on exact matches in message headers
# - Log attempts for forensics
### Scenario 2: Security Research on Phishing Datasets
# Generate variants for brands commonly targeted
python generate_brand_domain_impersonations.py \
--input phishing_targets.txt \
--format json \
--max-substitutions 2 \
--max-variants 100000 \
--workers 4 \
--stats
# Cross-reference with VirusTotal, URLhaus, PhishTank APIs
# Measure prevalence of homograph attacks in the wild
### Scenario 3: Brand Protection CI/CD Pipeline
#!/bin/bash
# Automated weekly generation for Slack alerts
python generate_brand_domain_impersonations.py \
--input company_domains.txt \
--format json \
--stats
# Check if variants are registered (using whois/API).
# check_registered_variants.py is a placeholder for your own lookup script;
# this assumes it prints one newly registered variant per line.
NEW_VARIANTS=$(python check_registered_variants.py --input brand_domains_impersonation.txt | wc -l | tr -d ' ')
# Alert on any newly registered variants
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"Found ${NEW_VARIANTS} new homograph variants\"}" \
"$SLACK_WEBHOOK_URL"
## Unicode Confusables Reference
The bundled `unicode_TR39_confusables.txt` includes:
| Latin | Lookalikes | Unicode Names |
|-------|-----------|---------------|
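| `a` | α, а, ɑ | Greek Small Letter Alpha, Cyrillic Small Letter A, Latin Small Letter Alpha |
| `e` | ε, е, ℯ | Greek Small Letter Epsilon, Cyrillic Small Letter Ie, Script Small E |
| `o` | ο, о, 0 | Greek Small Letter Omicron, Cyrillic Small Letter O, Digit Zero |
| `p` | р, ρ | Cyrillic Small Letter Er, Greek Small Letter Rho |
| `c` | с, ϲ | Cyrillic Small Letter Es, Greek Lunate Sigma Symbol |
| `l` | 1, ⅼ, І | Digit One, Small Roman Numeral Fifty, Cyrillic Capital Letter Byelorussian-Ukrainian I |
| `i` | 1, і, ı | Digit One, Cyrillic Small Letter Byelorussian-Ukrainian I, Latin Small Letter Dotless I |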
[Full Unicode TR39 specification](https://unicode.org/reports/tr39/)
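The confusables file is plain text with semicolon-separated fields (`source ; target ; type # comment`, code points in hex). The sketch below shows one way to turn it into an inverse map from an ASCII character to its lookalikes. It is an assumption about how such a map could be built, not the script's own loader: `build_inverse_map` is a hypothetical name, the sample lines are a short excerpt in the file's layout, and entries whose source or target is more than one character are skipped.

```python
from collections import defaultdict

# Excerpt in the confusables.txt layout: source ; target ; type # comment
SAMPLE = """\
0430 ;	0061 ;	MA	# ( \u0430 \u2192 a ) CYRILLIC SMALL LETTER A \u2192 LATIN SMALL LETTER A
03B1 ;	0061 ;	MA	# ( \u03b1 \u2192 a ) GREEK SMALL LETTER ALPHA \u2192 LATIN SMALL LETTER A
043E ;	006F ;	MA	# ( \u043e \u2192 o ) CYRILLIC SMALL LETTER O \u2192 LATIN SMALL LETTER O
"""


def build_inverse_map(lines):
    """Map each ASCII letter or digit to the single characters that resemble it."""
    inverse = defaultdict(list)
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue  # header debris such as a leading byte-order mark
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        # Keep only one-to-one mappings onto an ASCII letter or digit.
        if len(source) == 1 and len(target) == 1 and target.isascii() and target.isalnum():
            inverse[target].append(source)
    return dict(inverse)


for ascii_char, lookalikes in build_inverse_map(SAMPLE.splitlines()).items():
    print(ascii_char, [f"U+{ord(c):04X}" for c in lookalikes])
```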
## License
This project is part of the **Cyber Threat Hunting Playground**. See repository LICENSE for details.
## References
- [MITRE ATT&CK T1583.001](https://attack.mitre.org/techniques/T1583/001/) (Acquire Infrastructure: Domains)
- [Unicode TR39 Confusables](https://unicode.org/reports/tr39/) (Unicode Security Mechanisms, UTS #39)
- [OWASP: Internationalized Domain Names](https://owasp.org/www-community/attacks/IDN_Homograph_Attacks)
- [RFC 3490: IDNA](https://tools.ietf.org/html/rfc3490) (Internationalizing Domain Names in Applications)
- [RFC 5890: IDNA2008](https://tools.ietf.org/html/rfc5890) (Definitions and Document Framework)
## Disclaimer
This tool is provided **as-is for authorized security research and threat hunting only**. Users are responsible for:
- Complying with all applicable laws and regulations
- Obtaining proper authorization before using this tool on any systems or data
- Using results only for defensive purposes (detection, monitoring, incident response)
- Not using this tool for malicious purposes (registering deceptive domains, phishing, fraud)
Unauthorized access to computer systems is illegal. Consult your organization's security and legal teams before deployment.