Cyberfilo/scrape-gen
GitHub: Cyberfilo/scrape-gen
# scrape-gen
Interactive CLI that scrapes a given website and builds a **targeted password wordlist** from the OSINT facts it finds — company name, founding year, location, employee names, keywords — along with a companion `rationale.md` explaining *which scraped fact drove which password*.
## Status & scope
- **Type**: pentest-aid utility + portfolio piece
- **Stage**: shipped, stable for single-target use. No further features planned unless an engagement surfaces a need.
- **Persona**: one operator, one target at a time, authorized engagement only
- **Not designed for**: bulk targeting, autonomous operation, identity attacks against individuals
- **Open to scale**: no — intentionally one-site-at-a-time with politeness defaults
## What it does
1. **Scrape** — politely crawls the target site (robots.txt-aware, same-registered-domain only).
   - *On-domain* scope: seed pages only (about, contact, team, legal, ...).
   - *Extended* scope: breadth-first within the same registered domain, capped by `SCRAPEGEN_MAX_PAGES`.
2. **Extract** — pulls structured facts: company name, founding/copyright year, location hints → locale, employee names from team/about pages, emails, meta keywords.
3. **Generate** — CUPP-style combinatorial wordlist: bases (company, domain, location, people) × case variants × suffixes (years, `123`, `!`) × leetspeak × locale-aware seasons/months.
4. **Enrich (optional)** — calls GPT-5.4 (default) or Claude Opus 4.7 to propose additional bases and write a narrative rationale.
5. **Output** — `output/.wordlist.txt` and `output/.rationale.md`.
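
The combinatorial idea behind the Generate step can be sketched in a few lines of Python. This is a minimal illustration, not the code in `generator.py`: the `LEET` map, `case_variants`, and `generate` are hypothetical names, and the real tool's substitutions, ordering, and locale-aware season/month suffixes are not reproduced here.

```python
# Hypothetical leetspeak map (assumption); the real substitutions may differ.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})


def case_variants(base: str) -> list[str]:
    """Return lowercase, capitalized, and uppercase forms, de-duplicated in order."""
    return list(dict.fromkeys([base.lower(), base.capitalize(), base.upper()]))


def generate(bases: list[str], suffixes: list[str]) -> list[str]:
    """Cross bases x case variants x (plain, leet) x suffixes, keeping first-seen order."""
    seen: dict[str, None] = {}
    for base in bases:
        for variant in case_variants(base):
            # The leet form only changes lowercase letters present in LEET.
            for form in (variant, variant.translate(LEET)):
                for suffix in suffixes:
                    seen.setdefault(form + suffix)
    return list(seen)


if __name__ == "__main__":
    # "acme" and "2019" stand in for a scraped company name and founding year.
    for word in generate(["acme"], ["", "2019", "123", "!"]):
        print(word)
```

Using a dict as an ordered set keeps the output deterministic, which makes it easier to trace an entry back to the base and suffix that produced it.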
## Install
    git clone https://github.com/Cyberfilo/scrape-gen.git
    cd scrape-gen
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -e .
    cp .env.example .env  # edit to add OPENAI_API_KEY / ANTHROPIC_API_KEY
## Run
    scrape-gen  # or: python -m scrapegen
The interactive wizard walks you through: target URL → scope → LLM provider → generation components → output paths → confirmation → scrape/extract/generate.
## LLM provider
Default is **OpenAI `gpt-5.4-2026-03-05`** (set via `SCRAPEGEN_OPENAI_MODEL`). Fallback/alt is **Anthropic `claude-opus-4-7`**. Without any API key the pipeline still produces a full heuristic wordlist — the LLM only adds extra base suggestions and a polished narrative in `rationale.md`.
| Env var | Purpose |
|---|---|
| `SCRAPEGEN_PROVIDER` | `openai` (default) / `anthropic` / `none` |
| `OPENAI_API_KEY` | OpenAI key |
| `SCRAPEGEN_OPENAI_MODEL` | defaults to `gpt-5.4-2026-03-05` |
| `ANTHROPIC_API_KEY` | Anthropic key |
| `SCRAPEGEN_ANTHROPIC_MODEL` | defaults to `claude-opus-4-7` |
| `SCRAPEGEN_MAX_PAGES` | crawler cap (default 40) |
| `SCRAPEGEN_REQUEST_TIMEOUT` | per-request timeout seconds (default 15) |
| `SCRAPEGEN_USER_AGENT` | overrides default UA |
## Rationale output
`rationale.md` contains:
- The full list of extracted OSINT facts.
- Each base word used, with the scraped evidence that justified it (e.g. `Roma — scraped location mention: 'roma'`).
- The top-N generated passwords with their per-entry reason chain.
- (If LLM enrichment ran) additional bases proposed by the model and a written narrative.
This makes the wordlist defensible in a report: every entry traces back to a public fact.
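
One way to picture the per-entry reason chain is a small record that carries its own evidence. The sketch below is illustrative only: `Candidate` and `render_rationale` are hypothetical names, and the Markdown layout is an assumption rather than the actual format written by `rationale.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A generated password plus the chain of reasons that produced it."""
    password: str
    reasons: list[str] = field(default_factory=list)


def render_rationale(candidates: list[Candidate], top_n: int = 3) -> str:
    """Render the first top_n candidates as Markdown bullets with their reason chains."""
    lines = ["## Top passwords"]
    for c in candidates[:top_n]:
        lines.append(f"- `{c.password}`: {' -> '.join(c.reasons)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values are made up for illustration.
    sample = [
        Candidate(
            "Roma2019!",
            ["base 'Roma' from scraped location mention 'roma'",
             "suffix '2019' from founding year",
             "suffix '!'"],
        ),
    ]
    print(render_rationale(sample))
```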
## Project layout
    scrapegen/
      __main__.py    # python -m entry
      cli.py         # Rich + questionary TUI wizard
      config.py      # env + provider selection
      scraper.py     # httpx crawler, robots-aware, on-domain / extended
      extractors.py  # BeautifulSoup-based fact extraction
      generator.py   # CUPP-style wordlist w/ rationale per entry
      llm.py         # OpenAI + Anthropic unified interface
      rationale.py   # rationale.md builder
## Scope controls
Both crawl constraints can be toggled in step 2 of the wizard. They are politeness defaults, not security controls, and a sanctioned engagement usually wants them off.
| Flag | Wizard default | What it does when ON |
|---|---|---|
| Respect `robots.txt` | **off** | Fetches `/robots.txt` and skips disallowed URLs. |
| Follow cross-domain links | **on** (extended only) | Lets the extended crawl follow links to other registered domains (careers hosted on Greenhouse, docs on Zendesk, subsidiaries, etc.), capped at 3 pages per external domain to avoid rabbit-holing. Hard-bounded by `SCRAPEGEN_MAX_PAGES` overall. |
Other guardrails (not toggleable):
- Only fetches HTML (asset extensions like `.png`, `.pdf`, `.js` are skipped).
- Synchronous fetches with per-request timeout (`SCRAPEGEN_REQUEST_TIMEOUT`).
- `ScrapeResult.external_domains_visited` is logged in the TUI summary so you can see where the crawl actually went.
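
The interplay of the overall page cap, the per-external-domain cap, and the asset-extension filter can be sketched as a breadth-first loop. This is a simplified model, not the code in `scraper.py`: `crawl`, `get_links`, and `registered_domain` are hypothetical names, link discovery is injected as a function so the example runs offline, `robots.txt` handling is omitted, and the registered-domain check naively takes the last two host labels (which is wrong for suffixes such as `co.uk`).

```python
from collections import Counter, deque
from collections.abc import Callable, Iterable
from urllib.parse import urlparse

SKIP_EXTENSIONS = (".png", ".pdf", ".js")  # subset of asset extensions named above
MAX_PER_EXTERNAL_DOMAIN = 3


def registered_domain(url: str) -> str:
    """Naive approximation: last two host labels (assumption, no public suffix list)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def crawl(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 40,
    follow_cross_domain: bool = True,
) -> tuple[list[str], dict[str, int]]:
    """Breadth-first crawl returning visited URLs and pages fetched per external domain."""
    home = registered_domain(seed)
    queue, seen, visited = deque([seed]), {seed}, []
    external: Counter[str] = Counter()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        domain = registered_domain(url)
        if domain != home:
            # Skip external pages when disabled or once the per-domain cap is hit.
            if not follow_cross_domain or external[domain] >= MAX_PER_EXTERNAL_DOMAIN:
                continue
            external[domain] += 1
        visited.append(url)
        for link in get_links(url):
            if link in seen or urlparse(link).path.lower().endswith(SKIP_EXTENSIONS):
                continue
            seen.add(link)
            queue.append(link)
    return visited, dict(external)
```

The second return value plays the role that `ScrapeResult.external_domains_visited` has in the summary: it records where the crawl actually went.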