Starbird265/ghost-bypass
# ghost_bypass
Scrape any website. Works on Cloudflare-protected sites, WAFs, anti-bot systems, GDPR walls, and plain HTTP sites — automatically choosing the right technique.
## ✨ What makes it different
| Feature | ghost_bypass |
|---------|-------------|
| **ML level selection** | UCB1 bandit remembers which bypass level works per domain |
| **Domain-aware proxies** | Bans are tracked per domain: a proxy banned on site X can still be used on site Y |
| **12 bypass levels (L0–L11)** | Auto-escalates from fast → stealthy → headful browser |
| **CF jump logic** | Detects Cloudflare → immediately promotes to headful UC |
| **Versatile extraction** | Returns HTML, text, links, images, meta — works on any site |
| **Custom extractors** | Pass your own `fn(html, url)` to get structured data in one call |
| **Zero config** | Works out of the box with `BypassEngine()` (raises clear errors if optional extras are missing) |
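
To make the "ML level selection" row concrete, here is a minimal sketch of a per-domain UCB1 bandit over bypass levels. It is not taken from the ghost_bypass source: the `LevelBandit` class, its method names, and the exploration constant are assumptions chosen for illustration, and the library's real scoring may differ.

```python
import math
from collections import defaultdict


class LevelBandit:
    """Per-domain UCB1 over bypass levels (illustrative, not ghost_bypass code)."""

    def __init__(self, levels, c=math.sqrt(2)):
        self.levels = list(levels)
        self.c = c  # exploration weight; sqrt(2) gives the textbook UCB1 bonus
        # stats[domain][level] = [pulls, successes]
        self.stats = defaultdict(lambda: {lvl: [0, 0] for lvl in self.levels})

    def choose(self, domain):
        arms = self.stats[domain]
        # Try every level once before applying the UCB1 formula.
        for lvl in self.levels:
            if arms[lvl][0] == 0:
                return lvl
        total = sum(pulls for pulls, _ in arms.values())

        def score(lvl):
            pulls, wins = arms[lvl]
            # Observed success rate plus an uncertainty bonus.
            return wins / pulls + self.c * math.sqrt(math.log(total) / pulls)

        return max(self.levels, key=score)

    def update(self, domain, level, success):
        entry = self.stats[domain][level]
        entry[0] += 1
        entry[1] += int(success)
```

Calling `choose(domain)` and then `update(domain, level, success)` after each attempt keeps separate statistics per domain, so a level that fails on one site is not penalized on another.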
## 📚 Documentation
- [Getting Started Guide](GETTING_STARTED.md) — Prerequisites, installation (including brotli), and your first scrape.
- [How to Use Guide](HOW_TO_USE.md) — Advanced deep dive into ML Proxy Management, UCB1 Domain-aware Exploration, SiteLearner, CF Bypass logic, and 3-Tier Extraction.
## Installation
```bash
# Minimum (requests only: L0 and L11)
pip install ghost-bypass

# With playwright + selenium + TLS fingerprinting (recommended)
pip install "ghost-bypass[full]"

# Specific extras
pip install "ghost-bypass[playwright]"  # L3–L6
pip install "ghost-bypass[selenium]"    # L7–L8
pip install "ghost-bypass[tls]"         # L1–L2
```

After installing Playwright extras:

```bash
playwright install chromium
```
## Quick start
```python
from ghost_bypass import BypassEngine

engine = BypassEngine()
result = engine.scrape("https://any-website.com/page/")

print(result['success'])  # True
print(result['method'])   # "L0:L0_requests_basic"
print(result['html'])     # full page HTML
print(result['links'])    # all absolute links
print(result['images'])   # all image URLs
print(result['title'])    # page title
```
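
The feature table mentions custom extractors of the form `fn(html, url)`. The sketch below shows what a function of that shape could look like using only the standard library. The name `extract_summary` and the returned keys are hypothetical, and because this section does not show how an extractor is registered with `BypassEngine`, the function is written so it can simply be called on `result['html']` and `result['url']`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HeadingAndLinkParser(HTMLParser):
    """Collects <h1> text and raw href values while parsing."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.hrefs = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headings.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings[-1] += data


def extract_summary(html, url):
    """Hypothetical extractor with the fn(html, url) shape described above."""
    parser = _HeadingAndLinkParser()
    parser.feed(html)
    return {
        "url": url,
        "headings": [h.strip() for h in parser.headings],
        # Resolve relative hrefs against the page URL.
        "links": [urljoin(url, href) for href in parser.hrefs],
    }
```

With the quick start above, `extract_summary(result['html'], result['url'])` would return a plain dict that is easy to serialize or store.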
## Full ML stack (recommended)
```python
from ghost_bypass import BypassEngine, SiteLearner, MLProxyManager

engine = BypassEngine(
    proxy_manager=MLProxyManager(),  # domain-aware UCB proxy rotation
    site_learner=SiteLearner(),      # per-domain level memory
)
result = engine.scrape("https://cloudflare-protected-site.com/")
```
- **First run** → tries L0, L1, L2… until success.
- **Second run** → jumps directly to what worked (e.g. L3), skipping slower levels.
- **CF detected** → immediately jumps to L8 (headful UC with Turnstile support).
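
One way to read the three cases above is as a small planning loop. The sketch below is an assumption about the control flow, not ghost_bypass's actual implementation. `plan_levels`, `scrape_with_memory`, and the `attempt` callback are hypothetical names, and wrapping around to lower levels after the remembered one fails is a guess.

```python
LEVELS = [f"L{i}" for i in range(12)]  # L0 .. L11
CF_JUMP_LEVEL = "L8"                   # headful UC, per the text above


def plan_levels(domain, memory):
    """Order of levels to try: remembered level first, then the rest."""
    start = memory.get(domain, LEVELS[0])
    i = LEVELS.index(start)
    return LEVELS[i:] + LEVELS[:i]


def scrape_with_memory(domain, attempt, memory):
    """Escalate until one level succeeds.

    attempt(level) -> (success, cloudflare_seen) is a stand-in for a real fetch.
    Returns (winning_level_or_None, levels_tried).
    """
    queue = plan_levels(domain, memory)
    tried = []
    while queue:
        level = queue.pop(0)
        if level in tried:
            continue
        tried.append(level)
        success, cloudflare_seen = attempt(level)
        if success:
            memory[domain] = level  # remembered for the next run
            return level, tried
        if cloudflare_seen and CF_JUMP_LEVEL not in tried:
            queue.insert(0, CF_JUMP_LEVEL)  # skip straight to headful UC
    return None, tried
```

Here `memory` is a plain dict standing in for whatever `SiteLearner` persists. A real implementation would also need to handle timeouts and missing optional dependencies.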
## Bypass levels (L0 → L11)
| Level | Name | Technology | CF bypass |
|-------|------|-----------|-----------|
| **L0** | `requests_basic` | `requests` + real headers | ❌ |
| **L1** | `requests_tls` | `curl_cffi` Chrome TLS fingerprint | ⚠️ partial |
| **L2** | `httpx_http2` | `httpx` HTTP/2 | ❌ |
| **L3** | `playwright_stealth` | Playwright headless + stealth JS | ⚠️ partial |
| **L4** | `playwright_headful` | Playwright **visible** + stealth JS | ✅ most sites |
| **L5** | `playwright_mobile_headless` | Mobile emulation, headless | ⚠️ |
| **L6** | `playwright_mobile_headful` | Mobile emulation, **visible** | ✅ |
| **L7** | `uc_headless` | Undetected ChromeDriver headless | ✅ |
| **L8** | `uc_headful` | Undetected ChromeDriver **visible** + Turnstile | ✅✅ best |
| **L9** | `drission` | DrissionPage Chromium hybrid | ✅ |
| **L10** | `requests_html` | pyppeteer JS rendering | ⚠️ partial |
| **L11** | `mechanize` | Classic HTTP (legacy sites) | ❌ |
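
Since each level depends on a different optional package, it can help to check which levels are usable in the current environment before scraping. The sketch below does this with `importlib.util.find_spec`. The import names in `LEVEL_MODULES` are assumptions inferred from the Technology column, not a mapping published by ghost_bypass, and `available_levels` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed top-level import names for each level's backing library.
LEVEL_MODULES = {
    "L0": ["requests"],
    "L1": ["curl_cffi"],
    "L2": ["httpx"],
    "L3": ["playwright"],
    "L4": ["playwright"],
    "L5": ["playwright"],
    "L6": ["playwright"],
    "L7": ["undetected_chromedriver"],
    "L8": ["undetected_chromedriver"],
    "L9": ["DrissionPage"],
    "L10": ["requests_html"],
    "L11": ["mechanize"],
}


def available_levels(level_modules=LEVEL_MODULES, is_installed=None):
    """Return the levels whose required modules can all be imported."""
    if is_installed is None:
        # find_spec returns None when a top-level module is not installed.
        is_installed = lambda name: find_spec(name) is not None
    return [
        level
        for level, modules in level_modules.items()
        if all(is_installed(m) for m in modules)
    ]
```

Passing a custom `is_installed` callable makes the helper easy to test without installing any of the optional extras.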
## Result dict
result = engine.scrape(url)
result['success'] # bool
result['url'] # final URL after all redirects
result['status_code'] # HTTP status (or None for browser methods)
result['html'] # full page HTML
result['text'] # plain text (stripped HTML)
result['title'] # <title> tag content
result['meta'] # {name: content} for all <meta> tags
result['links'] # deduplicated list of absolute links
result['images'] # deduplicated list of absolute <img> URLs
result['scripts'] # absolute