Starbird265/ghost-bypass

GitHub: Starbird265/ghost-bypass

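An advanced stealth web-scraping framework that uses an ML algorithm to select the best strategy automatically, built to extract data from behind anti-bot systems such as Cloudflare.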

Stars: 1 | Forks: 0

# ghost_bypass

Scrape any website. Works on Cloudflare-protected sites, WAFs, anti-bot systems, GDPR walls, and plain HTTP sites, automatically choosing the right technique.

## ✨ Unique Features

| Feature | ghost_bypass |
|---------|-------------|
| **ML level selection** | A UCB1 bandit remembers which bypass level works for each domain |
| **Domain-aware proxies** | Proxy A blocked on site X ≠ proxy A blocked on site Y |
| **12 bypass levels (L0–L11)** | Automatic escalation: fast → stealth → headful browser |
| **CF jump logic** | Cloudflare detected → escalate to headful UC immediately |
| **Versatile extraction** | Returns HTML, text, links, images, and metadata; works on any website |
| **Custom extractors** | Pass your own `fn(html, url)` to get structured data in a single call |
| **Zero config** | Works out of the box with just `BypassEngine()` (raises a clear error if an optional extra is missing) |

## 📚 Documentation

- [Getting Started](GETTING_STARTED.md): prerequisites, installation (including brotli), and your first scrape.
- [How to Use](HOW_TO_USE.md): a deep dive into ML proxy management, UCB1 domain-aware exploration, SiteLearner, the CF bypass logic, and the three-tier extraction system.

## Installation

```
# Minimal (requests only: L0, L11)
pip install ghost-bypass

# With playwright + selenium + TLS fingerprinting (recommended)
pip install "ghost-bypass[full]"

# Specific extras
pip install "ghost-bypass[playwright]"   # L3–L6
pip install "ghost-bypass[selenium]"     # L7–L8
pip install "ghost-bypass[tls]"          # L1–L2
```

After installing the Playwright extra:

```
playwright install chromium
```

## Quick Start

```
from ghost_bypass import BypassEngine

engine = BypassEngine()
result = engine.scrape("https://any-website.com/page/")

print(result['success'])   # True
print(result['method'])    # "L0:L0_requests_basic"
print(result['html'])      # full page HTML
print(result['links'])     # all absolute links
print(result['images'])    # all image URLs
print(result['title'])     # page title
```

## Full ML Stack (Recommended)

```
from ghost_bypass import BypassEngine, SiteLearner, MLProxyManager

engine = BypassEngine(
    proxy_manager=MLProxyManager(),   # domain-aware UCB proxy rotation
    site_learner=SiteLearner(),       # per-domain level memory
)
result = engine.scrape("https://cloudflare-protected-site.com/")
```

**First run** → tries L0, L1, L2, ... in order until one succeeds.

**Second run** → jumps straight to the level that worked last time (e.g. L3), skipping the slower levels.

**CF detected** → jumps immediately to L8 (headful UC with Turnstile support).

## Bypass Levels (L0 → L11)

| Level | Name | Technique | CF bypass |
|-------|------|-----------|-----------|
| **L0** | `requests_basic` | `requests` + realistic headers | ❌ |
| **L1** | `requests_tls` | `curl_cffi` Chrome TLS fingerprint | ⚠️ Partial |
| **L2** | `httpx_http2` | `httpx` HTTP/2 | ❌ |
| **L3** | `playwright_stealth` | Playwright headless + stealth JS | ⚠️ Partial |
| **L4** | `playwright_headful` | Playwright **visible** + stealth JS | ✅ Most sites |
| **L5** | `playwright_mobile_headless` | Mobile emulation, headless | ⚠️ |
| **L6** | `playwright_mobile_headful` | Mobile emulation, **visible** | ✅ |
| **L7** | `uc_headless` | Undetected ChromeDriver, headless | ✅ |
| **L8** | `uc_headful` | Undetected ChromeDriver **visible** + Turnstile | ✅✅ Best |
| **L9** | `drission` | DrissionPage Chromium hybrid mode | ✅ |
| **L10** | `requests_html` | pyppeteer JS rendering | ⚠️ Partial |
| **L11** | `mechanize` | Classic HTTP (legacy sites) | ❌ |

## Result Dictionary

```
result = engine.scrape(url)

result['success']      # bool
result['url']          # final URL after all redirects
result['status_code']  # HTTP status (or None for browser methods)
result['html']         # full page HTML
result['text']         # plain text (stripped HTML)
result['title']        # <title> tag content
result['meta']         # {name: content} for all <meta> tags
result['links']        # deduplicated list of absolute <a href> links
result['images']       # deduplicated list of absolute <img src> URLs
result['scripts']      # absolute <script src> URLs
result['cookies']      # {name: value} dict
result['headers']      # response headers dict
result['method']       # e.g. "L3:L3_playwright_stealth" (format: "L{n}:{level_name}")
result['level']        # integer 0–11
result['cf_detected']  # True if Cloudflare was detected on any attempt
result['duration']     # total seconds across all attempts
result['attempts']     # list of per-attempt detail dicts
result['data']         # custom extractor output (if extractor= provided)
result['error']        # error message if failed, else None
```

## Domain-Aware Proxy Rotation

```
from ghost_bypass import MLProxyManager

mgr = MLProxyManager()

# Add your own proxies
mgr.add_proxies([
    "http://1.2.3.4:8080",
    "http://5.6.7.8:3128",
], tier="custom")

# Optionally fetch free public proxies (commented out because free proxies
# are unreliable against Cloudflare; use your own paid proxies for CF sites)
# mgr.fetch_free_proxies()

# Get the best proxy for a specific domain
proxy = mgr.get_best_proxy(domain="example.com")

# Report the outcome (feeds back into the UCB model)
mgr.report_result(
    proxy=proxy,
    domain="example.com",
    success=True,
    latency=1.2,
    cloudflare_blocked=False,
)

# Proxy reports
print(mgr.pool_summary())
print(mgr.best_for_domain("example.com", top_n=5))
print(mgr.get_banned_proxies())
print(mgr.get_banned_proxies(domain="example.com"))

# Lift a ban manually
mgr.unban_proxy("http://1.2.3.4:8080")                     # global
mgr.unban_proxy("http://1.2.3.4:8080", domain="site.com")  # domain only
```

### How Domain-Aware Banning Works

```
Proxy "http://1.2.3.4:8080"
├── global:               healthy (success_rate=0.85)
├── example.com:          healthy (3 successes, 0 failures)
├── cloudflare-site.com:  CF-BANNED for 1h (got 403)
└── slow-site.org:        domain-banned for 15m (< 15% success)
```

A proxy banned on `cloudflare-site.com` is **still usable** for `example.com`.

## Site Memory (SiteLearner)

```
from ghost_bypass import SiteLearner

sl = SiteLearner()

# What does it know about a domain?
print(sl.domain_summary("example.com"))
# {
#   "domain": "example.com",
#   "cf_detected": false,
#   "js_required": false,
#   "last_success_method": "L0_requests_basic",   # level_name format
#   "last_seen": 1716823456.0,
#   "methods_tried": 3
# }

# Get the ML-ranked level chain for a domain
print(sl.get_chain("example.com"))
# ["L0_requests_basic", "L1_requests_tls", "L3_playwright_stealth", ...]
# ^ Uses the level_name format (no "L3:" prefix). The "L3:L3_xxx" format
#   only appears in result['method'] after a scrape.
# If CF was detected earlier, methods that cannot handle CF are filtered out automatically

# All domains with stored memory
print(sl.all_domains())

# Delete the memory for a domain (resets its chain)
sl.forget_domain("example.com")
```

## Thread-Safe Proxy Leasing and Dynamic Delays

To scale high-throughput concurrent scraping without triggering IP bans or rate limits, `ghost_bypass` implements ML-driven concurrency control.

### 1. Concurrent Proxy Leasing

When multiple workers run concurrently (for example in `scrape_many`), they must never send requests to the same target domain through the same proxy IP at the same time. `MLProxyManager` enforces a leasing mechanism:

* **Acquire lease**: When a worker attempts a bypass level, it takes an *exclusive* lease on a highly rated proxy for that target domain.
* **Exclusion**: Concurrent workers requesting the same domain automatically skip leased proxies and pick the next-best alternative.
* **Release lease**: Whether the request succeeds or fails, the proxy is released back to the pool in a `finally` block.

If needed, you can turn proxy leasing off:

```
engine.scrape(url, lease_proxies=False)
```

### 2. Adaptive Rate-Limit Pacing

`SiteLearner` monitors `HTTP 429 (Too Many Requests)` rate-limit responses from target domains.

* **Automatic backoff**: On a 429, `SiteLearner` immediately raises the recommended delay for that domain.
* **Decay**: Over a run of consecutive successes, the pacing delay decays back to the configured minimum baseline.
* **Worker synchronization**: Concurrent workers in `scrape_many` coordinate through a per-domain thread lock and honor the larger of:
  * the user-specified custom/random delay (e.g. `domain_delay=(2, 5)`)
  * `SiteLearner`'s adaptive backoff delay.

To run a concurrent scrape with dynamic pacing:

```
urls = ["https://site.com/p1", "https://site.com/p2", "https://other.com/p1"]

# Concurrent scrape with 5 workers, a custom delay range, and ML pacing
results = engine.scrape_many(
    urls,
    workers=5,
    domain_delay=(2.0, 5.0)  # Random delays between 2 and 5 seconds per domain
)
```

## Three-Tier Extraction System

Extract structured data from a page right away, with or without writing code.

**Tier 1: CSS selector dictionary**

```
engine = BypassEngine()
result = engine.scrape("https://shop.example.com/product/", extract={
    "price": ".price",
    "title": "h1"
})
print(result['data'])  # {"price": "$19.99", "title": "Cool Widget"}
```

**Tier 2: Custom Python function**

```
from bs4 import BeautifulSoup

def my_extractor(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    return {"stock": soup.select_one("#stock").text}

result = engine.scrape(url, extractor=my_extractor)
```

**Tier 3: AI-powered extraction (requires `ghost-bypass[ai]`)**

Pass a plain-English prompt. A local model (Ollama, LM Studio) is detected automatically; otherwise an OpenAI/Anthropic/Gemini key is used.

```
result = engine.scrape(url, prompt="extract product name, price, and stock status")
print(result['data'])  # {"name": "Widget", "price": "$19.99", "stock": "In Stock"}
```

## Rate Limiting and Parallel Scraping

Use `scrape_many` to scrape multiple URLs in parallel. When several workers request the same domain at once, the built-in per-domain lock helps prevent IP bans.

```
from ghost_bypass import BypassEngine

engine = BypassEngine(request_timeout=30)
urls = ["https://site.com/page1", "https://site.com/page2", "https://site.com/page3"]

# 5 parallel workers, with a guaranteed 2.0s delay between requests to site.com
results = engine.scrape_many(urls, workers=5, domain_delay=2.0)
```

For a manual loop, add your own delay:

```
import random
import time

for url in urls:
    result = engine.scrape(url)
    time.sleep(random.uniform(1.0, 3.0))  # min_delay=1.0, max_delay=3.0
```

## `ghost` CLI

ghost_bypass ships with a CLI for scraping, proxy management, and AI key management.

```
# Scrape from the terminal
ghost scrape https://example.com --extract '{"title":"h1","price":".price"}'
ghost scrape https://example.com --prompt "extract product info"

# Parallel scraping
ghost scrape-many https://example.com/1 https://example.com/2 --workers 5

# Manage proxies
ghost proxy fetch   # fetch free proxies
ghost proxy ping    # test all proxies
ghost proxy list    # list healthy proxies

# Manage site memory
ghost memory list   # see which domains have CF detected

# Interactive REPL
ghost repl
# > /scrape https://example.com
# > /extract https://example.com {"title":"h1"}
# > /keys autodetect
```

### AI Keys and Local Models

Use the CLI to manage the keys used for Tier 3 extraction:

```
ghost keys autodetect         # Auto-discover Ollama/LM Studio running locally
ghost keys add openai sk-...  # Add an API key
```

## Cookie Persistence

```
from ghost_bypass import BypassEngine, CookieManager

# Cookies expire automatically after ttl_days (default: 7)
cm = CookieManager(ttl_days=7)
engine = BypassEngine(cookie_manager=cm)

result = engine.scrape("https://cf-protected-site.com/")
# On repeat visits, the saved cookies skip the CF challenge

# Manage cookies manually
print(cm.list_domains())         # domains with saved cookies
cm.clear("https://example.com")  # clear one domain
cm.clear_all()                   # wipe all
```

## Ad Blocker and Popup Closer

```
import threading

from ghost_bypass import AdBlocker, PopupCloser

# Playwright (`page` is your existing Playwright page, `original_url` the URL you opened)
blocker = AdBlocker(max_iterations=5)
blocker.handle_playwright(page, original_url)

# Selenium, blocking mode (`driver` is your existing Selenium driver)
closer = PopupCloser()
closer.close_all(driver, original_url)

# Selenium, background thread + JS interval monitor
lock = threading.Lock()
closer.start_monitoring(driver, lock, interval=2.0)
# ... do your scraping ...
closer.stop_monitoring()
```

## Human Behavior Simulation

When using Selenium/UC, `HumanBehavior` is applied **automatically** in the headful browser levels (L4, L6, L8). It provides Bézier-curve mouse movement, inertial scrolling, and realistic typing to avoid being flagged by bot-detection systems. You can also use it directly:

```
from ghost_bypass import HumanBehavior

human = HumanBehavior(min_delay=0.08, max_delay=0.45, movement_speed="medium")

# Works with any Selenium driver
human.human_scroll(driver, direction="down", smooth=True)
human.human_click(driver, element, overshoot=True)
human.type_like_human(element, "search query")
human.page_view_pattern(driver, duration=3.0)  # realistic browsing
```

## Architecture

```
ghost_bypass/
├── engine/
│   ├── engine.py          ← BypassEngine (L0–L11 dispatch + ML orchestration)
│   └── site_learner.py    ← SiteLearner (per-domain UCB method memory)
├── proxy/
│   └── manager.py         ← MLProxyManager (domain-aware UCB proxy rotation)
├── cloudflare/
│   └── handler.py         ← CloudflareHandler (detect + wait for CF to resolve)
├── ad_blocker/
│   ├── blocker.py         ← AdBlocker (overlay/modal/cookie banner closer)
│   └── popup_closer.py    ← PopupCloser (window + JS interval monitor)
└── support/
    ├── stealth.py         ← StealthConfig (anti-bot JS patches)
    ├── cookies.py         ← CookieManager (per-domain persistence, configurable TTL)
    └── human.py           ← HumanBehavior (Bézier mouse, scroll, typing; auto-applied in headful levels)
```

### Escalation Flow

```
URL requested
    │
    ├─ SiteLearner.get_chain(domain) ─→ UCB-ranked level list
    │     (new domain: L0→L11 default order)
    │     (known domain: starts at best-known level)
    │
    └─ For each level in chain:
          │
          ├─ MLProxyManager.get_best_proxy(domain) ─→ best proxy for this site
          │     (UCB: blends global + domain-specific stats)
          │
          ├─ Run level method (L0 → L11)
          │
          ├─ CF detected? ──yes──→ inject L8 as next attempt immediately
          │
          ├─ Proxy failed? ─────→ try next-best proxy for same level
          │
          ├─ Level failed? ─────→ escalate to next level
          │
          └─ Success? ──────────→ record stats, return rich result dict
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). PRs are welcome, especially new bypass levels!

## License

MIT. See [LICENSE](LICENSE) for details.

Tags: AI risk mitigation, Apex, Playwright, Selenium, Splunk, WAF bypass, anti-scraping bypass, machine learning, feature detection, runtime manipulation, reverse-engineering tools
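
As a footnote to the UCB-ranked level list in the escalation flow: this README does not show how ghost_bypass computes the ranking, so the following is only a minimal sketch of the textbook UCB1 rule, assuming a per-domain table of attempt and success counts. `LevelStats`, `ucb1_rank`, and the sample numbers are hypothetical and are not part of the ghost_bypass API.

```python
import math
from dataclasses import dataclass


@dataclass
class LevelStats:
    """Hypothetical per-domain counters for one bypass level."""
    attempts: int = 0
    successes: int = 0


def ucb1_rank(stats: dict[str, LevelStats]) -> list[str]:
    """Order level names by UCB1 score, highest first.

    Untried levels get an infinite score so each one is explored at least once.
    """
    total = sum(s.attempts for s in stats.values())

    def score(s: LevelStats) -> float:
        if s.attempts == 0:
            return math.inf
        mean = s.successes / s.attempts                         # exploitation term
        bonus = math.sqrt(2.0 * math.log(total) / s.attempts)   # exploration term
        return mean + bonus

    # sorted() is stable, so tied levels keep the insertion order of `stats`.
    return sorted(stats, key=lambda name: score(stats[name]), reverse=True)


# Illustrative counts for a single domain (made-up numbers).
stats = {
    "L0_requests_basic": LevelStats(attempts=10, successes=1),
    "L3_playwright_stealth": LevelStats(attempts=10, successes=9),
    "L8_uc_headful": LevelStats(attempts=0, successes=0),
}
ranking = ucb1_rank(stats)
print(ranking)
```

With these counts the untried level is ranked first, followed by the level with the higher observed success rate. A real implementation would likely also weigh latency and Cloudflare detection, which this sketch leaves out.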