Suuuryaa/KnowScrap
GitHub: Suuuryaa/KnowScrap
A Python web scraping framework for all scenarios, with built-in anti-bot bypass, resumable crawling, and AI extraction.
Stars: 0 | Forks: 0
Reliable Python web scraping with real, built-in anti-bot bypass.
Python core · Node.js anti-detection
PlaywrightCrawler: JavaScript-heavy sites
```
import asyncio

from knowscraper import PlaywrightCrawler, Router, Dataset

router = Router()
dataset = Dataset(name="results")


@router.default_handler
async def handler(ctx):
    title = await ctx.page.title()
    await dataset.push_data({"title": title, "url": ctx.request.url})


crawler = PlaywrightCrawler(
    router=router,
    dataset=dataset,
    stealth_mode=True,         # removes webdriver flag, spoofs plugins
    inject_fingerprint=True,   # real Chrome fingerprint via Node.js
    random_interactions=True,  # human-like mouse + scroll
)


async def main():
    await crawler.run(["https://example.com"])


asyncio.run(main())
```
AICrawler: no selectors needed
```
import asyncio

from knowscraper import AICrawler, Router, Dataset

router = Router()
dataset = Dataset(name="products")


@router.default_handler
async def handler(ctx):
    # plain English: Claude reads the page and extracts the data
    data = await ctx.extract("product name, price, rating, and availability")
    await dataset.push_data(data)

    # also works for actions
    await ctx.act("click the Accept Cookies button if it exists")


# set ANTHROPIC_API_KEY or pass api_key="sk-ant-..."
crawler = AICrawler(router=router, dataset=dataset)


async def main():
    await crawler.run(["https://example.com/products"])


asyncio.run(main())
```
Label-based routing: listing pages + product pages
```
import asyncio

from knowscraper import CheerioCrawler, Router, Dataset, Request

router = Router()
dataset = Dataset(name="products")


@router.handler("listing")
async def on_listing(ctx):
    # queue product pages, then follow pagination back into this handler
    await ctx.enqueue_links(selector="a.product-link", label="product")
    await ctx.enqueue_links(selector="a.next-page", label="listing")


@router.handler("product")
async def on_product(ctx):
    await dataset.push_data({
        "name": ctx.parsed.select_one("h1").get_text(strip=True),
        "price": ctx.parsed.select_one(".price").get_text(strip=True),
        "url": ctx.request.url,
    })


crawler = CheerioCrawler(router=router, dataset=dataset)


async def main():
    await crawler.run([Request(url="https://shop.example.com", label="listing")])


asyncio.run(main())
```
Plugins: dedup, retry, stats
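The retry plugin in this section is described as using exponential backoff (2s, 4s, 8s...). As a rough illustration of that schedule, and not KnowScrap's own implementation, the sketch below computes the delays and wraps an async call in a retry loop. `backoff_delays` and `retry_async` are hypothetical helpers; the 60-second cap and the default of three retries are assumptions.

```python
import asyncio


def backoff_delays(base_delay: float, max_retries: int, max_delay: float = 60.0) -> list[float]:
    """Return the wait before each retry: base, 2*base, 4*base, ... capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(max_retries)]


async def retry_async(func, base_delay: float = 2.0, max_retries: int = 3, sleep=asyncio.sleep):
    """Await func(); on failure wait and try again, re-raising the last error.

    `sleep` is injectable so the schedule can be checked without real waiting.
    """
    delays = backoff_delays(base_delay, max_retries)
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the original error
            await sleep(delays[attempt])
```

With `base_delay=2.0` and three retries, `backoff_delays` yields waits of 2, 4, and 8 seconds, matching the schedule in the plugin comment.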
```
from knowscraper import CheerioCrawler, DedupPlugin, RetryPlugin, StatsPlugin

stats = StatsPlugin()

# `router` is a Router set up as in the earlier examples
crawler = CheerioCrawler(
    router=router,
    plugins=[
        DedupPlugin(key="url"),       # drop duplicate records before saving
        RetryPlugin(base_delay=2.0),  # exponential backoff: 2s, 4s, 8s...
        stats,
    ],
)

# call from inside an async function, e.g. one run with asyncio.run()
await crawler.run(["https://example.com"])
print(stats.report())
# {"requests_by_domain": {...}, "avg_response_time": 0.3, "data_records_saved": 42}
```
Proxy rotation + session pool
```
from knowscraper import CheerioCrawler, ProxyConfiguration

proxy_config = ProxyConfiguration(
    proxy_urls=[
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
    ],
    rotate="round_robin",
)

# `router` is a Router set up as in the earlier examples
crawler = CheerioCrawler(router=router, proxy_configuration=proxy_config)
```
Resuming a crashed crawl
```
# First run: crawl normally
await crawler.run(["https://example.com"])

# Restart: continue from exactly where it stopped (SQLite queue persistence)
await crawler.run(resume=True)
```
Sitemap discovery
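To make the `modified_after` and `max_urls` filters used in this section concrete, the sketch below applies them to an already-downloaded sitemap using only the standard library. It is not KnowScrap's `fetch_sitemap_urls`: `filter_sitemap_urls` is a hypothetical helper, it assumes `<lastmod>` values start with a `YYYY-MM-DD` date, it expects a naive `datetime` for `modified_after`, and skipping undated entries when filtering is an assumption.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def filter_sitemap_urls(xml_text, modified_after=None, max_urls=None):
    """Return <loc> URLs from a sitemap, optionally filtered by <lastmod> and capped."""
    root = ET.fromstring(xml_text)
    urls = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        if max_urls is not None and len(urls) >= max_urls:
            break
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        if modified_after is not None:
            lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
            try:
                # Compare on the date part only (first 10 characters: YYYY-MM-DD).
                modified = datetime.strptime(lastmod[:10], "%Y-%m-%d")
            except ValueError:
                continue  # assumption: missing or unparseable dates are skipped
            if modified <= modified_after:
                continue
        urls.append(loc)
    return urls
```

A sitemap index file (one that lists other sitemaps rather than pages) would need an extra pass over its `<sitemap>` entries, which this sketch leaves out.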
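```
from datetime import datetime

from knowscraper import fetch_sitemap_urls

# call from inside an async function; `crawler` is set up as in the earlier examples
urls = await fetch_sitemap_urls(
    "https://example.com",
    modified_after=datetime(2024, 1, 1),
    max_urls=500,
)
await crawler.run(urls)
```
Tags: DFIR, MITM proxy, Playwright, Python, web crawler, anti-scraping countermeasures, no backdoor, browser automation, feature detection, reverse-engineering tools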