apify/crawlee-python

GitHub: apify/crawlee-python

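A full-featured Python library for web scraping and browser automation. It supports multiple parsing backends and proxy rotation, and is designed for building reliable crawlers.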

Stars: 9194 | Forks: 754

A web scraping and browser automation library

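Crawlee covers your crawling and scraping end-to-end and **helps you build reliable scrapers fast.** Even with the default configuration, your crawlers behave almost like a human visitor and stay under the radar of modern bot protections. Crawlee gives you the tools to crawl the web for links, extract data, and persistently store it in machine-readable formats, without having to worry about the technical details. Thanks to its rich configuration options, you can tweak almost any aspect of Crawlee to suit your project if the defaults do not fit.

We also have a TypeScript implementation of Crawlee that you can explore and use in your projects. Visit our GitHub repository for more information: [Crawlee for JS/TS on GitHub](https://github.com/apify/crawlee).

## Installation

We recommend visiting the [Introduction tutorial](https://crawlee.dev/python/docs/introduction) in the Crawlee documentation for more information.

Crawlee is available on PyPI as the [`crawlee`](https://pypi.org/project/crawlee/) package. The package contains the core functionality, while additional features are provided as optional extras to keep dependencies and package size minimal.

To install Crawlee with all features, run the following command:

```
python -m pip install 'crawlee[all]'
```

Then, install the [Playwright](https://playwright.dev/) dependencies:

```
playwright install
```

Verify that Crawlee is installed successfully:

```
python -c 'import crawlee; print(crawlee.__version__)'
```

For detailed installation instructions, see the [Setting up](https://crawlee.dev/python/docs/introduction/setting-up) documentation page.

### With Crawlee CLI

The quickest way to get started with Crawlee is to use the Crawlee CLI and select one of the prepared templates. First, make sure you have [uv](https://pypi.org/project/uv/) installed:

```
uv --help
```

If [uv](https://pypi.org/project/uv/) is not installed, follow the official [installation guide](https://docs.astral.sh/uv/getting-started/installation/).

Then, run the CLI and choose from the available templates:

```
uvx 'crawlee[cli]' create my-crawler
```

If you already have `crawlee` installed, you can start it by running:

```
crawlee create my-crawler
```

## Examples

Here are some practical examples to help you get started with the different types of crawlers in Crawlee. Each example shows how to set up and run a crawler for a specific use case, whether you need to handle simple HTML pages or interact with JavaScript-heavy sites. A crawler run creates a `storage/` directory in your current working directory.

### BeautifulSoupCrawler

The [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) downloads web pages using an HTTP library and provides the HTML-parsed content to the user. By default, it uses [`HttpxHttpClient`](https://crawlee.dev/python/api/class/HttpxHttpClient) for HTTP communication and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) for parsing HTML. It is a good fit for projects that need efficient data extraction from HTML content. Because it does not use a browser, this crawler performs very well. However, if you need to execute client-side JavaScript to get your content, it is not enough, and you will need the [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler). To use this crawler, make sure you install `crawlee` with the `beautifulsoup` extra.

```
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### PlaywrightCrawler

The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) uses a headless browser to download web pages and provides an API for data extraction. It is built on [Playwright](https://playwright.dev/), an automation library designed for managing headless browsers. It excels at retrieving pages that rely on client-side JavaScript to generate content, and at tasks that require interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler). To use this crawler, make sure you install `crawlee` with the `playwright` extra.

```
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### More examples

Browse the [Examples](https://crawlee.dev/python/docs/examples) page in the Crawlee documentation for more use cases and demonstrations.

## Features

Why is Crawlee the preferred choice for web scraping and crawling?

### Why use Crawlee instead of just an HTTP library with an HTML parser?

- Unified interface for **HTTP and headless browser** crawling.
- Automatic **parallel crawling** based on available system resources.
- Written in Python with **type hints** - better DX (IDE autocompletion) and fewer bugs (static type checking).
- Automatic **retries** on errors or when you are getting blocked.
- Integrated **proxy rotation** and session management.
- Configurable **request routing** - direct URLs to the appropriate handlers.
- Persistent **queue for URLs** to crawl.
- Pluggable **storage** of both tabular data and files.
- Robust **error handling**.

### Why use Crawlee rather than Scrapy?

- **Asyncio-based** - Built on the standard [Asyncio](https://docs.python.org/3/library/asyncio.html) library, Crawlee delivers better performance and works seamlessly with other modern asynchronous libraries.
- **Type hints** - A newer project built with modern Python, with complete type hint coverage for a better developer experience.
- **Simple integration** - Crawlee crawlers are regular Python scripts that need no additional launcher or executor. This flexibility lets you integrate a crawler directly into other applications.
- **State persistence** - Supports state persistence across interruptions, which saves time and cost by avoiding a restart of the scraping pipeline from scratch after a failure.
- **Organized data storage** - Allows saving multiple types of results in a single scraping run. Offers several storage options (see [datasets](https://crawlee.dev/python/api/class/Dataset) and [key-value stores](https://crawlee.dev/python/api/class/KeyValueStore)).

## Running on the Apify platform

Crawlee is open source and runs anywhere, but because it is developed by [Apify](https://apify.com), it is easy to set up on the Apify platform and run in the cloud. Visit the [Apify SDK website](https://docs.apify.com/sdk/python/) to learn more about deploying Crawlee to the Apify platform.

## Support

If you find any bug or issue with Crawlee, please [submit an issue on GitHub](https://github.com/apify/crawlee-python/issues). For questions, you can ask on [Stack Overflow](https://stackoverflow.com/questions/tagged/apify), in GitHub Discussions, or join our [Discord server](https://discord.com/invite/jyEM2PRvMU).

## Contributing

Your code contributions are welcome, and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see [CONTRIBUTING.md](https://github.com/apify/crawlee-python/blob/master/CONTRIBUTING.md).

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/apify/crawlee-python/blob/master/LICENSE) file for details.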
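
As a closing illustration of two items from the feature list, request routing and the URL queue, here is a minimal standard-library sketch of the underlying idea. It is not Crawlee's implementation or API: `MiniRouter`, `crawl`, `run_demo`, and the in-memory `PAGES` site are hypothetical names made up for this example, and no network access takes place.

```python
from __future__ import annotations

import asyncio
from collections import deque
from typing import Awaitable, Callable

# Hypothetical in-memory "site": URL -> (label, outgoing links).
PAGES = {
    'https://example.test/': ('LIST', ['https://example.test/a', 'https://example.test/b']),
    'https://example.test/a': ('DETAIL', ['https://example.test/']),
    'https://example.test/b': ('DETAIL', ['https://example.test/a']),
}

Handler = Callable[[str], Awaitable[None]]


class MiniRouter:
    """Maps a request label to the coroutine that handles it."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        # Decorator factory: registers the function under the given label.
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func

        return register

    async def dispatch(self, label: str, url: str) -> None:
        await self._handlers[label](url)


async def crawl(start_url: str, router: MiniRouter, max_requests: int = 10) -> list[str]:
    """Breadth-first crawl with a deduplicating queue and a request cap."""
    queue = deque([start_url])
    seen = {start_url}
    visited: list[str] = []
    while queue and len(visited) < max_requests:
        url = queue.popleft()
        label, links = PAGES[url]
        await router.dispatch(label, url)
        visited.append(url)
        # Enqueue only links that have not been queued before.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def run_demo() -> tuple[list[str], list[tuple[str, str]]]:
    router = MiniRouter()
    results: list[tuple[str, str]] = []

    @router.handler('LIST')
    async def handle_list(url: str) -> None:
        results.append(('list', url))

    @router.handler('DETAIL')
    async def handle_detail(url: str) -> None:
        results.append(('detail', url))

    visited = asyncio.run(crawl('https://example.test/', router))
    return visited, results


if __name__ == '__main__':
    print(run_demo())
```

In the Crawlee examples earlier in this README, `crawler.router.default_handler` and `context.enqueue_links()` play the corresponding roles, with `max_requests_per_crawl` acting as the request cap.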
