b3rt1ng/Octocrawl


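OctoCrawl is a lightweight asynchronous web crawler built for penetration testing and bug bounty work. It supports high-concurrency crawling, keyword search, access to authenticated pages, and modular extension.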

Stars: 13 | Forks: 1

# OctoCrawl 🐙

[![Python version](https://img.shields.io/badge/python-3.13%2B-blue.svg)](https://python.org) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A lightweight asynchronous web crawler, written in Python, for web recon.

Why OctoCrawl? Because the multi-threaded mode looks like an octopus trying to find its way... and that's reason enough!

## ✨ Features

* **Asynchronous crawling**: crawls sites quickly using `asyncio` and concurrent workers.
* **Status checking**: verifies the status code of every discovered resource.
* **Keyword search**: scans the content of text-based files for a specified list of keywords.
* **Cookie handling**: lets you pass cookies to crawl pages that require authentication.
* **Tree view**: displays the site's structure visually, directly in the terminal.
* **Module API**: run your own post-crawl scripts on OctoCrawl's results!

## 🚀 Installation

### Recommended: via pipx

`octocrawl` now supports [pipx](https://pipx.pypa.io/)!

```
# Simple install
pipx install octocrawl

# Install OctoCrawl directly from GitHub (latest version)
pipx install git+https://github.com/b3rt1ng/Octocrawl
```

To upgrade later:

```
pipx upgrade octocrawl
```

### From source

```
git clone https://github.com/b3rt1ng/Octocrawl
cd Octocrawl
pipx install .
```

## 💻 Usage

### Basic crawl

```
octocrawl https://example.org/
```

### Advanced example

Run a crawl with 15 workers, display full URLs, search for keywords, and save the report:

```
octocrawl https://example.org --workers 15 --fullpath --keywords "api,user" -o report.txt
```

### All options

```
positional arguments:
  url                     The starting URL for the crawl.

options:
  -w, --workers NUM       Number of concurrent workers (default: 80).
  -i, --ignore EXT        File extensions to ignore in the report, comma-separated.
  -d, --display EXT       File extensions to display exclusively in the report.
  --fullpath              Display full URLs in the tree report.
  -o, --output FILE       Save the report to a file (.txt or .json).
  --timeout SEC           Timeout per HTTP request in seconds (default: 10).
  -c, --cookies "k=v"     Cookies to send with requests.
  -ra, --random-agent     Randomize user agent for each request.
  --agent "UA string"     Custom User-Agent string.
  -k, --keywords "w1,w2"  Keywords to search for in pages.
  -a, --add "p1,p2"       Additional paths to crawl.
  --no-robots, -nr        Skip checking robots.txt.
  --no-sitemap, -ns       Skip checking sitemap.xml.
  --parser                HTML parser: 'lxml' (fast) or 'html.parser' (default).
  --version               Display the current version.
  -M, --modules           Modules to run after crawl (comma-separated).
  --list-modules          List all available modules.
  --module-info NAME      Show info about a specific module.
```

### Optimizing your crawl

By default, the crawler uses [html.parser](https://docs.python.org/3/library/html.parser.html), but for large sites you can switch to `lxml`. It uses a C-based engine and is noticeably faster as the number of endpoints grows:

```
pipx inject octocrawl lxml
octocrawl https://example.org --parser lxml
```

## 🔧 Modules

OctoCrawl has a module system for post-crawl scripts that process the collected data.

```
# List available modules
octocrawl --list-modules

# Run a specific module after the crawl
octocrawl https://example.org -M headers

# Run all modules
octocrawl https://example.org -M all
```

See the [MODULES.MD](https://github.com/b3rt1ng/Octocrawl/blob/main/src/octocrawl/modules/MODULES.MD) file for documentation on writing custom modules, and use the [example module](https://github.com/b3rt1ng/Octocrawl/blob/main/src/octocrawl/modules/example.py) as a starting point.
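To give a rough idea of the kind of post-crawl processing a module might perform, here is a minimal standalone sketch that groups crawled URLs by status code. It does not use OctoCrawl's actual module interface, which is described in MODULES.MD. The `results` list of `(url, status)` pairs and the `group_by_status` function are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def group_by_status(results):
    """Group URLs by HTTP status code (hypothetical helper, not part of OctoCrawl)."""
    grouped = defaultdict(list)
    for url, status in results:
        grouped[status].append(url)
    # Return a plain dict ordered by status code, with URLs sorted alphabetically.
    return {status: sorted(urls) for status, urls in sorted(grouped.items())}


# Hypothetical sample data; the real shape of OctoCrawl's results may differ.
results = [
    ("https://example.org/", 200),
    ("https://example.org/admin", 403),
    ("https://example.org/api/users", 200),
    ("https://example.org/old", 404),
]

summary = group_by_status(results)
for status, urls in summary.items():
    print(f"{status}: {len(urls)} URL(s)")
    for url in urls:
        print(f"  {url}")
```

Grouping by status is only one possible summary. The same loop could just as easily flag URLs matching a keyword list or filter by file extension.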

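Tags: Asyncio, Blue Team, Bug Bounty, Cookie authentication, ESC4, OSINT, Python, web security, binary release, recon tool, content scraping, customizable parser, big data, password management, open-source tool, asynchronous programming, backdoor-free, directory scanning, sitemap, network security, blue team analysis, computer forensics, path discovery, reverse engineering tool, privacy protection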