s0rg/crawley

GitHub: s0rg/crawley

A command-line web crawler that follows the Unix philosophy, focused on extracting and discovering links and hidden paths in web pages.

Stars: 341 | Forks: 18

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/03/557fbda703052851.svg)](https://github.com/s0rg/crawley/actions?query=workflow%3Aci) [![Go Report Card](https://goreportcard.com/badge/github.com/s0rg/crawley)](https://goreportcard.com/report/github.com/s0rg/crawley) [![libraries.io](https://img.shields.io/librariesio/github/s0rg/crawley)](https://libraries.io/github/s0rg/crawley) ![Issues](https://img.shields.io/github/issues/s0rg/crawley) [![License](https://img.shields.io/badge/license-MIT%20License-blue.svg)](https://github.com/s0rg/crawley/blob/main/LICENSE) [![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fs0rg%2Fcrawley.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2Fs0rg%2Fcrawley?ref=badge_shield) [![Go Version](https://img.shields.io/github/go-mod/go-version/s0rg/crawley)](go.mod) [![Release](https://img.shields.io/github/v/release/s0rg/crawley)](https://github.com/s0rg/crawley/releases/latest) [![Mentioned in Awesome Go](https://awesome.re/mentioned-badge.svg)](https://github.com/avelino/awesome-go) ![Downloads](https://img.shields.io/github/downloads/s0rg/crawley/total.svg)

# crawley

Crawls web pages and prints any link it can find.

# Features

- fast html SAX parser (powered by [x/net/html](https://golang.org/x/net/html))
- js/css lexical parsers (powered by [tdewolff/parse](https://github.com/tdewolff/parse)) - extract API endpoints and `url()` properties from js code
- small (below 1500 SLOC), idiomatic, 100% test-covered codebase
- grabs most useful resource URLs (images, videos, audio, forms, etc...)
- found URLs are streamed to stdout and guaranteed to be unique (fragments are ignored)
- configurable scan depth (0 by default, limited to the starting host and path)
- can be polite - follows crawl rules and sitemaps from `robots.txt`
- `brute` mode - scans html comments for URLs (this can lead to bogus results)
- makes use of the `HTTP_PROXY` / `HTTPS_PROXY` environment variables and handles proxy auth (use `HTTP_PROXY="socks5://127.0.0.1:1080/" crawley` for a socks5 proxy)
- directory-only scan mode (aka `fast-scan`)
- user-defined cookies, in curl-compatible format (e.g. `-cookie "ONE=1; TWO=2" -cookie "ITS=ME" -cookie @cookie-file`)
- user-defined headers, same as curl: `-header "ONE: 1" -header "TWO: 2" -header @headers-file`
- tag filter - lets you specify which tags to crawl (single: `-tag a -tag form`, multiple: `-tag a,form`, or mixed)
- URL ignore - lets you skip URLs containing a matching substring while crawling (e.g. `-ignore logout`)
- subdomain support - allows depth crawling of subdomains as well (e.g. `crawley http://some-test.site` will be able to crawl `http://www.some-test.site`)

# Examples

```
# print all links from the first page:
crawley http://some-test.site

# print all js files and api endpoints:
crawley -depth -1 -tag script -js http://some-test.site

# print all endpoints from js:
crawley -js http://some-test.site/app.js

# download all png images from the site:
crawley -depth -1 -tag img http://some-test.site | grep '\.png$' | wget -i -

# fast directory traversal:
crawley -headless -delay 0 -depth -1 -dirs only http://some-test.site
```

# Installation

- [binaries / deb / rpm](https://github.com/s0rg/crawley/releases) for Linux, FreeBSD, macOS and Windows.
- [archlinux](https://aur.archlinux.org/packages/crawley-bin/) you can use your favorite AUR helper to install it, e.g. `paru -S crawley-bin`.

# Usage

```
crawley [flags] url

possible flags with default values:

-all
    scan all known sources (js/css/...)
-brute
    scan html comments
-cookie value
    extra cookies for request, can be used multiple times, accept files with '@'-prefix
-css
    scan css for urls
-delay duration
    per-request delay (0 - disable) (default 150ms)
-depth int
    scan depth (set -1 for unlimited)
-dirs string
    policy for non-resource urls: show / hide / only (default "show")
-header value
    extra headers for request, can be used multiple times, accept files with '@'-prefix
-headless
    disable pre-flight HEAD requests
-ignore value
    patterns (in urls) to be ignored in crawl process
-js
    scan js code for endpoints
-proxy-auth string
    credentials for proxy: user:password
-robots string
    policy for robots.txt: ignore / crawl / respect (default "ignore")
-silent
    suppress info and error messages in stderr
-skip-ssl
    skip ssl verification
-subdomains
    support subdomains (e.g. if www.domain.com found, recurse over it)
-tag value
    tags filter, single or comma-separated tag names
-timeout duration
    request timeout (min: 1 second, max: 10 minutes) (default 5s)
-user-agent string
    user-agent string
-version
    show version
-workers int
    number of workers (default - number of CPU cores)
```

# Flags autocompletion

Crawley can handle flag autocompletion in bash and zsh via `complete`:

```
complete -C "/full-path-to/bin/crawley" crawley
```

# License

[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fs0rg%2Fcrawley.svg?type=large)](https://app.fossa.com/projects/git%2Bgithub.com%2Fs0rg%2Fcrawley?ref=badge_large)
