salecharohit/semhound

GitHub: salecharohit/semhound

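A Semgrep orchestration tool for GitHub organizations: it discovers, clones, and scans every repository in one command, with optional AI-assisted triage, so security teams can quickly locate specific code patterns at organization scale.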


# semhound

[![Release](https://static.pigsec.cn/wp-content/uploads/repos/2026/05/be69ad6af7070253.svg)](https://github.com/salecharohit/semhound/actions/workflows/release.yml) [![PyPI version](https://img.shields.io/pypi/v/semhound?cacheSeconds=300)](https://pypi.org/project/semhound) [![Python versions](https://img.shields.io/pypi/pyversions/semhound?cacheSeconds=300)](https://pypi.org/project/semhound) [![PyPI downloads](https://img.shields.io/pypi/dm/semhound?cacheSeconds=300)](https://pypi.org/project/semhound) [![License: MIT](https://img.shields.io/badge/license-MIT-blue)](LICENSE)

semhound — Hunt secrets & vulnerabilities across GitHub orgs with Semgrep + AI triage

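**semhound** automates organization-wide Semgrep scanning: you supply the rules, and it handles discovering, cloning, scanning, and reporting on every repository under one or more GitHub organizations or user accounts. Optionally, each finding can be sent to an AI provider that uses a custom prompt to separate true positives from false positives. Just as [TruffleHog](https://github.com/trufflesecurity/trufflehog) sweeps repositories for secrets, semhound sweeps repositories for any code pattern you define.

## How it works

1. **Discover**: uses `gh repo list` to find every repository for each target (organization or user)
2. **Clone**: shallow-clones each repository in parallel over SSH (`--depth 1`)
3. **Scan**: runs your Semgrep rules against each cloned repository
4. **Report**: writes a consolidated CSV (and optionally SARIF) file per target, with a GitHub permalink for each finding

## Use cases

semhound is designed for **targeted, on-demand scans**: you define a precise set of Semgrep rules and run them across many repositories to answer a specific question quickly. It is not a continuous, all-rules-on-all-repos scanner. It works best when treated as a precision instrument rather than a blanket scanning tool.

**Bug bounty SQL injection: find the same pattern in every repository**

A bug bounty report flags a SQL injection vulnerability in one of your applications. Write a Semgrep rule for that pattern and scan your entire organization to find every other repository with the same problem.

**Zero-day in a third-party OSS library: find every repository still running a vulnerable version**

A widely used library is hit by a zero-day, such as log4j. Write a Semgrep rule that matches the affected version string in dependency files and scan all of your organizations in one pass. You immediately get a list of every repository still running a vulnerable version, so you can prioritize upgrades before the vulnerability is weaponized.

## Prerequisites

The following tools must be installed and available on your `PATH`. semhound checks for all of them at startup and prints platform-specific installation instructions for any that are missing.

| Tool | macOS | Linux | Windows |
|------|-------|-------|---------|
| [GitHub CLI `gh`](https://cli.github.com) (repository discovery) | `brew install gh` | [Install guide](https://github.com/cli/cli/blob/trunk/docs/install_linux.md) | `winget install --id GitHub.cli` |
| `git` (shallow clones) | `brew install git` | `sudo apt install git` | `winget install --id Git.Git` |
| [Semgrep](https://semgrep.dev) (static analysis) | `brew install semgrep` | `pip install semgrep` | `pip install semgrep` |
| OpenSSH (cloning over SSH) | Bundled with macOS | `sudo apt install openssh-client` | Bundled with Windows 10/11 |

**Authenticate the GitHub CLI** (one time):

```
gh auth login
```

**Register an SSH key** with your GitHub account (one time) so that semhound can clone private repositories: [docs.github.com/en/authentication/connecting-to-github-with-ssh](https://docs.github.com/en/authentication/connecting-to-github-with-ssh)

## Installation

**Recommended: [`pipx`](https://pipx.pypa.io)** (macOS / Linux):

`pipx` installs CLI tools into isolated virtual environments and makes them available globally, so there is no `venv` to manage and no conflict with the system Python.

```
# Install pipx (one time)
brew install pipx               # macOS
# or: pip install --user pipx   # Linux

# Install semhound
pipx install semhound
```

**Alternative: `pip`** (inside a virtual environment):

```
python3 -m venv .venv
source .venv/bin/activate
pip install semhound
```

**From source** (for local development):

```
git clone git@github.com:salecharohit/semhound.git
cd semhound
pip install -e .
```

## Usage

```
semhound [ORG_OR_USER ...] [--orgs-file PATH]

  --rules-dir PATH   Local folder of Semgrep .yaml rule files
  --rules-url URL    HTTPS URL of a Semgrep rule file (repeatable)
  --ai-config PATH   AI provider config file (omit to skip AI triage)
  --threads N        Parallel worker threads per target (default: 5)
  --sarif            Also write a SARIF 2.1.0 report alongside the CSV
```

Pass one or more GitHub organization names or usernames inline, load a list with `--orgs-file`, or mix the two. All targets are de-duplicated and scanned in order; each target produces its own `_scan.csv` file.

```
# Single org
semhound acme-corp --rules-dir ./rules

# Single user account
semhound octocat --rules-dir ./rules

# Org and user mixed inline
semhound acme-corp octocat --rules-dir ./rules

# Orgs loaded from a file
semhound --orgs-file orgs.txt --rules-dir ./rules

# Orgs file + inline username
semhound octocat --orgs-file orgs.txt --rules-dir ./rules

# Remote rule, no local files needed
semhound acme-corp \
  --rules-url https://raw.githubusercontent.com/example/rules/main/sqli.yaml

# Full sweep: orgs file + remote rule + AI triage + 10 threads
semhound --orgs-file orgs.txt \
  --rules-dir ./rules \
  --rules-url https://raw.githubusercontent.com/example/rules/main/extra.yaml \
  --ai-config ai.config \
  --threads 10
```

`orgs.txt`: one organization name or username per line; blank lines and `#` comments are ignored.

## Semgrep rules

Rules come from a local directory (`--rules-dir`), one or more HTTPS URLs (`--rules-url`), or both. At least one source is required. Rules must be valid Semgrep `.yaml` files. Files downloaded through `--rules-url` are placed in a temporary directory and deleted when the scan finishes.

## AI triage (optional)

Copy `ai.config.example` to `ai.config`, fill in your credentials, and pass `--ai-config ai.config`. Each finding is sent to the model, which returns a **confidence score** (0–100) and a **true positive** verdict. Without `--ai-config`, those columns are left blank.

### Supported providers

| Provider | Required fields | Notes |
|----------|----------------|-------|
| `claude` | `api_key`, `model` | Anthropic API, accessed directly |
| `openai` | `api_key`, `model` | OpenAI API |
| `gemini` | `api_key`, `model` | Google Gemini API |
| `bedrock` | `aws_region`, `model` | Uses the standard AWS credential chain; no API key required |

The `system_prompt` field is optional but strongly recommended: tailoring it to your scenario produces more accurate verdicts. Use the examples below as a starting point.

### Example: bug bounty SQL injection sweep (AWS Bedrock)

No API key is required; credentials come from `~/.aws/credentials`, an IAM role, SSO, and so on. Look up the model ID under **Bedrock → Model access** in the AWS console.

```
provider: bedrock
aws_profile: default   # omit to use the default credential chain
aws_region: us-east-1
model: anthropic.claude-3-5-sonnet-20241022-v2:0
system_prompt: >
  You are an application security engineer triaging SQL injection findings
  flagged by a Semgrep rule after a bug bounty report. For each code snippet,
  assess whether user-controlled input reaches a database query without going
  through a parameterised query or ORM. Rate confidence based on how directly
  the input flows into the query. Be concise and precise.
```

### Example: zero-day library sweep (OpenAI)

```
provider: openai
api_key: sk-...
model: gpt-4o
system_prompt: >
  You are an application security engineer triaging findings from a zero-day
  sweep across the org. A CVE has been published for a specific function in a
  third-party library. For each code snippet, assess whether the flagged
  function call matches the vulnerable usage pattern described in the CVE, and
  whether any caller-side mitigations such as input validation or version
  guards are already present. Prioritise findings where the dangerous call is
  reachable with no mitigations. Be concise and precise.
```

**Live triage output:**

```
[analyze] my-repo — sqli-raw-format
[ai] my-repo — sqli-raw-format | confidence=91 true_positive=true
```

If the provider returns a response that cannot be parsed, the tool retries up to 3 times with exponential backoff (1 s → 2 s → 4 s) before logging an `ERROR`.

## Output

Results are written to a `_scan.csv` file. Pass `--sarif` to also produce a `_scan.sarif` file.

| Column | Description |
|--------|-------------|
| Repository | Repository name |
| Rule | Semgrep rule ID |
| Issue Description | Rule message |
| Location | GitHub permalink to the exact line of code |
| Confidence Score (AI) | 0–100 (blank when `--ai-config` is not used) |
| True Positive (AI) | `true` / `false` (blank when `--ai-config` is not used) |

## FAQ

### **Who is this tool for?**

semhound is built for **purple and blue teams**: security engineers who need to identify vulnerable code patterns at organization scale rather than one repository at a time. Whether you are responding to a bug bounty report, sweeping an acquired company's codebase for a CVE, or enforcing a security pattern across 200 repositories, semhound gets you the answer with a single command.

### **What authentication is required?**

semhound uses two mechanisms. `gh auth login` creates an OAuth token that is used to discover repositories through `gh repo list`. Cloning happens over SSH with a key already registered on your GitHub account. SSH is preferred over HTTPS because keys do not expire, are never embedded in URLs, and carry no credential-helper overhead when cloning hundreds of repositories in parallel.

### **Does it scan git history?**

No. semhound shallow-clones the default branch (`--depth 1`) and scans the current state of the code. It is designed for broad, fast coverage across many repositories, not for deep forensic history analysis.

### **How is it different from TruffleHog or Gitleaks?**

TruffleHog and Gitleaks are purpose-built secret scanners: they use their own built-in signatures to detect API keys, tokens, and credentials. semhound is not a secret scanner. It runs whatever Semgrep rules you give it: security vulnerabilities, dangerous function calls, vulnerable dependency versions, custom code patterns. Use TruffleHog to find secrets; use semhound when you need to hunt arbitrary code patterns at organization scale.

### **How is it different from running Semgrep directly?**

Semgrep is a scanner; it needs a target. Running it directly means cloning each repository yourself, running the command, collecting the results, and repeating. semhound wraps that whole loop: it discovers every repository in an organization or user account, clones them in parallel, runs your rules across all of them, and writes a consolidated CSV. One command replaces the shell script you would otherwise write to cover dozens or hundreds of repositories.

### **How is it different from GitHub Advanced Security (GHAS)?**

GHAS must be enabled per repository, and private repositories require a GitHub Enterprise license. semhound works with any GitHub account, needs no per-repository setup, and lets you bring your own Semgrep rules. It runs on demand, from anywhere, against any organization or user you have access to.

### **How is it different from git-secrets?**

git-secrets is a pre-commit hook that stops developers from committing secrets at commit time. semhound is a retrospective, organization-wide scanner: it scans repositories that already exist, across teams and organizations, for the patterns you define. Different problems, different tools.

### **Why does semhound only clone files up to 1 MB?**

By default, Semgrep silently skips any file larger than 1,000,000 bytes (1 MB). Downloading files above that threshold costs bandwidth and disk I/O without producing any findings. semhound therefore passes `--filter=blob:limit=1m` to `git clone` so that the clone limit matches the scanner limit, and large binaries, images, videos, and auto-generated assets are never transferred. If your rules target files larger than 1 MB (for example, large generated files or third-party bundles), raise both limits together: pass `--max-target-bytes` to Semgrep and adjust the clone filter in the source code accordingly.

### **Is semhound suitable for continuous or scheduled scanning?**

semhound is optimized for targeted, on-demand scans, not for sweeping your entire codebase from a cron job with a broad rule set. Every scan uses a shallow clone with a 1 MB blob limit (`--filter=blob:limit=1m --depth 1`) to keep transfers light, but cloning even a mid-sized organization of 200 repositories still consumes significant bandwidth and causes heavy SSD read/write cycles when run repeatedly or with many rules. The best fit is a focused set of rules triggered by a specific event: a new CVE, a bug bounty finding, or a review of an acquired codebase. Use it as a scalpel, not a lawnmower.

## Contributing

Contributions are welcome! Before opening a PR, please read [CONTRIBUTING.md](CONTRIBUTING.md), which covers branch naming, the commit message format (Conventional Commits), and how the automated release pipeline works.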
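For readers who want a concrete picture of how targets are combined, here is a minimal Python sketch of the behavior described in the usage section: inline names and `--orgs-file` entries are merged, blank lines and `#` comment lines are dropped, and duplicates are removed while keeping first-seen order. The function name `load_targets` is hypothetical and is not taken from the semhound source. The exact matching rules (for example, case sensitivity) are assumptions.

```python
from typing import Iterable, List


def load_targets(inline: Iterable[str], file_lines: Iterable[str]) -> List[str]:
    """Merge inline targets with orgs-file lines, keeping first-seen order.

    Hypothetical helper for illustration only; semhound's real parsing
    may differ in details such as case handling.
    """
    seen = set()
    targets: List[str] = []
    for raw in list(inline) + list(file_lines):
        name = raw.strip()
        # Skip blank lines and '#' comment lines.
        if not name or name.startswith("#"):
            continue
        # Drop duplicates so each target is scanned only once.
        if name not in seen:
            seen.add(name)
            targets.append(name)
    return targets


if __name__ == "__main__":
    file_lines = ["# production orgs", "acme-corp", "", "octocat"]
    print(load_targets(["octocat"], file_lines))
```

In this sketch inline names come first, so `octocat` is scanned before `acme-corp` and appears only once even though it is listed twice.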

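The AI triage retry policy described above (up to 3 retries with 1 s, 2 s, and 4 s delays before an `ERROR` is logged) could be sketched roughly as follows. This is an illustration under stated assumptions, not semhound's implementation: `call_with_backoff` is a hypothetical name, an unparseable response is modeled as a `ValueError`, and returning `None` stands in for the point where the tool would log the error.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    attempt: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run attempt(); on ValueError retry up to `retries` times, doubling the delay.

    Hypothetical helper: ValueError models an unparseable provider response.
    """
    delay = base_delay
    for retry_index in range(retries + 1):  # one initial try plus `retries` retries
        try:
            return attempt()
        except ValueError:
            if retry_index == retries:
                # All retries are used up; a real tool would log ERROR here.
                return None
            sleep(delay)
            delay *= 2  # 1 s -> 2 s -> 4 s with the defaults
    return None
```

Passing `sleep` as a parameter keeps the sketch easy to test: a test can supply a function that records the requested delays instead of actually waiting.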