quangdang46/why

GitHub: quangdang46/why

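Combines Git history with large language models to explain to developers why code exists and to assess the risk of deleting it, enabling safer refactoring.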

Stars: 0 | Forks: 0

# why

## Quick start

1. Install `why`.
2. Run the setup command:

   ```
   why config init
   ```

3. Ask a question about code history:

   ```
   why src/auth.rs:verify_token
   ```

If you remember only one setup command, make it `why config init`. It is the primary setup flow and lets you choose `anthropic`, `openai`, `zai`, or `custom`.

## Installation

### Install the released binary

```
curl -fsSL "https://raw.githubusercontent.com/quangdang46/why/main/install.sh?$(date +%s)" | bash
```

### Build locally from this checkout

```
cargo run -q -p why-core -- --help
cargo build -p why-core --release
./target/release/why --help
```

The Cargo package is named `why-core`, and the installed binary is named `why`. Installing with `cargo install why-core` is currently **not supported**, because `crates/core/Cargo.toml` still sets `publish = false`.

### Generate shell completions and a man page

```
cargo run -q -p why-core -- completions bash > why.bash
cargo run -q -p why-core -- completions zsh > _why
cargo run -q -p why-core -- completions fish > why.fish
cargo run -q -p why-core -- manpage > why.1
```

### Configure the CLI

If you skipped the quick start above, run:

```
why config init
```

This is the primary setup flow and lets you interactively choose `anthropic`, `openai`, `zai`, or `custom`.

### Verification and benchmarks

```
cargo fmt --all -- --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace --all-features
```

Opt-in real-repository CLI coverage:

```
WHY_REAL_REPO_PATH=/absolute/path/to/git/checkout \
cargo test -p why-workspace --test integration_real_repo_cli -- --nocapture
```

- Real-repository tests are opt-in, so the default test suite stays deterministic.
- Before running `why`, they clone the provided checkout into a temporary test repository.

Benchmark commands:

```
cargo bench --package why-workspace --bench cache_bench
cargo bench --package why-workspace --bench archaeology_bench
cargo bench --package why-workspace --bench scanner_bench
```

GitHub Actions also exposes the same Criterion runs through `.github/workflows/bench.yml` and uploads `target/criterion/**` as artifacts.

## The problem

Claude Code has no access to git history. It sees only the code as it is now, with no context about questions such as:

- Why was the code written this way?
- Was this a hotfix for an incident?
- Is this "temporary" code from 2019 that became permanent?
- What will break if I delete this code?

Developers routinely delete code that looks dead but is actually a critical security fix. Claude Code does the same thing, because it cannot read the past.

## Why `why` is different

| | Claude Code alone | why |
|---|---|---|
| Understands git history | ❌ | ✅ |
| Reads PR descriptions | ❌ | ✅ |
| Explains the reasons behind commits | ❌ | ✅ |
| Assesses risk before deletion | ❌ | ✅ |
| Links code to past incidents | ❌ | ✅ |

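
**This is the only tool in this suite that Claude Code literally cannot replicate**, because git history never appears in any context window.

## How it works

```
1. Identify (pure Rust, git2 crate)
   └── tree-sitter locates exact byte range of target function/line
   └── git2 runs git blame on that range
   └── collect all unique commits that touched those lines

2. Gather (pure Rust, git2 crate)
   └── for each commit: message, author, date, diff
   └── check for PR refs (#123), issue refs (fixes #456)
   └── extract comments and TODOs near the target code

3. Synthesize (LLM call - configured provider)
   └── feed structured git data to the configured model
   └── ask: "why does this exist? what risk if removed?"
   └── returns human-readable explanation + risk level
```

Each query makes only **one LLM call**, with structured git data as its input.

## Tech stack

| Crate | Purpose |
|---|---|
| `git2` | Native git operations, no git binary required |
| `tree-sitter` | Precise location of function boundaries |
| Provider-aware HTTP client | Synthesizes git data into an explanation through Anthropic or an OpenAI-compatible provider |
| `clap` | CLI |
| `serde_json` | Structured output |

## Usage

### Query targets

`why` uses positional target syntax rather than `fn|file|line` subcommands.

Supported query forms:

- `why <file>:<line>`
- `why <file>:<symbol>`
- `why <file>:<qualified_symbol>`
- `why <file> --lines <START:END>`

Languages with symbol resolution:

- Rust (`.rs`)
- Go (`.go`)
- JavaScript (`.js`)
- TypeScript (`.ts`, `.tsx`)
- Java (`.java`)
- Python (`.py`)

Important rules:

- Line numbers are 1-based.
- `--lines` must use the `START:END` format.
- Do not combine `--lines` with `:<line>` or `:<symbol>`.
- A bare file path is not a valid query unless it is paired with `--lines`.

### Common query examples

```
# Why was this specific line written?
why src/auth.rs:42

# Why does this line range exist?
why src/auth.rs --lines 40:45 --no-llm

# Why does this symbol exist?
why src/auth.rs:verify_token

# Qualified symbol query for a Rust impl method
why src/auth.rs:AuthService::login --team

# Machine-readable archaeology output
why src/auth.rs:verify_token --json

# Limit archaeology to recent commits
why src/auth.rs:verify_token --since 30

# Inspect files that have historically changed together with this target
why src/auth.rs:verify_token --coupled

# Show likely owners and bus-factor signals
why src/auth.rs:verify_token --team

# Walk past mechanical edits to find the likely true origin commit
why src/auth.rs:verify_token --blame-chain

# Show rename-aware evolution history for the target
why src/auth.rs:verify_token --evolution

# Ask whether a symbol should be split
why src/auth.rs:verify_token --split

# Write an evidence-backed annotation above the target
why src/auth.rs:verify_token --annotate

# Refresh the report when the file changes
why src/auth.rs:verify_token --watch --no-llm

# Check whether renaming a Rust symbol is safe
why src/auth.rs:verify_token --rename-safe
```

### Query flags

| Flag | Purpose | Notes |
|---|---|---|
| `--json` | Emit machine-readable results | Works with the main query flow and many subcommands |
| `--no-llm` | Skip LLM synthesis | Suited to CI, local verification, or environments without a key |
| `--no-cache` | Bypass cached results | Forces a fresh query instead of reusing `.why/cache.jsonl` |
| `--since <N>` | Limit history to recent commits | Applies to query-style archaeology and report modes |
| `--coupled` | Show file-level co-change coupling | Useful before a large refactor |
| `--team` | Show ownership and bus-factor signals | Useful before choosing reviewers |
| `--blame-chain` | Skip likely mechanical commits | Helps find the true origin of a line or symbol |
| `--evolution` | Show rename-aware target history | Timeline-style output |
| `--split` | Suggest whether a symbol should be split | Symbol-oriented query mode |
| `--annotate` | Insert a short, evidence-backed doc annotation above the target | This modifies the file |
| `--watch` | Re-run the default report when the file changes | Requires an interactive terminal |
| `--rename-safe` | Show target risk plus caller risk for rename analysis | Currently supports Rust symbol targets only |

### Repository-wide and review commands

Most report-style subcommands also support `--json`.

```
# Rank repository hotspots by churn × heuristic risk
why hotspots --limit 10

# Repository health summary
why health
why health --ci 80

# Generate a reviewer-friendly PR template from the staged diff
why pr-template

# Review the staged diff with archaeology-backed findings
why diff-review --no-llm
why diff-review --post-github-comment --github-ref '#42'

# Rank suspicious commits within an incident window
why explain-outage --from 2025-11-03T14:00 --to 2025-11-03T16:30

# Cross-reference high-risk functions with coverage data
why coverage-gap --coverage lcov.info

# Find high-risk functions that appear uncalled under static analysis
why ghost --limit 10

# Rank the symbols a new engineer should learn first
why onboard --limit 10

# Find stale TODOs, HACK/TEMP markers, and expired remove-after dates
why time-bombs --age-days 180
```

Key behavior notes:

- `why pr-template` reads the **staged diff**, not unstaged changes.
- `why diff-review` also reads the **staged diff**.
- `why diff-review --post-github-comment` requires a valid GitHub reference (for example `#42`) and a configured GitHub remote/token path.
- `why ghost` uses heuristic static analysis and warns about this in its terminal output.
- `why health --ci <threshold>` exits with code `3` when the debt score exceeds the threshold.
- The `why health` regression gate exits with code `4` when a configured regression budget fails.

### Integration and developer commands

```
# Run the MCP stdio server
why mcp

# Run the hover-focused LSP server over stdio
why lsp

# Start the interactive archaeology shell
why shell

# Emit shell wrappers for supported AI tools
why context-inject

# Install or remove managed git hooks
why install-hooks --warn-only
why uninstall-hooks

# Generate shell completions or a man page
why completions bash > why.bash
why completions zsh > _why
why completions fish > why.fish
why manpage > why.1
```

More details:

- `why shell` starts an interactive shell with index-backed completion.
- Shell queries run in `--no-llm` mode by default.
- Built-in shell commands include `help`, `reload`, `hotspots`, `health`, `ghost`, `exit`, and `quit`.
- `why lsp` is a hover-focused LSP server that returns Markdown hover content plus a CLI hint for the full report.
- `why context-inject` emits shell code intended to be used as follows:

      eval "$(why context-inject)"

  The generated wrappers currently target supported prompt tools such as `claude`, `sgpt`, and `llm`.

### Historical Node prototype

A Node.js prototype still exists under `poc/`, but it is **not** the released interface. Examples like the following belong to the prototype only:

```
node poc/index.js fn verifyToken src/auth.js
node poc/index.js file src/legacy/payment_v1.js
node poc/index.js fn verifyToken src/auth.js --raw
```

The Rust CLI documented above is the supported interface for the current tool.

## Output example

```
$ why src/auth.rs:42

why: src/auth.rs (line 42)

Commits touching this line:
  a3f9b2c  alice  2024-01-12  fix: tokens not expiring on logout
  8d2e1f4  bob    2022-09-04  extend auth flow for refresh token handling

No LLM synthesis (--no-llm or no API key). Heuristic risk: MEDIUM.
```

## Risk semantics and explanation style

`why` should make conservative change decisions easier, not sound more certain than the evidence supports.

### Risk levels

- **HIGH**: The code shows security sensitivity, incident history, critical backward-compatibility behavior, or other signals that removing it could break production behavior in non-trivial ways. Treat this as a stop-and-investigate signal: do not delete or substantially refactor the code without deeper review.
- **MEDIUM**: The code appears to be tied to migrations, retries, legacy paths, or transitional behavior. Changes may be safe, but only once the surrounding context is understood.
- **LOW**: The available history and nearby code show no special operational or compatibility pressure. Absent stronger evidence, this is ordinary utility code.

### Explanation style rules

- Separate **evidence** from **inference**. Commit messages, comments, and code markers are evidence; conclusions drawn from them are inference.
- State **unknowns** explicitly when the history is sparse, noisy, or ambiguous.
- Do not invent incidents, PR context, or dependencies that are not in the evidence.
- Keep output skimmable: a concise summary first, then supporting history, then risk.
- Calibrate confidence downward when only 1-2 commits or weak signals are available.

### Confidence guide

`why` models confidence internally as an enum and serializes it as one of the following JSON/string values:

- **low**: thin history, weak commit messages, or little corroborating context.
- **medium**: some useful history signals, but limited direct evidence.
- **medium-high**: clear historical intent, such as a hotfix, incident, or compatibility trail.
- **high**: multiple corroborating sources point to the same explanation.

## Integrating with Claude Code

Add the following to your project's `CLAUDE.md`:

```
## Custom tools

- `why <file>:<line>` - explain why a specific line was written
- `why <file> --lines <START:END>` - explain why a line range exists
- `why <file>:<symbol>` - explain why a supported symbol exists
- `why <file>:<symbol> --coupled` - inspect co-change dependencies before a deeper refactor
- `why <file>:<symbol> --team` - identify likely owners before asking for review on risky code
- `why <file>:<symbol> --blame-chain` - skip mechanical edits to find the real origin commit
- `why <file>:<symbol> --evolution` - inspect rename-aware target history before large moves
- `why diff-review --no-llm` - review the staged diff before opening a PR
- `why health --json` - export a machine-readable repo health snapshot

**Always run `why` before deleting or significantly refactoring any function that exists in git history for more than 6 months.**
```

Recommended Claude Code workflow:

1. Before deleting or rewriting unfamiliar code, run `why` on the exact symbol or line range.
2. If the reported risk is **HIGH**, treat it as a stop-and-investigate signal, not as a suggestion to proceed quickly.
3. For large refactors, also run `--coupled`, `--team`, `--blame-chain`, or `--evolution`, depending on what you need to learn.
4. Before opening a PR, run `why diff-review` on the staged diff.
5. For editor and tool integration, choose the interface that matches your workflow:
   - `why mcp` for MCP-capable editors
   - `why lsp` for hover-focused editor integration
   - `eval "$(why context-inject)"` for shell-wrapped prompt tools

Recommended code review routine:

- Include `why ... --json` or the terminal summary when proposing the deletion of stale-looking code.
- Use `why ... --team` when a change touches operationally sensitive paths and you need to find the best reviewer.
- Use `why ... --coupled` before splitting or relocating historically noisy functions.
- Use `why diff-review` to summarize the risk of staged changes before sharing a branch.

For MCP-specific setup examples, see `docs/mcp-setup.md`.

## Configuration and credentials

`why` supports layered configuration:

1. Built-in defaults
2. Global config at `$XDG_CONFIG_HOME/why/why.toml` or `~/.config/why/why.toml`
3. Repository-local `why.local.toml`

Manage these layers with the CLI:

```
# Global config is the default target
why config init --provider anthropic --model claude-haiku-4-5-20251001

# Use --local for repo-specific overrides
why config init --local --provider zai --model glm-5
why config init --local --provider custom --model local-model --base-url https://api.example.com/v1/chat/completions

# Inspect the effective merged config without printing secrets
why config get
why config get --json
```

If you run `why config init` in an interactive terminal without passing values through flags, the CLI prompts you for the provider, model, base URL, auth token, retries, max tokens, and timeout. You can leave a value blank to keep the current value or the provider default, and edit `why.toml` or `why.local.toml` later.

Supported providers:

- `anthropic`
- `openai`
- `zai`
- `custom` (OpenAI-compatible)

Current built-in defaults:

- `anthropic` → model `claude-haiku-4-5-20251001`, base URL `https://api.anthropic.com/v1/messages`
- `openai` → model `gpt-5.4`, base URL `https://api.openai.com/v1/chat/completions`
- `zai` → model `glm-5`, base URL `https://api.z.ai/api/anthropic/v1/messages`
- `custom` → no built-in model or base URL

`why config get` redacts secrets and reports whether auth is configured through `llm.auth_configured`. Environment variables take precedence over config values. Empty values are ignored.

Provider credential environment variables:

```
export ANTHROPIC_API_KEY=your_anthropic_api_key_here
export OPENAI_API_KEY=your_openai_api_key_here
export ZAI_API_KEY=your_zai_api_key_here
export CUSTOM_API_KEY=your_custom_api_key_here
```

Example config:

```
[risk]
default_level = "LOW"

[risk.keywords]
high = ["pci", "reconciliation"]
medium = ["terraform", "webhook", "idempotency"]

[git]
max_commits = 8
recency_window_days = 90
mechanical_threshold_files = 50
coupling_scan_commits = 500
coupling_ratio_threshold = 0.30

[cache]
max_entries = 500

[llm]
provider = "openai"
model = "gpt-5.4"
base_url = "https://api.openai.com/v1/chat/completions"
auth_token = "your_provider_token_here"
retries = 3
max_tokens = 500
timeout = 30

[github]
remote = "origin"
# token = "ghp_..."  # optional fallback; prefer the GITHUB_TOKEN env var
```

`[risk.keywords]` extends the built-in heuristic vocabulary with team-specific or domain-specific terms. Matching is case-insensitive and can affect both the relevance ranking of evidence and the heuristic risk level.

For GitHub enrichment, set `GITHUB_TOKEN` when the environment makes it available; the config can also include an optional `[github]` fallback token and remote name. Here too, environment variables take precedence over config, and empty values are ignored.

Secret handling guidance:

- Prefer environment variables whenever possible.
- Global config is acceptable for local development if you choose to use it.
- Repository-local `why.local.toml` should generally not contain secrets, because it is easier to commit by accident.

For a fully documented example of the current config surface, see `.why.toml.example`.

## Cache and `.why/` directory semantics

Current behavior:

- Query results are cached in `.why/cache.jsonl` at the repository root, one JSON object per line.
- The cache key includes the target and the current `HEAD` hash prefix, so changes to history naturally invalidate earlier entries.
- Terminal output shows `[cached]` when a stored `WhyReport` is reused.
- `--no-cache` bypasses cache reads and forces a fresh query.
- `[cache].max_entries` controls how many query reports are retained in `.why/cache.jsonl`.
- Rolling health snapshots are stored separately in `.why/health.jsonl`, one JSON object per line.
- At most 52 health snapshots are retained.
- CI can enforce a health regression budget using `.github/health-baseline.json`.

Operational expectations:

- Treat `.why/` as local runtime state, not as source-controlled project state.
- In normal development workflows, `.why/` should be git-ignored.
- On Unix, the cache directory and runtime files are written with owner-only permissions (`0700` for `.why/`; `0600` for `cache.jsonl`, `health.jsonl`, and `runtime.log`).
- Deleting `.why/cache.jsonl` is safe if you want to clear local cached results; `why` recreates it on the next cached run.
- Deleting `.why/health.jsonl` is safe if you want to reset the local health trend history.
- When synthesis fails and `why` falls back to heuristic mode, the LLM fallback reason is appended to `.why/runtime.log`.

## `why doctor`

Use `why doctor` to verify the current effective setup and to perform one small live LLM test call.

```
why doctor
why doctor --json
```

It reports:

- the effective config path and the resolved LLM settings,
- whether auth is configured,
- whether the LLM client can be initialized,
- whether the live LLM call succeeded.

If the live call fails, `why doctor` reports the error directly, and the runtime log remains available at `.why/runtime.log`.

## Index location

There is no persistent index; `why` reads git history on demand. This is fast enough for interactive use (roughly 1-3 seconds per query).

### Health regression gate

Use `why health` together with a committed baseline to fail whenever any debt score or signal regresses:

```
cargo run -p why-core --bin why -- health \
  --baseline-file .github/health-baseline.json \
  --require-baseline \
  --max-regression 0 \
  --max-signal-regression time_bombs=0 \
  --max-signal-regression high_risk_files=0 \
  --max-signal-regression hotspot_files=0 \
  --max-signal-regression stale_hacks=0
```

After a known-good change lands on the main line, update `.github/health-baseline.json` intentionally by re-running:

```
cargo run -p why-core --bin why -- health --json --write-baseline .github/health-baseline.json
```

Exit code summary:

- `0`: checks passed
- `3`: CI threshold failed (`--ci`)
- `4`: regression gate failed (`--max-regression` / `--max-signal-regression`)

## Roadmap

- [ ] GitHub/GitLab PR title and description integration (via API)
- [ ] Jira/Linear ticket parsing from commit messages
- [ ] `why --since <N>` for recent-change context
- [ ] Team blame: who knows this code best?
- [ ] VS Code extension that shows `why` inline on hover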
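
The exit codes in the exit code summary (`0`, `3`, and `4`) can be turned into readable CI messages with a small helper. The following is a minimal sketch, assuming a POSIX shell; the function name `describe_health_exit` and the message wording are hypothetical and not part of `why` itself:

```shell
#!/bin/sh
# Hypothetical helper (not shipped with `why`): maps a `why health` exit
# code to a short message for CI logs. The codes 0, 3, and 4 come from the
# exit code summary; any other code is reported rather than interpreted.
describe_health_exit() {
  case "$1" in
    0) echo "health checks passed" ;;
    3) echo "CI threshold failed (--ci)" ;;
    4) echo "regression gate failed (--max-regression / --max-signal-regression)" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}
```

In a CI step, one could run `why health --ci 80`, capture `$?`, and pass it to `describe_health_exit` before re-exiting with the same code, so the job log states which gate failed.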
