codegraph-ai/CodeGraph

GitHub: codegraph-ai/CodeGraph

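CodeGraph builds a cross-language semantic graph of a codebase and exposes it through an MCP tool set and a VS Code extension, giving AI agents structured code understanding without repeatedly grepping files.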

Stars: 44 | Forks: 6

# CodeGraph

**Cross-language code intelligence for AI agents and developers.**

[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)

CodeGraph builds a semantic graph of your codebase (functions, classes, imports, call chains) and exposes it through **45 MCP tools**, a **VS Code extension**, and a **persistent memory layer**. It parses **37 languages** via tree-sitter. AI agents get structured code understanding instead of repeatedly grepping through files.

## Quick start

### MCP Server (Claude Code, Cursor, any MCP client)

Add to `~/.claude.json` (or your MCP client's config):

```
{
  "mcpServers": {
    "codegraph": {
      "command": "/path/to/codegraph-server",
      "args": ["--mcp"]
    }
  }
}
```

The server indexes the current working directory automatically.

### VS Code extension

Install the VSIX:

```
code --install-extension codegraph-0.14.0.vsix
```

The extension starts the server automatically and registers all tools as Language Model Tools for Copilot.

### Rules for AI agents

Pre-configured rule files teach AI coding agents (Claude, Cursor, Windsurf, Codex, Cline) to use the CodeGraph MCP tools before falling back to grep or multi-file reads. They map natural-language intent to the right `codegraph_*` tool.

→ **[codegraph-ai/codegraph-rules-for-agents](https://github.com/codegraph-ai/codegraph-rules-for-agents)**

Setup is a single `cp /codegraph.md ~//` (one line per agent; see the rules repo's README for details).

### GitHub Action: PR review in CI

Add a workflow to your repo to get an automated code-graph analysis comment on every PR: blast radius, test gaps, stale docs, and suggested reviewers. It runs in **graph-only** mode (no embeddings, no ONNX model), so it is fast and needs no API key, only the built-in `GITHUB_TOKEN`.

Copy [`.github/workflows/codegraph-pr.yml`](.github/workflows/codegraph-pr.yml) into your repo. The core invocation is a single command:

```
codegraph-server --graph-only \
  --run-tool codegraph_pr_context \
  --tool-args '{"baseBranch":"main","format":"markdown"}'
```

This prints a ready-to-post Markdown comment. The `--graph-only` flag skips embedding generation (10-50× faster indexing); `--run-tool` runs a single tool and exits without the MCP stdio handshake, which suits scripted use.

## Configuration

### MCP Server flags

| Flag | Default | Description |
|------|---------|-------------|
| `--workspace` | current directory | Directory to index (repeat for multiple projects) |
| `--exclude` | none | Directory to skip (repeatable) |
| `--embedding-model` | `bge-small` | `bge-small` (384d, fast), `jina-code-v2` (768d, 6× slower), `granite-97m` (384d, 32K ctx, ~3× slower), or `static` (model2vec, 256d; ~100× faster indexing, no ONNX; needs a local model directory, see below) |
| `--full-body-embedding` | `true` | Embed full function bodies (about 50 lines) for better semantic search and duplicate detection |
| `--max-files` | 5000 | Maximum number of files to index |
| `--profile` | `all` | Filter the exposed MCP tool set down to a named subset (see below) |
| `--graph-only` | off | Skip embedding generation: build only the graph and serve the structural tools. No ONNX model is loaded and indexing is 10-50× faster. Semantic search is unavailable. Intended for CI and one-off graph queries. |
| `--run-tool` | none | One-shot mode: index, run a single tool, print the result, exit. No MCP handshake. Use together with `--tool-args ''`. |

#### `--embedding-model static`: fast model2vec indexing

Static (model2vec) embeddings replace the ONNX transformer with a token→vector lookup table. Indexing is **~100× faster** (embedding this repo's 5,873 symbols takes about 1 second, versus about 3.4 minutes with BGE), and there is **no ONNX runtime or 1.5 GB RAM floor**. Retrieval stays **hybrid (BM25 + semantic)**, so end-to-end quality reaches **~90%** of BGE.

The VS Code extension bundles the model, so `static` works there without setup. For the CLI/MCP server it needs a local model directory (`config.json` + `tokenizer.json` + `model.safetensors`):

- Point to it with `CODEGRAPH_STATIC_MODEL=/path/to/model` (or, in VS Code, override the bundled model with the `codegraph.staticModelPath` setting). Default: `~/.codegraph/static_models/jina-code-static-256`.
- Distill one from any sentence-transformer (the Apache-2.0 Jina-Code by default) in about 30 seconds on CPU: `python scripts/distill_static_model.py`.

#### `--profile`: narrowing the MCP tool set

The full 32-tool set is convenient, but it adds to the agent's prompt-context cost. A profile exposes only the slice you need (it can also be set through the `CODEGRAPH_TOOL_PROFILE` environment variable):

| Profile | Tools | Use case |
|---------|-------|----------|
| `all` *(default)* | Every tool (Community + Pro) | General sessions |
| `core` | 8: search + symbol info + AI context | Chat-style agent sessions that only need lookups |
| `graph` | 16: callers/callees/deps/impact/traverse | Refactoring + structural analysis |
| `memory` | 7: `codegraph_memory_*` only | Notes / knowledge-base workflows |
| `security` | Pro security tools only (empty in Community) | Pro security audits |

### VS Code settings

```
{
  "codegraph.indexOnStartup": true,
  "codegraph.indexPaths": ["/path/to/project-a", "/path/to/project-b"],
  "codegraph.excludePatterns": ["**/cmake-build-debug/**", "**/generated/**"],
  "codegraph.embeddingModel": "bge-small",   // or "static" for ~100× faster indexing
  "codegraph.staticModelPath": "",           // model2vec model dir when embeddingModel is "static"
  "codegraph.maxFileSizeKB": 1024,
  "codegraph.debug": false
}
```

Full-body embedding is enabled by default. Function body text is captured at parse time, so it adds no extra I/O.

Built-in exclusions (always skipped) cover about 47 directories in three categories:

- **Build / cache**: `node_modules`, `target`, `dist`, `build`, `out`, `.git`, `__pycache__`, `vendor`, `.venv`, `venv`, `.tox`, `.pytest_cache`, `.mypy_cache`, `.ruff_cache`, `.next`, `.nuxt`, `.svelte-kit`, `.parcel-cache`, `.npm`, `.yarn`, `.pnpm-store`, `.cache`, `.cargo`, `.bundle`, `.gradle`, `DerivedData`, `Pods`, `xcuserdata`, `cmake-build-*`
- **IDE / IaC state**: `.idea`, `.vscode-test`, `.fleet`, `.terraform`, `.terragrunt-cache`, `.serverless`
- **Sensitive credential directories**: `.aws`, `.ssh`, `.gnupg`, `.kube`, `.docker`

On top of these, glob patterns cover binary archives, native libraries, OS metadata, and **sensitive file extensions** (`*.pem`, `*.key`, `*.p12`, `*.pfx`, `*.crt`, `*.gpg`, `*.kdbx`, SSH key conventions such as `id_rsa`, and so on) as defense in depth against accidentally embedding credentials.

## Tools (42 Community + 27 Pro, 17 security tools)

### Code analysis (11)

| Tool | What it does |
|------|-------------|
| `get_ai_context` | **The primary context tool.** Intent-aware (explain/modify/debug/test) with token budgeting. Returns source, related symbols, imports, siblings, and debugging hints. |
| `get_edit_context` | Everything needed before an edit: source + callers + tests + memories + git history |
| `get_curated_context` | Cross-codebase context for a natural-language query ("how does authentication work?") |
| `analyze_impact` | Blast-radius prediction: what breaks if you modify, delete, or rename a symbol |
| `analyze_complexity` | Cyclomatic complexity with a breakdown (branches, loops, nesting, exceptions, early returns) |
| `find_circular_deps` | Detects circular import/dependency chains across files |
| `find_hot_paths` | Most-called functions, ranked by transitive caller count |
| `find_dead_imports` | Finds unused imports: modules that are imported but never referenced |
| `get_module_summary` | High-level overview of a directory: file count, function count, language mix, most complex functions |
| `search_by_pattern` | Regex search across function bodies, signatures, names, and doc comments |
| `search_by_error` | Finds functions that throw, catch, or handle a given error type |

### Code navigation (13)

| Tool | What it does |
|------|-------------|
| `symbol_search` | Find symbols by name or natural language (hybrid BM25 + semantic search) |
| `get_callers` / `get_callees` | Who calls it? What does it call? (supports transitive depth) |
| `get_detailed_symbol` | Full symbol information: source, callers, callees, complexity |
| `get_symbol_info` | Quick metadata: signature, visibility, kind |
| `get_dependency_graph` | File/module import relationships with depth control |
| `get_call_graph` | Function call chains (callers and callees) |
| `find_by_imports` | Find files that import a given module |
| `find_by_signature` | Search by parameter count, return type, modifiers |
| `find_entry_points` | Main functions, HTTP handlers, CLI commands, event handlers |
| `find_implementors` | Find all functions registered as ops-struct callbacks |
| `find_related_tests` | Tests that exercise a given function |
| `traverse_graph` | Custom graph traversal with edge/node type filters |

### Indexing (3)

| Tool | What it does |
|------|-------------|
| `reindex_workspace` | Full or incremental workspace re-index |
| `index_files` | Add or update specific files without a full re-index |
| `index_directory` | Add a directory to the graph alongside existing data |

### Memory (7)

Persistent AI context across sessions: debugging insights, architecture decisions, known issues.

| Tool | What it does |
|------|-------------|
| `memory_store` / `memory_get` / `memory_search` | Store, retrieve, and search memories (BM25 + semantic) |
| `memory_context` | Get memories related to a file or function |
| `memory_list` / `memory_invalidate` / `memory_stats` | Browse, retire, monitor |

Pairs well with [Tempera](https://github.com/anvanster/tempera), an episodic memory system that captures transferable debugging strategies and solutions across projects. CodeGraph's memory tools store project-scoped notes, while Tempera captures cross-project BKMs (best known methods) that improve over time.

### PR / change analysis (1)

| Tool | What it does |
|------|-------------|
| `pr_context` | **One-call PR review.** Runs a git diff against the base branch, looks up the changed functions in the graph, and reports: blast radius (callers), test coverage and gaps, affected modules, diff-aware change classification (signature vs. body), stale-doc warnings, complexity, a commit-message hint, and reviewers suggested via git blame. |

### Docs (7)

Persistent project documentation: index design docs, search them semantically, verify that code matches the design, and generate architecture docs from the code graph.

| Tool | What it does |
|------|-------------|
| `index_markdown` | Indexes local `.md` files (ARCHITECTURE.md, API_DESIGN.md, etc.) into a persistent docs store. Uses heading-tree chunking with leaf-node embeddings. |
| `search_docs` | Semantic search over indexed docs; returns matching passages with heading-path breadcrumbs |
| `list_doc_sources` | Lists all indexed source files |
| `remove_doc_source` | Removes all indexed chunks from a source file |
| `verify_design` | Cross-checks doc claims against the code graph. `direction=forward` (docs→code), `reverse` (code→docs), or `both` |
| `design_gaps` | Finds identifiers described in the docs but not yet implemented in code, building a TODO list from the spec |
| `generate_architecture_doc` | Auto-generates a structured ARCHITECTURE.md from the live code graph (modules, hot paths, complexity, circular dependencies) |

All tool names are prefixed with `codegraph_` (for example `codegraph_get_ai_context`). Symbol-specific tools accept either `uri` + `line` or a `nodeId` taken from `symbol_search` results.

### Usage examples

**Index a design doc and search it:**

```
codegraph_index_markdown(path: "/projects/myapp/docs/ARCHITECTURE.md")
codegraph_search_docs(query: "how does the auth module handle JWT refresh?")
```

**Check that the code matches the design:**

```
codegraph_verify_design(source: "/projects/myapp/docs/ARCHITECTURE.md", direction: "forward")
// → "132/132 identifiers verified, 0 gaps"
```

**Find what the docs describe but the code does not yet implement:**

```
codegraph_design_gaps(source: "/projects/myapp/docs/API_DESIGN.md")
// → "4 of 12 identifiers not found in code: PaymentService, RefundHandler, ..."
```

**Generate an architecture doc from the code graph:**

```
codegraph_generate_architecture_doc(scope: "src/", topN: 5)
// → Markdown with modules, complexity hotspots, hot paths, circular deps
```

**Save a debugging insight for future sessions:**

```
codegraph_memory_store(
  kind: "debug_context",
  title: "Nginx body size limit",
  content: "The /upload endpoint fails on payloads > 1MB...",
  problem: "API returns 500 on large uploads",
  solution: "Increase nginx client_max_body_size to 10M",
  agentSource: "claude"
)
```

**Get AI context with graph compression stats + design-doc enrichment:**

```
codegraph_get_ai_context(uri: "file:///projects/myapp/src/auth.rs", line: 42, intent: "modify")
// → Code context + graphStats: {entitiesInGraph: 13555, entitiesTraversed: 47, entitiesKept: 8}
// → design_context section from indexed docs mentioning "auth"
```

**Review a PR: blast radius, test gaps, stale docs, and reviewers in one call:**

```
codegraph_pr_context(baseBranch: "main")
// → "PR changes 4 files (+263/-77, 12 functions). 37 direct callers, 8 tests, 3 untested. Risk: medium."
// → test_gaps: [refresh_token, revoke_session] - functions with 0 test callers
// → stale_docs: ["auth.rs described in ARCHITECTURE.md > Authentication - doc may need updating"]
// → suggested_reviewers: [{author: "anvanster", lines_owned: 3200}]
// → commit_hint: "feat(mcp): "
```

**Narrow the tool set for a chat session:**

```
codegraph-server --mcp --profile=core
# Only 8 tools: search + symbol info + AI context
```

### CodeGraph Pro

Additional tools available in [CodeGraph Pro](https://codegraph.astudioplus.com/pro):

| Tool | What it does |
|------|-------------|
| `scan_security` | Security vulnerability scan: 40+ dangerous-function patterns, source-to-sink taint tracking, auth coverage for HTTP endpoints (7 languages/frameworks), architecture-layer violations, weak crypto, hardcoded secrets |
| `analyze_coupling` | Module coupling metrics and instability scores |
| `find_unused_code` | Dead-code detection with confidence scores |
| `find_duplicates` | Detects duplicate and near-duplicate functions |
| `find_similar` / `cluster_symbols` / `compare_symbols` | Embedding-based code similarity analysis |
| `cross_project_search` | Search across all indexed projects |
| `mine_git_history` / `mine_git_history_for_file` / `search_git_history` | Git history mining and semantic search |
| `security_control_flow` | Maps every execution path through a function, e.g. "can this return without passing authentication?" |
| `security_trace_data_flow` | Traces a variable's full lifecycle from creation to last use, e.g. "does user input reach this SQL query?" |
| `security_generate_sbom` | Generates a CycloneDX SBOM from 8 lockfile formats |
| `security_audit_deps` | OSV vulnerability check of dependencies |
| `security_check_unchecked_returns` / `_resource_leaks` / `_misconfig` / `_input_validation` / `_error_exposure` | 5 heuristic analyzers covering about 80% of the CWE Top 25 |
| `security_scan_iac` | Docker / Kubernetes / Terraform misconfiguration scan |
| `security_check_licenses` | Lockfile license-policy enforcement (detects copyleft) |
| `security_check_secrets_entropy` | Hardcoded-secret detection via Shannon entropy |
| `security_detect_injection` | Focused SQL/XSS/cmd/path/deser/template injection detection (20 patterns) |
| `security_check_search_path` | Detects untrusted search paths / DLL hijacking (CWE-426/CWE-427) |
| `security_check_crypto` | Crypto misuse: weak ciphers/hashes/PRNGs/keys, static IVs, timing-leaking comparisons (CWE-208/326-330/338/916, 35 patterns) |
| `security_export_sarif` | Aggregates findings into a SARIF 2.1.0 export (GitHub Code Scanning, GitLab SAST) |

**Cross-cutting features (all `security_check_*` tools):**

- `include_tests` / `treat_as_production`: first-class control over skipping test, example, and vendored code
- `check_compile_gates`: C/C++ findings inside `#ifdef X` are tagged DEFENSIVE_GATED_OFF when X is not defined by CMake/Cargo/Makefile
- Honors 25 suppression markers (`# nosec`, `// NOLINT`, `// codeql[ignore]`, `# rubocop:disable`, etc.) at line and function level
- Per-scan telemetry block: `path_filter` (checked/matched/skipped) + `compile_gate` (gated_off count)

## Languages

38 languages are parsed via tree-sitter, with support for functions, classes, imports, call graphs, complexity metrics, dependency graphs, symbol search, and impact analysis:

| Category | Languages |
|---|---|
| **Systems** | C, C++, Rust, Zig, Objective-C |
| **JVM** | Java, Kotlin, Scala, Groovy, Clojure |
| **Web/Scripting** | TypeScript/JS, Python, Ruby, PHP, Perl, Lua, Elixir, Elm |
| **Web/Style** | CSS |
| **Mobile** | Swift, Dart |
| **Functional** | Haskell, OCaml, Julia, Erlang, Elm, Clojure |
| **Enterprise** | C#, COBOL, Fortran, Go |
| **Blockchain** | Solidity |
| **Shell/Config** | Bash, HCL/Terraform, TOML, YAML |
| **Hardware** | Verilog/SystemVerilog, Tcl |
| **Data Science** | R, Julia |

HTTP handler detection: Python (FastAPI/Flask/Django), TypeScript (NestJS), Java (Spring/JAX-RS), Go (stdlib/Gin/Echo/Fiber), C# (ASP.NET), Ruby (Rails), PHP (Laravel/Symfony).

## Architecture

```
MCP Client (Claude, Cursor, ...)    VS Code Extension
        |                                  |
   MCP (stdio)                       LSP Protocol
        |                                  |
        └───────────┐          ┌───────────┘
                    ▼          ▼
          ┌─────────────────────────────┐
          │      codegraph-server       │
          ├─────────────────────────────┤
          │  38 tree-sitter parsers     │
          │  Semantic graph engine      │
          │  AI query engine (BM25)     │
          │  Memory layer (RocksDB)     │
          │  Docs store (RocksDB+HNSW)  │
          │  Full-body embeddings (BGE) │
          │  HNSW vector index          │
          └─────────────────────────────┘
```

A single Rust binary serves both the MCP and LSP protocols.

- **Indexing**: about 60 files/sec. Files are re-indexed incrementally when they change, using FNV-1a content hashing.
- **Persistence**: the graph and embeddings persist to `~/.codegraph/graph.db` (RocksDB). Restarts are instant: no re-parsing, no re-embedding.
- **Queries**: under 100 ms. Cross-file import and call resolution is done at index time.
- **Embeddings**: full function bodies (captured at parse time, zero disk I/O). Vectors are stored in RocksDB alongside the graph. The model downloads automatically on first run.

## Building from source

```
git clone https://github.com/codegraph-ai/codegraph
cd codegraph
cargo build --release -p codegraph-server     # Rust server
cd vscode && npm install && npm run esbuild   # VS Code extension
npx @vscode/vsce package                      # VSIX
```

Requires Rust stable, Node.js 18+, and VS Code 1.90+.

## Support this project

CodeGraph is free, open source, and maintained by a solo developer. If it saves you time, please consider [sponsoring on GitHub](https://github.com/sponsors/anvanster); it helps keep the project alive and growing.

## License

Apache-2.0
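
## Appendix: a sketch of content-hash change detection

The Architecture section says incremental re-indexing is driven by FNV-1a content hashing. The Python sketch below illustrates that general idea only; it is not CodeGraph's implementation, which is a Rust binary. The 64-bit variant of FNV-1a, the `changed_files` helper, and the in-memory `known` map are all assumptions made for illustration.

```python
# Illustrative only: hash each file's bytes with FNV-1a and re-index a file
# only when its hash differs from the one recorded on the previous pass.
# The 64-bit width and the helper names are assumptions, not CodeGraph's API.

FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of data."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte                          # FNV-1a: XOR the byte in first...
        h = (h * FNV_PRIME_64) & MASK_64   # ...then multiply, wrapping at 64 bits
    return h


def changed_files(contents: dict[str, bytes], known: dict[str, int]) -> list[str]:
    """Return paths whose content hash differs from the stored hash.

    `known` maps path -> last seen hash and is updated in place.
    """
    changed = []
    for path, data in contents.items():
        digest = fnv1a_64(data)
        if known.get(path) != digest:
            changed.append(path)
            known[path] = digest
    return changed


if __name__ == "__main__":
    known: dict[str, int] = {}
    files = {"src/auth.rs": b"fn login() {}"}
    print(changed_files(files, known))   # first pass: the file is new
    print(changed_files(files, known))   # same bytes: nothing to re-index
    files["src/auth.rs"] = b"fn login() { audit(); }"
    print(changed_files(files, known))   # content changed: re-index it
```

Hashing content rather than comparing modification times means a file that is touched or checked out again without real changes is not re-parsed. In a persistent setup the `known` map would live in a store next to the graph instead of in memory.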

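Tags: AI-assisted programming, CNCF graduated project, MCP, SOC Prime, code understanding, code knowledge graph, visual interface, client-side encryption, developer tools, error-based detection, static code analysis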