CristianoCiuti/reponova

GitHub: CristianoCiuti/reponova

Turns a codebase into a knowledge graph and exposes 11 MCP tools, giving AI coding assistants a global, structured understanding of the code.




🤖 RepoNova 🔭

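Turn your codebase into a knowledge graph. Query it with AI.

A knowledge graph builder and MCP server designed for AI coding assistants.
It extracts symbols, relationships, and semantics from your code, then exposes the whole structure
as 11 graph tools to any MCP-compatible agent.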

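## Why RepoNova?

AI agents can only read one file at a time. They cannot see how a codebase fits together: which functions call what, which modules depend on which, and where the architectural bottlenecks are.

**RepoNova solves this.** It builds a persistent knowledge graph of your entire codebase (or of several repositories) and gives your AI agent 11 specialized tools to query it: search, impact analysis, shortest path, semantic similarity, community detection, and more.

### What makes it different

- **Zero external dependencies**: no Python, no Docker, no database server. Pure Node.js
- **Multi-repo support**: build one graph that spans multiple repositories
- **Smart incremental builds**: SHA256 file hashing, config change detection, semantic graph diffing, selective subsystem rebuilds
- **Local LLM enrichment**: an optional local LLM for richer community summaries and node descriptions (runs on CPU)
- **11 MCP tools**: from text search to weighted Dijkstra, from semantic similarity to structural queries
- **Works with any MCP client**: OpenCode, Cursor, Claude Code, VS Code Copilot

## How It Works

```
Your Codebase                 reponova build                    AI Agent
─────────────                 ──────────────                    ────────
Python / TS / JS              1. tree-sitter AST parsing        graph_search
Markdown / Docs  ──────────►  2. Symbol + edge extraction  ──►  graph_impact
Diagrams / SVG                3. Louvain communities            graph_path
Multi-repo                    4. TF-IDF / ONNX embeddings       graph_similar
                              5. Community summaries
                              6. HTML visualizations            ... (11 tools)
```

## Installation

```
npm install -g reponova
```

Or run it directly without installing:

```
npx reponova
```

Requires **Node.js >= 18**.

## Quick Start

### 1. Install into your editor

```
reponova install --target opencode
```

This registers the MCP server, installs the hook/skill, and writes a default `reponova.yml` config. Supported editors: `opencode`, `cursor`, `claude`, `vscode`

### 2. Build the knowledge graph

```
reponova build
```

### 3. Start using it

The MCP server starts automatically with your editor. Your AI agent can now use all 11 graph tools.

```
You: "What would be the impact of refactoring the authenticate function?"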
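Agent: [calls graph_impact] → shows upstream/downstream blast radius across repos
```

## MCP Tools

11 specialized tools are exposed over MCP (stdio). Each one is designed for a specific query pattern.

| Tool | Description |
|------|-------------|
| `graph_search` | 🔍 Full-text search across nodes. Filter by type or repo. Expand results with BFS/DFS. |
| `graph_impact` | 💥 Blast radius analysis: find all upstream/downstream dependencies of any symbol. |
| `graph_path` | 🛤️ Weighted shortest path (Dijkstra) between two symbols. Filter by edge type. |
| `graph_explain` | 📋 Full details for a node: edges, community, centrality metrics, signature, docstring. |
| `graph_similar` | 🧲 Semantic similarity search using TF-IDF or ONNX vector embeddings. |
| `graph_context` | 🧠 Smart context builder with a token budget: combines search + vectors + graph expansion. |
| `graph_community` | 🏘️ List all nodes in a community, ranked by degree centrality. |
| `graph_hotspots` | 🔥 Core nodes / architectural bottlenecks: the most connected symbols in the graph. |
| `graph_outline` | 🗂️ Tree-sitter code outline: functions, classes, and imports with signatures and line ranges. |
| `graph_docs` | 📄 Search documentation nodes (markdown, text, rst). |
| `graph_status` | 📊 Graph metadata: node/edge counts, repos, build timestamp, reponova version, build config. |

## Agent Workflows

RepoNova is designed as a **structured memory layer** for AI coding agents. Here is how to use it effectively in an agent workflow.

### Recommended agent patterns

**Before any refactor:**

```
1. graph_impact "TargetFunction"          → understand blast radius
2. graph_path "ModuleA" "ModuleB"         → see dependency chain
3. graph_community 5                      → understand the module cluster
4. Make changes with full structural awareness
```

**When exploring unfamiliar code:**

```
1. graph_status                           → understand graph size and repos
2. graph_hotspots                         → identify architectural pillars
3. graph_search "authentication"          → find entry points
4. graph_explain "Function:authenticate"  → deep dive
```

**When answering "where is X used?":**

```
1. graph_search "X"                       → find the node
2. graph_impact "X" direction=downstream  → who depends on it
3. graph_similar "X"                      → find semantically related code
```

### Integration with editor skills

The `reponova install` command installs a **skill file** and a **hook/rule** that teach your AI agent when and how to use each tool. The agent then reaches for the graph tools automatically whenever it needs structural information.

| Editor | MCP config | Hook / rule | Skill | Config file |
|--------|-----------|-------------|-------|--------|
| OpenCode | `.opencode/opencode.json` | `.opencode/plugins/reponova.js` | `.opencode/skills/reponova/SKILL.md` | `.opencode/reponova.yml` |
| Cursor | `.cursor/mcp.json` | `.cursor/rules/reponova.mdc` | *(embedded in the rule)* | `.cursor/reponova.yml` |
| Claude Code | `claude mcp add` | `.claude/settings.json` | `.claude/skills/reponova/SKILL.md` | `.claude/reponova.yml` |
| VS Code | `.vscode/mcp.json` | `.github/copilot-instructions.md` | *(embedded in the instructions)* | `.vscode/reponova.yml` |

### Keeping the graph up to date

```
# Incremental rebuild: only processes changed files
reponova build

# Full clean rebuild
reponova build --force
```

### How incremental builds work

RepoNova's incremental builds go beyond simple file-change detection. They minimize redundant work at every stage of the pipeline:

| Layer | What it does | When it triggers |
|-------|-------------|-----------------|
| **File hashing** | SHA256 per file; only changed or new files are re-parsed. Deleted files are detected too. | Every incremental build |
| **Config fingerprint** | Compares a hash of the build-relevant config fields across builds. | When `reponova.yml` changes between builds |
| **Selective subsystem rebuild** | Re-runs only the subsystems affected by a config change (e.g., switching `embeddings.method` re-runs embeddings but does not re-parse). | Config-only changes (no file changes) |
| **Semantic graph hash** | SHA256 of the full graph structure (sorted nodes + edges). Skips downstream steps when the graph is unchanged. | After graph construction |
| **Incremental embeddings** | Tracks the text content of each node. Only re-embeds nodes whose text changed. | Every incremental build with embeddings enabled |
| **Outline hashing** | SHA256 per outlined source file. Skips outline regeneration for unchanged files. | Every incremental build with outlines enabled |
| **Stale artifact cleanup** | Deletes artifacts that a config change has invalidated (e.g., removes `tfidf_idf.json` after switching to ONNX). | After config change detection |
| **Early return** | If no files changed, no files were deleted, and the config is unchanged, exits immediately. | Every incremental build |

The build config fingerprint is stored in the `graph.json` metadata, so it persists across builds without an extra file.

## CLI Reference

### `reponova install`

Sets up editor integration. Creates the MCP config, hook, skill, and `reponova.yml`.

```
reponova install --target [--graph ]
```

| Option | Required | Description |
|--------|----------|-------------|
| `--target` | **Yes** | Editor to configure. Values: `opencode`, `cursor`, `claude`, `vscode` |
| `--graph` | No | Path to the `reponova-out/` directory. Default: `./reponova-out` |

### `reponova build`

Builds (or rebuilds) the knowledge graph.

```
reponova build [--config ] [--force]
```

| Option | Required | Description |
|--------|----------|-------------|
| `--config` | No | Path to `reponova.yml`. Default: auto-detected (see [Config Resolution](#config-resolution)) |
| `--force` | No | Delete the output directory and rebuild from scratch. Default: `false` |

**Build pipeline:**

1. Detect source files, docs, and diagrams (centralized glob matching via [picomatch](https://github.com/micromatch/picomatch))
2. Diff files against the previous build: detect changed, new, and **deleted** files
3. Compare the build config fingerprint: detect config-only changes (e.g., switching `embeddings.method`)
4. Parse changed files with tree-sitter WASM: extract symbols, calls, imports, inheritance (unchanged files reuse cached extractions)
5. Build a directed graph with cross-file / cross-repo edges
6. Compute the semantic graph hash: skip downstream steps when the graph structure is unchanged
7. Clean up stale artifacts invalidated by a config change (e.g., old TF-IDF files after switching to ONNX)
8. Detect communities (Louvain algorithm)
9. Generate embeddings incrementally: only re-embed nodes whose text content changed (TF-IDF or ONNX MiniLM)
10. Selectively regenerate the subsystems affected by config changes (embeddings, summaries, descriptions, outlines, HTML)
11. Generate community summaries + node descriptions (algorithmic or LLM-enhanced)
12. Generate the interactive visualizations `graph.html` and `graph_communities.html`
13. Generate the SQLite search index (`graph_search.db`)
14. Generate code outlines (SHA256 hash per file; unchanged files are skipped) and `report.md`

### `reponova index`

Regenerates the SQLite search index (`graph_search.db`) from an existing `graph.json`. Useful when the index is corrupted or deleted and a full rebuild is not needed.

```
reponova index [--graph ]
```

| Option | Required | Description |
|--------|----------|-------------|
| `--graph` | No | Path to the `reponova-out/` directory. Default: auto-detected |

### `reponova outline`

Pre-computes tree-sitter code outlines for files matching `outlines.patterns`. It runs the same outline logic as `reponova build`, but standalone, which is useful for regenerating outlines without re-running extraction or graph construction.

```
reponova outline [--config ] [--graph ] [--force]
```

| Option | Required | Description |
|--------|----------|-------------|
| `--config` | No | Path to `reponova.yml`. Default: auto-detected |
| `--graph` | No | Path to the `reponova-out/` directory. Default: resolved from the config's `output` field |
| `--force` | No | Regenerate all outlines, ignoring the SHA256 file hashes. Default: `false` |

### `reponova mcp`

Starts the MCP server over stdio. Usually launched automatically by the editor.

```
reponova mcp [--graph ]
```

| Option | Required | Description |
|--------|----------|-------------|
| `--graph` | No | Path to the `reponova-out/` directory. Default: auto-detected |

### `reponova models`

Manages local AI models (ONNX embeddings, LLM). See [Models](#models) for details.

```
reponova models status      # Show configured and cached models
reponova models download    # Pre-download all models needed by config
reponova models remove      # Remove a specific cached model
reponova models clear       # Remove all cached models
```

| Option | Required | Description |
|--------|----------|-------------|
| `--config` | No | Path to `reponova.yml`. Default: auto-detected |
| `--cache-dir` | No | Override the model cache directory |

### `reponova check`

Validates the graph installation and build integrity, and reports statistics.

```
reponova check [--graph ]
```

| Option | Required | Description |
|--------|----------|-------------|
| `--graph` | No | Path to the `reponova-out/` directory. Default: auto-detected |

Checks performed:

- Graph file exists and can be loaded
- Node/edge counts and repo list
- Build metadata is present (`build_config` fingerprint, reponova version)
- Embedding artifact consistency (TF-IDF IDF file, ONNX vectors)
- Warns if the embedding method in the config does not match the built artifacts

## Supported Languages

### Extraction (AST parsing + graph building)

| Language | Extensions | Parser | Node types |
|----------|-----------|--------|------------|
| Python | `.py`, `.pyw` | tree-sitter-python (WASM) | `function`, `class`, `method`, `module`, `constant` |
| Markdown | `.md`, `.txt`, `.rst` | Built-in | `document`, `section`, `heading` |
| Diagrams | `.puml`, `.plantuml`, `.svg` | Built-in | `diagram`, `component` |

### Outlines (tree-sitter code outlines)

| Language | Extensions | Outline support |
|----------|-----------|-----------------|
| Python | `.py`, `.pyw` | Full: functions, classes, methods, imports, signatures, decorators, docstrings |

### Edge types

Every edge in the graph has a type that describes the relationship:

| Edge type | Description | Example |
|-----------|-------------|---------|
| `calls` | Function/method call | `process_data` → `validate_input` |
| `imports` | Module-level import | `api.py` → `models.py` |
| `imports_from` | Named import of a specific symbol | `api.py` → `UserModel` |
| `extends` | Class inheritance | `AdminUser` → `BaseUser` |
| `contains` | Module contains symbol | `auth.py` → `login()` |
| `contains_section` | Document contains section | `README.md` → `Installation` |
| `method` | Class contains method | `UserService` → `get_user()` |

## Configuration

### Config Resolution

The config file is auto-detected from the following locations (first match wins):

1. An explicit `--config` argument
2. `reponova.yml` in the project root
3. `.opencode/reponova.yml`
4. `.cursor/reponova.yml`
5. `.claude/reponova.yml`
6. `.vscode/reponova.yml`

All paths inside the config are **relative to the location of the config file**. When the file lives in an editor directory (e.g. `.opencode/`), use `../` to refer to the project root.

### Pattern Resolution

All glob patterns (`build.patterns`, `build.exclude`, `docs.patterns`, `outlines.patterns`, etc.) are matched against **workspace-relative paths**. The exact form of these paths depends on the number of repos.

#### Single repo

With a single repo, file paths are relative to the repo root, with no prefix:

```
src/core.py
src/utils/helpers.py
tests/test_core.py
```

Patterns work the way you would expect:

```
repos:
  - name: my-project
    path: .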
build:
  patterns: ["src/**/*.py"]     # matches src/core.py ✓
  exclude: ["tests/**"]         # excludes tests/test_core.py ✓
```

#### Multi-repo

With multiple repos, every file path is **prefixed with the repo name from the config**:

```
api/src/routes.py        # ← "api" comes from repos[].name
api/src/handlers.py
core/src/models.py       # ← "core" comes from repos[].name
core/src/db.py
```

Patterns are tested against **both forms**, the full prefixed path and the repo-relative path, so the same pattern works in single-repo and multi-repo setups:

```
repos:
  - name: api
    path: ../services/api
  - name: core
    path: ../services/core
build:
  patterns: ["src/**/*.py"]     # matches api/src/routes.py, api/src/handlers.py,
                                # core/src/models.py, core/src/db.py ✓ (via repo-relative)
  exclude: ["**/test_*.py"]     # works across all repos
```

#### Filtering a specific repo

Use the repo name as a path prefix to target a single repo:

```
build:
  exclude:
    - "api/src/generated/**"    # excludes only in the api repo
    - "**/migrations/**"        # excludes in all repos
outlines:
  patterns:
    - "core/src/**/*.py"        # outlines only for the core repo
```

This works because the full workspace path is always `/`. The repo name is the `name` field from your `repos` config, not the directory name on disk.

### Full configuration reference

Every field, every valid value, every default.

```
# ──────────────────────────────────────────────────────────────────────────────
# reponova.yml - full configuration reference
# ──────────────────────────────────────────────────────────────────────────────

# Where build output (graph.json, graph.html, graph_search.db, etc.) is written
# Type: string
# Default: "reponova-out"
output: ../reponova-out

# ── Repositories ──────────────────────────────────────────────────────────────
# List of repositories to include in the build.
# Each repo needs a unique name and a path (relative to this config file).
repos:
  - name: api-service          # string - unique identifier for this repo
    path: ../services/api      # string - path to repo root (relative to this file)
  - name: core-lib
    path: ../services/core

# ── Centralized model management ──────────────────────────────────────────────
# Shared settings for all models (LLM, ONNX embeddings).
# Individual features (community_summaries, node_descriptions) can specify
# their own model via the `model` field. These settings apply to all of them.
models:
  # Directory to cache downloaded models (ONNX embeddings + LLM weights)
  # Type: string
  # Default: "~/.cache/reponova/models"
  cache_dir: ~/.cache/reponova/models

  # GPU acceleration backend for LLM inference
  # Values: "auto" | "cpu" | "cuda" | "metal" | "vulkan"
  #   - auto:   auto-detect best available backend
  #   - cpu:    force CPU inference (slower but always works)
  #   - cuda:   NVIDIA GPU (requires CUDA drivers)
  #   - metal:  Apple Silicon GPU (macOS only)
  #   - vulkan: Cross-platform GPU (AMD, Intel, NVIDIA)
  # Default: "auto"
  gpu: auto

  # Number of CPU threads for LLM inference
  # Type: number
  # Default: 0 (auto-detect based on available cores)
  threads: 0

  # Automatically download models on first use
  # Type: boolean
  # Default: true
  download_on_first_use: true

# ── Build options ─────────────────────────────────────────────────────────────
build:
  # Glob patterns for source code files to include
  # Type: string[]
  # Default: [] (empty = auto-detect by file extension using registered extractors)
  # Example: ["src/**/*.py", "lib/**/*.ts"]
  patterns: []

  # Glob patterns to exclude from source code detection
  # Type: string[]
  # Default: []
  # Example: ["**/generated/**", "**/*.test.ts", "**/vendor/**"]
  exclude: []

  # Exclude common non-source directories from all file detection
  # (source code, documentation, diagrams, and outlines).
  # When true, the following directories are skipped at any depth:
  #   node_modules, __pycache__, .git, .svn, .hg, venv, .venv, env, .env, .tox,
  #   site-packages, dist, build, .eggs, .mypy_cache, .pytest_cache, .ruff_cache,
  #   target, bin, obj
  # Set to false if you need to index files inside these directories
  # (e.g. vendored code in node_modules). You can still exclude specific
  # directories via the `exclude` patterns above.
  # Type: boolean
  # Default: true
  exclude_common: true

  # Incremental builds: only re-process files whose SHA256 hash changed
  # Type: boolean
  # Default: true
  incremental: true

  # Generate interactive HTML visualizations (graph.html + graph_communities.html)
  # Type: boolean
  # Default: true
  html: true

  # Minimum node degree to include in HTML visualization
  # Useful for large graphs: filters out leaf nodes to reduce clutter
  # Type: integer (>= 1)
  # Default: not set (include all nodes)
  # html_min_degree: 3

  # ── Documentation Extraction ──────────────────────────────────────────────
  docs:
    # Enable/disable documentation extraction
    # Type: boolean
    # Default: true
    enabled: true

    # Glob patterns for documentation files (relative to repo root)
    # Type: string[]
    # Default: ["**/*.md", "**/*.txt", "**/*.rst"]
    patterns:
      - "**/*.md"
      - "**/*.txt"
      - "**/*.rst"

    # Glob patterns to exclude from documentation extraction
    # Type: string[]
    # Default: ["**/CHANGELOG.md", "**/node_modules/**"]
    # Note: if your output dir is inside the workspace (e.g. output: ./reponova-out),
    # add it here to prevent generated files from being re-ingested on rebuild.
    exclude:
      - "**/CHANGELOG.md"
      - "**/node_modules/**"
      - "reponova-out/**"

    # Maximum file size in KB: files larger than this are skipped
    # Type: number
    # Default: 500
    max_file_size_kb: 500

  # ── Diagram / Image Extraction ────────────────────────────────────────────
  images:
    # Enable/disable diagram extraction
    # Type: boolean
    # Default: true
    enabled: true

    # Glob patterns for diagram files (relative to repo root)
    # Type: string[]
    # Default: ["**/*.puml", "**/*.plantuml", "**/*.svg"]
    patterns:
      - "**/*.puml"
      - "**/*.plantuml"
      - "**/*.svg"

    # Glob patterns to exclude
    # Type: string[]
    # Default: ["**/node_modules/**"]
    exclude:
      - "**/node_modules/**"

    # Parse PlantUML files to extract components and relationships
    # Type: boolean
    # Default: true
    parse_puml: true

    # Extract text content from SVG files
    # Type: boolean
    # Default: true
    parse_svg_text: true

  # ── Embeddings ────────────────────────────────────────────────────────────
  # Vector representations for semantic search (graph_similar, graph_context)
  embeddings:
    # Enable/disable embedding generation
    # Type: boolean
    # Default: true
    enabled: true

    # Embedding method
    # Values: "tfidf" | "onnx"
    #   - tfidf: Feature-hashed TF-IDF (384-dim). Fast (milliseconds). No model download.
    #   - onnx:  MiniLM-L6-v2 via ONNX Runtime (384-dim). More accurate. ~86MB model download.
    # Default: "tfidf"
    method: tfidf

    # ONNX model name (only used when method: "onnx")
    # Must be a sentence-transformers/ model on HuggingFace with ONNX export
    # and BERT-compatible tokenizer. Dimensions must match 'dimensions' below.
    # See the "Models" section for compatible models and details.
    # Type: string
    # Default: "all-MiniLM-L6-v2"
    model: all-MiniLM-L6-v2

    # Embedding vector dimensions
    # Type: number
    # Default: 384
    dimensions: 384

    # Batch size for ONNX inference
    # Type: number
    # Default: 128
    batch_size: 128

  # ── Community Summaries ───────────────────────────────────────────────────
  # Natural-language summaries for each detected community (cluster of related symbols).
  # Independent from node descriptions: you can enable one without the other.
  community_summaries:
    # Enable/disable community summary generation
    # Type: boolean
    # Default: true
    enabled: true

    # Maximum number of communities to summarize
    # Type: integer (>= 0)
    # Default: 0 (no limit, summarize all communities)
    # Communities are sorted by size (largest first). When max_number > 0,
    # only the top N largest communities are summarized.
    # Communities with fewer than 3 nodes are always excluded.
    max_number: 0

    # LLM model for richer summaries (optional)
    # Uses hf: URI notation, see the "Models" section for details.
    # Type: string | null
    # Default: null (algorithmic summaries, still useful, just less prose)
    # model: "hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M"

    # Context window size for LLM inference (only used when model is set)
    # Type: number
    # Default: 512
    context_size: 512

  # ── Node Descriptions ─────────────────────────────────────────────────────
  # Natural-language descriptions for high-degree (important) nodes.
  # Independent from community summaries: you can enable one without the other.
  node_descriptions:
    # Enable/disable node description generation
    # Type: boolean
    # Default: true
    enabled: true

    # Degree threshold for node description generation
    # Type: number (0.0 – 1.0)
    # Default: 0.8
    # Meaning: the top (1 - threshold) fraction of nodes by degree get descriptions.
    #   - 0.8 = top 20% of nodes
    #   - 0.5 = top 50% of nodes
    #   - 0.0 = all nodes (expensive!)
    #   - 1.0 = no nodes
    threshold: 0.8

    # LLM model for richer descriptions (optional)
    # Uses hf: URI notation, see the "Models" section for details.
    # Type: string | null
    # Default: null (algorithmic descriptions)
    # model: "hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M"

    # Context window size for LLM inference (only used when model is set)
    # Type: number
    # Default: 512
    context_size: 512

# ── Outlines ──────────────────────────────────────────────────────────────────
# Tree-sitter code outlines: functions, classes, and imports with signatures.
# The language is auto-detected from the file extension (no need to specify it).
outlines:
  # Enable/disable outline generation
  # Type: boolean
  # Default: true
  enabled: true

  # Glob patterns for files to outline (relative to repo root)
  # Type: string[]
  # Default: ["src/**/*.ts", "src/**/*.py", "src/**/*.js"]
  patterns:
    - "src/**/*.py"
    - "src/**/*.ts"
    - "src/**/*.js"

  # Glob patterns to exclude from outline generation
  # Type: string[]
  # Default: ["**/node_modules/**", "**/.git/**", "**/dist/**"]
  exclude:
    - "**/__pycache__/**"
    - "**/node_modules/**"
    - "**/.git/**"
    - "**/dist/**"

# ── Server ────────────────────────────────────────────────────────────────────
# MCP server options (reserved for future use)
# Type: object
# Default: {}
server: {}
```

### Minimal config

Most fields have sensible defaults. A minimal single-repo config looks like this:

```
output: ../reponova-out
repos:
  - name: my-project
    path: ..
```

### Multi-repo config

```
output: ../reponova-out
repos:
  - name: api
    path: ../services/api
  - name: core
    path: ../services/core
  - name: shared
    path: ../libs/shared
```

### LLM-enhanced config

For richer, natural-language community summaries and node descriptions:

```
output: ../reponova-out
repos:
  - name: my-project
    path: ..
models:
  gpu: auto                     # auto-detect GPU, falls back to CPU
  download_on_first_use: true
build:
  community_summaries:
    enabled: true
    model: "hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M"   # ~350MB download
  node_descriptions:
    enabled: true
    threshold: 0.5              # describe top 50% nodes by degree
    model: "hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M"   # same model, auto-shared
```

### File filtering config

Control which source files are included in the graph:

```
output: ../reponova-out
repos:
  - name: my-project
    path: ..
build:
  patterns:                # only include files matching these globs
    - "src/**/*.py"
    - "lib/**/*.ts"
  exclude:                 # exclude files matching these globs
    - "**/test/**"
    - "**/tests/**"
    - "**/migrations/**"
    - "**/*.generated.ts"
```

## Models

RepoNova uses two types of AI models. Both are downloaded automatically on first use and cached locally. No API keys and no cloud services are required.

### ONNX embeddings

Sentence-transformer models used for semantic similarity search (`graph_similar`, `graph_context`).

| Property | Value |
|----------|-------|
| **Config field** | `build.embeddings.model` |
| **Notation** | Plain model name (e.g., `all-MiniLM-L6-v2`) |
| **Source** | `huggingface.co/sentence-transformers/{model}` |
| **Cache path** | `{models.cache_dir}/{model-name}/` |
| **Downloaded files** | `model.onnx`, `vocab.txt`, `tokenizer_config.json` |
| **When needed** | `build.embeddings.method: onnx` |

Compatible models (all 384-dimensional; must match `embeddings.dimensions`):

| Model | Size | Notes |
|-------|------|-------|
| `all-MiniLM-L6-v2` | ~86 MB | Default. Good balance of speed and quality |
| `all-MiniLM-L12-v2` | ~130 MB | More accurate, slower |
| `paraphrase-MiniLM-L6-v2` | ~86 MB | Optimized for paraphrase detection |
| `multi-qa-MiniLM-L6-cos-v1` | ~86 MB | Optimized for question answering |

Any model under the `sentence-transformers/` organization on HuggingFace that ships an ONNX export with a BERT-compatible tokenizer (WordPiece) should work. The `dimensions` config field **must** match the model's output dimensions.

### LLM (GGUF)

A local language model used to generate richer community summaries and node descriptions, powered by [node-llama-cpp](https://github.com/withcatai/node-llama-cpp).

| Property | Value |
|----------|-------|
| **Config fields** | `build.community_summaries.model`, `build.node_descriptions.model` |
| **Notation** | `hf:` URI (e.g., `hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M`) |
| **Format** | `hf:{user}/{repo}:{quantization}` |
| **Cache path** | `{models.cache_dir}/llm/` |
| **When needed** | When `community_summaries.model` or `node_descriptions.model` is set |
| **Dependency** | `node-llama-cpp` (optional peer dependency) |

When `community_summaries.model` and `node_descriptions.model` resolve to the same file, RepoNova shares a single engine instance, so memory use is not doubled.

### Model management CLI

```
reponova models status      # Show configured and cached models
reponova models download    # Pre-download all models needed by config
reponova models remove      # Remove a specific cached model
reponova models clear       # Remove all cached models
```

When `models.download_on_first_use: true` (the default), models are also downloaded automatically during `reponova build`. These CLI commands let you manage the cache independently of the build.

## Build Output

After running `reponova build`, the output directory contains:

```
reponova-out/
├── graph.json                  # Full graph: nodes, edges, community assignments, metadata
│                               #   metadata.build_config: config fingerprint for change detection
│                               #   nodes include: docstring, signature, bases (when available)
├── graph.html                  # Interactive visualization (vis.js): click, search, filter
├── graph_communities.html      # Community-focused visualization with summary labels
├── graph_search.db             # SQLite search index (sql.js WASM): structural queries
├── report.md                   # Build report: stats, hotspots, community breakdown
├── community_summaries.json    # Community summaries (algorithmic or LLM-enhanced)
├── node_descriptions.json      # Descriptions for high-degree nodes
├── tfidf_idf.json              # TF-IDF vocabulary weights (for query-time embedding)
├── vectors/                    # LanceDB vector store: semantic similarity search
│   └── (LanceDB internal files)    # fallback: vectors.json when @lancedb/lancedb unavailable
├── outlines/                   # Pre-computed code outlines per file
│   └── /.outline.json
└── .cache/                     # Incremental build cache (SHA256 content hashing)
    ├── hashes.json             # file path → SHA256 hex map
    ├── semantic-graph-hash.txt # SHA256 of graph structure (nodes + edges)
    ├── outline-hashes.json     # file path → SHA256 map for outline generation
    ├── node-texts.json         # node id → text hash map for incremental embeddings
    └── extractions/            # cached FileExtraction per file
        └── .json
```

Two storage engines serve different purposes:

- **SQLite** (`graph_search.db`): a structural index for exact lookups, graph traversal, and FTS. Used by `graph_search`, `graph_impact`, `graph_path`, `graph_explain`, and others.
- **LanceDB** (`vectors/`): a vector index for semantic similarity. Used by `graph_similar` and `graph_context`. When `@lancedb/lancedb` is not installed, it falls back to brute-force cosine similarity (JSON).

## Programmatic API

Use RepoNova as a library in your own Node.js tooling.

### Build API

Run the full build pipeline programmatically. This is useful for CI integration, custom tooling, or workflows that register custom extractors/languages before building.

```
import { build } from "reponova";

const result = await build("./reponova.yml");
console.log(`Built: ${result.nodeCount} nodes, ${result.edgeCount} edges`);
console.log(`Output: ${result.outputDir}`);
```

```
// Force rebuild (deletes output and rebuilds from scratch)
const result = await build("./reponova.yml", { force: true });
```

`build()` returns a `BuildResult`:

| Field | Type | Description |
|-------|------|-------------|
| `outputDir` | `string` | Absolute path to the output directory |
| `fileCount` | `number` | Number of source files processed |
| `nodeCount` | `number` | Number of nodes in the graph |
| `edgeCount` | `number` | Number of edges in the graph |
| `communityCount` | `number` | Number of communities detected |

If `configPath` is omitted, the config is auto-detected from the standard locations (see [Config Resolution](#config-resolution)).

### Runtime registration + build

Register custom extractors or outline languages **before** calling `build()`:

```
import {
  build,
  registerExtractor,
  registerOutlineLanguage,
} from "reponova";
import type { LanguageExtractor, LanguageSupport } from "reponova";

// 1. Register a custom extractor (graph building)
const myExtractor: LanguageExtractor = { /* ... */ };
registerExtractor(myExtractor);

// 2. Register outline support (graph_outline)
const myOutline: LanguageSupport = { /* ... */ };
registerOutlineLanguage("rust", ["rs"], myOutline);

// 3. Build: all registrations are picked up automatically
const result = await build("./reponova.yml");
```

### Query API

Once the build is complete, load and query the graph:

```
import {
  openDatabase,
  initializeSchema,
  populateDatabase,
  loadGraphData,
  searchNodes,
  analyzeImpact,
  findShortestPath,
  getNodeDetail,
} from "reponova";

// Load and index the graph
const graphData = loadGraphData("./reponova-out/graph.json");
const db = await openDatabase(":memory:");
initializeSchema(db);
populateDatabase(db, graphData);

// Search
const results = searchNodes(db, "authentication", { top_k: 5, type: "function" });

// Impact analysis
const impact = analyzeImpact(db, "Function:authenticate_user", { max_depth: 3 });

// Shortest path
const path = findShortestPath(db, graphData, "ModuleA", "ModuleB");

// Node detail
const detail = getNodeDetail(db, graphData, "Function:process_payment");
```

### Advanced API

```
import {
  ContextBuilder,
  loadConfig,
} from "reponova";

// Smart context assembly (search + vectors + graph expansion)
// `db` and `graphData` are the values created in the Query API example
const { config } = loadConfig("./reponova.yml");
const builder = new ContextBuilder(db, graphData, "./reponova-out");
await builder.initialize(config.build.embeddings);
const context = await builder.buildContext({
  query: "authentication flow",
  maxTokens: 4000,
});
```

## FAQ

### Do I need an API key?

No. Everything runs locally. The optional LLM is a local model (Qwen 0.5B): no cloud, no API key, and no data leaves your machine.

### How large are the models?

| Model | Size | When downloaded |
|-------|------|----------------|
| TF-IDF embeddings | None (computed in-process) | Never |
| ONNX embeddings | ~86 MB (MiniLM-L6-v2) | On the first build with `method: onnx` |
| LLM (Qwen 0.5B Q4_K_M) | ~350 MB | When `community_summaries.model` or `node_descriptions.model` is set |

### How long does a build take?

It depends on the size of the codebase. Rough benchmarks:

- Small project (500 files): ~5-10 seconds
- Medium project (5,000 files): ~30-60 seconds
- Large monorepo (20,000+ files): 2-5 minutes
- LLM summaries add ~2-3 seconds per community

### Can I use it without an editor?

Yes. Use the CLI (`reponova build`, `reponova check`) and the programmatic API. The MCP server is just one way to query the graph.

### Is TypeScript / JavaScript extraction supported?

The tree-sitter grammars are ready. The extractor implementations are on the roadmap, and contributions are welcome.

## Contributing

Contributions are welcome.

### Adding language support (extraction)

Add an extractor for a new programming language via tree-sitter. An extractor teaches RepoNova how to parse a language's AST and extract symbols, imports, and references for graph construction.

#### Steps

1. **Create** `src/extract/languages/.ts` implementing the `LanguageExtractor` interface
2. **Register** it in `src/extract/languages/registry.ts` (or at runtime via `registerExtractor()`)
3. **Add** the tree-sitter WASM grammar to `grammars/` (e.g., `tree-sitter-javascript.wasm`)

#### `LanguageExtractor` interface

```
interface LanguageExtractor {
  /** Language identifier, must match tree-sitter grammar name (e.g., "javascript") */
  readonly languageId: string;

  /** File extensions this extractor handles (e.g., [".js", ".mjs", ".cjs"]) */
  readonly extensions: string[];

  /**
   * WASM grammar filename (e.g., "tree-sitter-javascript.wasm").
   * If provided: pipeline parses with tree-sitter and passes the SyntaxTree.
   * If omitted: extract() receives a null tree and must work from sourceCode directly.
   * (Markdown and diagram extractors use this, no WASM needed.)
   */
  readonly wasmFile?: string;

  /**
   * Extract symbols, imports, and references from a single source file.
   * @param tree - Parsed tree-sitter AST (null if wasmFile not set)
   * @param sourceCode - Raw file content
   * @param filePath - Relative path (normalized, forward slashes)
   */
  extract(tree: SyntaxTree | null, sourceCode: string, filePath: string): FileExtraction;

  /**
   * Resolve an import module path to candidate file paths.
   * Example: "config.loader" → ["config/loader.py", "config/loader/__init__.py"]
   * Return empty array for external/third-party modules.
   */
  resolveImportPath(importModule: string, currentFilePath: string): string[];
}
```

#### `FileExtraction` return type

```
interface FileExtraction {
  filePath: string;               // Relative path (forward slashes)
  language: string;               // Must match languageId
  symbols: SymbolNode[];          // Functions, classes, methods, variables
  imports: ImportDeclaration[];   // Import/export statements
  references: SymbolReference[];  // Calls, type annotations, inheritance refs
}
```

**Key types your extractor produces:**

| Type | Fields | Purpose |
|------|--------|---------|
| `SymbolNode` | `name`, `qualifiedName`, `kind`, `signature?`, `decorators`, `docstring?`, `startLine`, `endLine`, `parent?`, `bases?`, `calls` | A symbol defined in the file |
| `ImportDeclaration` | `module`, `names`, `isWildcard`, `isExport?`, `line` | An import/export statement |
| `SymbolReference` | `name`, `fromSymbol`, `kind` (`"call"` \| `"type_annotation"` \| `"attribute_access"` \| `"inheritance"`), `line` | A reference to another symbol |
| `SymbolKind` | `"function"` \| `"class"` \| `"method"` \| `"variable"` \| `"constant"` \| `"interface"` \| `"enum"` \| `"module"` \| `"document"` \| `"section"` | Symbol classification |

See `src/extract/types.ts` for the full type definitions and JSDoc.

#### How tree-sitter parsing works

1. If `wasmFile` is set, the pipeline loads `grammars/`, parses the source, and passes the `SyntaxTree` to `extract()`
2. If `wasmFile` is omitted, `extract()` receives `null` as the tree and must work directly from `sourceCode`
3. WASM grammars are loaded from the `grammars/` directory relative to the package root
4. The `SyntaxTree` / `SyntaxNode` types match the WASM interface of [web-tree-sitter](https://github.com/nicolo-ribaudo/tree-sitter-wasm-prebuilt)

#### Runtime registration

You can also register an extractor at runtime through the public API (it must be called before `build`):

```
import { registerExtractor } from "reponova";
import type { LanguageExtractor } from "reponova";

const myExtractor: LanguageExtractor = { /* ... */ };
registerExtractor(myExtractor);
```

Note: a duplicate `languageId` or duplicate `extensions` silently overrides the previous extractor.

#### Reference implementations

See `src/extract/languages/python.ts` for a complete tree-sitter-based extractor and `src/extract/languages/markdown.ts` for a non-tree-sitter (regex) extractor.

### Adding outline support

Outlines (`graph_outline`) use a system that is **separate from extraction**. They have their own registry, interface, and implementations.

#### Steps

1. **Create** `src/outline/languages/.ts` implementing the `LanguageSupport` interface
2. **Register** it in `src/outline/languages/registry.ts` via `registerOutlineLanguage()`
3. The same WASM grammars in `grammars/` are shared with the extraction system

#### `LanguageSupport` interface

```
interface LanguageSupport {
  /** WASM grammar filename (e.g., "tree-sitter-python.wasm") */
  readonly wasmFile: string;

  /** Extract outline from tree-sitter AST (primary method) */
  treeSitterExtract(rootNode: SyntaxNode, filePath: string, lineCount: number): FileOutline;

  /** Extract outline from raw source via regex (fallback when WASM unavailable) */
  regexExtract(filePath: string, source: string, lineCount: number): FileOutline;
}
```

#### Runtime registration

You can also register an outline language at runtime through the public API (it must be called before `build`):

```
import { registerOutlineLanguage } from "reponova";
import type { LanguageSupport } from "reponova";

const myOutline: LanguageSupport = { /* ... */ };
registerOutlineLanguage("rust", ["rs"], myOutline);
```

Note: a duplicate language `names` or `extensions` entry silently overrides the previous registration. See `src/outline/languages/python.ts` for a reference implementation.

## License

MIT - [CristianoCiuti/reponova](https://github.com/CristianoCiuti/reponova)
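## Appendix: regex outline fallback (sketch)

For contributors adding outline support, the regex fallback path (`regexExtract`) can be as simple as a line-by-line scan of the raw source. The sketch below is standalone and hypothetical: the `SketchOutline` shape, the `regexExtractRust` name, and both patterns are assumptions made for illustration. RepoNova's real `FileOutline` type is not shown in this README and may look different, so treat this as a starting point rather than the project's implementation.

```typescript
// Hypothetical outline shape, for illustration only. RepoNova's actual
// FileOutline type is not documented here and may differ.
interface SketchOutlineEntry {
  kind: "function" | "struct";
  name: string;
  line: number; // 1-based line number of the declaration
}

interface SketchOutline {
  filePath: string;
  lineCount: number;
  entries: SketchOutlineEntry[];
}

// Regex fallback in the spirit of LanguageSupport.regexExtract:
// scan the raw source line by line when no tree-sitter AST is available.
function regexExtractRust(filePath: string, source: string, lineCount: number): SketchOutline {
  const entries: SketchOutlineEntry[] = [];
  // Optional visibility (pub, pub(crate), ...), optional async, then `fn name`.
  const fnPattern = /^\s*(?:pub(?:\([^)]*\))?\s+)?(?:async\s+)?fn\s+([A-Za-z_]\w*)/;
  // Optional visibility, then `struct Name`.
  const structPattern = /^\s*(?:pub(?:\([^)]*\))?\s+)?struct\s+([A-Za-z_]\w*)/;

  source.split("\n").forEach((text, index) => {
    const fnMatch = fnPattern.exec(text);
    if (fnMatch) {
      entries.push({ kind: "function", name: fnMatch[1], line: index + 1 });
      return;
    }
    const structMatch = structPattern.exec(text);
    if (structMatch) {
      entries.push({ kind: "struct", name: structMatch[1], line: index + 1 });
    }
  });

  return { filePath, lineCount, entries };
}

// Usage: outline a tiny Rust snippet.
const sample = [
  "pub struct Point {",
  "    x: i32,",
  "}",
  "",
  "pub fn origin() -> Point {",
  "    Point { x: 0 }",
  "}",
].join("\n");

const outline = regexExtractRust("src/point.rs", sample, sample.split("\n").length);
console.log(outline.entries);
```

A line-based regex scan misses multi-line declarations and anything inside macros, which is why the interface treats it as a fallback and keeps the tree-sitter path as the primary method.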
