GitHub: codegraph-ai/CodeGraph
# CodeGraph
**Cross-language code intelligence for AI agents and developers.**
CodeGraph builds a semantic graph of your codebase (functions, classes, imports, call chains) and exposes it through **45 MCP tools**, a **VS Code extension**, and a **persistent memory layer**. It parses **38 languages** via tree-sitter, so AI agents get structured code understanding instead of grepping through files.
## Quick Start
### MCP Server (Claude Code, Cursor, any MCP client)
Add to `~/.claude.json` (or your MCP client config):

    {
      "mcpServers": {
        "codegraph": {
          "command": "/path/to/codegraph-server",
          "args": ["--mcp"]
        }
      }
    }

The server indexes the current working directory automatically.
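
If the client config already lists other servers, the entry can be merged in rather than pasted by hand. The following is a minimal sketch, assuming the config file is plain JSON with a top-level `mcpServers` object as shown above; the helper names `add_codegraph_server` and `update_config_file` are hypothetical and not part of CodeGraph.

```python
import json
from pathlib import Path
from typing import Any, Dict


def add_codegraph_server(config: Dict[str, Any], binary_path: str) -> Dict[str, Any]:
    """Return a copy of an MCP client config with a "codegraph" server entry.

    Existing servers are preserved; an existing "codegraph" entry is replaced.
    """
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    servers["codegraph"] = {"command": binary_path, "args": ["--mcp"]}
    updated["mcpServers"] = servers
    return updated


def update_config_file(config_path: Path, binary_path: str) -> None:
    """Read the config (or start from an empty one), add the entry, write it back."""
    existing = json.loads(config_path.read_text()) if config_path.exists() else {}
    merged = add_codegraph_server(existing, binary_path)
    config_path.write_text(json.dumps(merged, indent=2) + "\n")
```

Calling `update_config_file(Path.home() / ".claude.json", "/path/to/codegraph-server")` would apply it to the file named above; back the file up first, since the sketch rewrites it in full.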
### VS Code Extension
Install the VSIX:

    code --install-extension codegraph-0.14.0.vsix

The extension starts the server automatically and registers all tools as Language Model Tools for Copilot.
### Rules for AI agents
Pre-configured rule files that teach AI coding agents (Claude, Cursor,
Windsurf, Codex, Cline) to use CodeGraph MCP tools before falling back
to grep / multi-file reads. Maps natural-language intent to the right
`codegraph_*` tool.
→ **[codegraph-ai/codegraph-rules-for-agents](https://github.com/codegraph-ai/codegraph-rules-for-agents)**
Setup is `cp /codegraph.md ~//` (one line per agent — see
the rules repo's README).
### GitHub Action — PR review in CI
Drop a workflow into your repo to get an automatic code-graph analysis
comment on every PR — blast radius, test gaps, stale docs, suggested
reviewers. Runs **graph-only** (no embeddings, no ONNX model), so it's
fast and needs no API keys — just the built-in `GITHUB_TOKEN`.
Copy [`.github/workflows/codegraph-pr.yml`](.github/workflows/codegraph-pr.yml)
into your repo. The core invocation is a single command:

    codegraph-server --graph-only \
      --run-tool codegraph_pr_context \
      --tool-args '{"baseBranch":"main","format":"markdown"}'

This prints a ready-to-post markdown comment. The `--graph-only` flag
skips embedding generation (10-50× faster indexing); `--run-tool` runs
one tool and exits without the MCP stdio handshake — ideal for scripting.
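
When the one-shot mode is driven from a script rather than a workflow file, building the argument list programmatically avoids shell-quoting mistakes in the JSON payload. This is a sketch, assuming `codegraph-server` is on `PATH` and that the tool result is written to standard output; the function names are hypothetical.

```python
import json
import subprocess
from typing import List


def build_pr_context_command(binary: str = "codegraph-server",
                             base_branch: str = "main") -> List[str]:
    """Build the argv for the one-shot PR analysis shown above."""
    tool_args = {"baseBranch": base_branch, "format": "markdown"}
    return [
        binary,
        "--graph-only",
        "--run-tool", "codegraph_pr_context",
        "--tool-args", json.dumps(tool_args),  # JSON passed as a single argument
    ]


def run_pr_context(base_branch: str = "main") -> str:
    """Run the command and return whatever it printed (assumed: the markdown comment)."""
    result = subprocess.run(
        build_pr_context_command(base_branch=base_branch),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Passing the arguments as a list means no shell is involved, so the braces and quotes in the JSON reach the server unchanged.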
## Configuration
### MCP Server flags
| Flag | Default | Description |
|------|---------|-------------|
| `--workspace ` | current dir | Directories to index (repeatable for multi-project) |
| `--exclude ` | — | Directories to skip (repeatable) |
| `--embedding-model ` | `bge-small` | `bge-small` (384d, fast), `jina-code-v2` (768d, 6× slower), or `granite-97m` (384d, 32K ctx, ~3× slower) |
| `--full-body-embedding` | `true` | Embed full function body (~50 lines) for better semantic search and duplicate detection |
| `--max-files ` | 5000 | Maximum files to index |
| `--profile ` | `all` | Filter the exposed MCP tool surface to a named subset (see below) |
| `--graph-only` | off | Skip embedding generation — build the graph and serve structural tools only. No ONNX model load, 10-50× faster indexing. Semantic search unavailable. For CI / one-shot graph queries. |
| `--run-tool ` | — | One-shot mode: index, run a single tool, print its result, exit. No MCP handshake. Pair with `--tool-args ''`. |
#### `--profile` — narrow the MCP tool surface
Exposing every tool is convenient but inflates the agent's prompt-context cost. A profile exposes only the slice you need (also settable via the `CODEGRAPH_TOOL_PROFILE` env var):
| Profile | Tools | Use when |
|---------|-------|----------|
| `all` *(default)* | every tool (community + pro) | normal sessions |
| `core` | 8 — search + symbol info + AI context | chatty agent sessions where you only need lookups |
| `graph` | 16 — callers/callees/deps/impact/traverse | refactoring + structural analysis |
| `memory` | 7 — `codegraph_memory_*` only | note-taking / knowledge-base workflows |
| `security` | pro security tools only (empty on community) | pro security audits |
### VS Code settings

    {
      "codegraph.indexOnStartup": true,
      "codegraph.indexPaths": ["/path/to/project-a", "/path/to/project-b"],
      "codegraph.excludePatterns": ["**/cmake-build-debug/**", "**/generated/**"],
      "codegraph.embeddingModel": "bge-small",
      "codegraph.maxFileSizeKB": 1024,
      "codegraph.debug": false
    }

Full-body embeddings are enabled by default. Function body text is captured at parse time with zero I/O overhead.
Built-in exclusions (always skipped) cover ~47 directories across three categories:
- **Build / cache**: `node_modules`, `target`, `dist`, `build`, `out`, `.git`, `__pycache__`, `vendor`, `.venv`, `venv`, `.tox`, `.pytest_cache`, `.mypy_cache`, `.ruff_cache`, `.next`, `.nuxt`, `.svelte-kit`, `.parcel-cache`, `.npm`, `.yarn`, `.pnpm-store`, `.cache`, `.cargo`, `.bundle`, `.gradle`, `DerivedData`, `Pods`, `xcuserdata`, `cmake-build-*`
- **IDE / IaC state**: `.idea`, `.vscode-test`, `.fleet`, `.terraform`, `.terragrunt-cache`, `.serverless`
- **Sensitive credential dirs**: `.aws`, `.ssh`, `.gnupg`, `.kube`, `.docker`
Plus glob patterns for binary archives, native libraries, OS metadata, and **secret file extensions** (`*.pem`, `*.key`, `*.p12`, `*.pfx`, `*.crt`, `*.gpg`, `*.kdbx`, SSH key conventions like `id_rsa`, etc.) — defense in depth against accidentally embedding credentials.
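
To make the two layers of filtering concrete (directory names first, then file-name globs), here is an illustrative sketch. It uses only a small subset of the names listed above and is not the server's actual matching code; the constant and function names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Small illustrative subset of the built-in exclusions listed above.
EXCLUDED_DIRS = {"node_modules", "target", ".git", ".venv", ".aws", ".ssh"}
EXCLUDED_DIR_GLOBS = ["cmake-build-*"]
EXCLUDED_FILE_GLOBS = ["*.pem", "*.key", "*.p12", "*.pfx", "id_rsa"]


def is_excluded(path: str) -> bool:
    """Return True if any directory component or the file name is excluded."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    *directories, file_name = parts
    for directory in directories:
        if directory in EXCLUDED_DIRS:
            return True
        if any(fnmatchcase(directory, pattern) for pattern in EXCLUDED_DIR_GLOBS):
            return True
    # Secret-like file names are skipped even inside otherwise indexed directories.
    return any(fnmatchcase(file_name, pattern) for pattern in EXCLUDED_FILE_GLOBS)
```

Checking file names separately from directories is what gives the defense-in-depth effect: a stray `server.pem` in `src/` is skipped even though `src/` itself is indexed.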
### Code Analysis (11)
| Tool | What it does |
|------|-------------|
| `get_ai_context` | **Primary context tool.** Intent-aware (explain/modify/debug/test) with token budgeting. Returns source, related symbols, imports, siblings, debug hints. |
| `get_edit_context` | Everything needed before editing: source + callers + tests + memories + git history |
| `get_curated_context` | Cross-codebase context for a natural language query ("how does auth work?") |
| `analyze_impact` | Blast radius prediction — what breaks if you modify, delete, or rename |
| `analyze_complexity` | Cyclomatic complexity with breakdown (branches, loops, nesting, exceptions, early returns) |
| `find_circular_deps` | Detect circular import/dependency chains across files |
| `find_hot_paths` | Most-called functions ranked by transitive caller count |
| `find_dead_imports` | Find unused imports — modules imported but never referenced |
| `get_module_summary` | High-level summary of a directory: file count, functions, language breakdown, top complex functions |
| `search_by_pattern` | Regex search across function bodies, signatures, names, and docstrings |
| `search_by_error` | Find functions that throw, catch, or handle specific error types |
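
For readers unfamiliar with what a circular-dependency check such as `find_circular_deps` involves, the sketch below shows one common approach: depth-first search with a "currently visiting" marker over a file-to-imports map. It is a generic illustration under that assumption, not CodeGraph's implementation, and `find_import_cycle` is a hypothetical name.

```python
from typing import Dict, List, Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored


def find_import_cycle(imports: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a path (first node repeated at the end), or None."""
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dependency in imports.get(node, []):
            state = color.get(dependency, WHITE)
            if state == GRAY:
                # Back edge: the dependency is already on the current path.
                return path[path.index(dependency):] + [dependency]
            if state == WHITE:
                cycle = visit(dependency)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for start in imports:
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None
```

The recursion keeps the sketch short; a very deep import graph would need an explicit stack instead.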
### Code Navigation (13)
| Tool | What it does |
|------|-------------|
| `symbol_search` | Find symbols by name or natural language (hybrid BM25 + semantic search) |
| `get_callers` / `get_callees` | Who calls this? What does it call? (with transitive depth) |
| `get_detailed_symbol` | Full symbol info: source, callers, callees, complexity |
| `get_symbol_info` | Quick metadata: signature, visibility, kind |
| `get_dependency_graph` | File/module import relationships with depth control |
| `get_call_graph` | Function call chains (callers and callees) |
| `find_by_imports` | Find files importing a module |
| `find_by_signature` | Search by param count, return type, modifiers |
| `find_entry_points` | Main functions, HTTP handlers, CLI commands, event handlers |
| `find_implementors` | Find all functions registered as ops struct callbacks |
| `find_related_tests` | Tests that exercise a given function |
| `traverse_graph` | Custom graph traversal with edge/node type filters |
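
The "transitive depth" option on `get_callers` / `get_callees` can be pictured as a breadth-first walk over call edges with a depth limit. The sketch below assumes a simple in-memory map from each function to its direct callers; the data shape and the name `transitive_callers` are illustrative assumptions, not the server's API.

```python
from collections import deque
from typing import Dict, List


def transitive_callers(callers_of: Dict[str, List[str]],
                       target: str,
                       max_depth: int) -> Dict[str, int]:
    """Map each caller reachable within max_depth hops to its distance from target."""
    distances: Dict[str, int] = {}
    queue = deque([(target, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for caller in callers_of.get(node, []):
            # Skip nodes already seen so recursive call chains terminate.
            if caller != target and caller not in distances:
                distances[caller] = depth + 1
                queue.append((caller, depth + 1))
    return distances
```

The size of the returned map at a given depth is also a rough stand-in for blast radius: the more functions that reach the target, the wider the effect of changing it.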
### Indexing (3)
| Tool | What it does |
|------|-------------|
| `reindex_workspace` | Full or incremental workspace reindex |
| `index_files` | Add/update specific files without full reindex |
| `index_directory` | Add directory to graph alongside existing data |
### Memory (7)
Persistent AI context across sessions — debugging insights, architectural decisions, known issues.
| Tool | What it does |
|------|-------------|
| `memory_store` / `memory_get` / `memory_search` | Store, retrieve, search memories (BM25 + semantic) |
| `memory_context` | Get memories relevant to a file/function |
| `memory_list` / `memory_invalidate` / `memory_stats` | Browse, retire, monitor |
Pairs well with [Tempera](https://github.com/anvanster/tempera) — an episodic memory system that captures transferable debugging strategies and solutions across projects. CodeGraph's memory tools store project-scoped notes; Tempera captures cross-project BKMs (best-known methods) that improve over time.
### PR / Change Analysis (1)
| Tool | What it does |
|------|-------------|
| `pr_context` | **One-call PR review.** Runs git diff against base branch, finds changed functions in the graph, reports: blast radius (callers), test coverage + gaps, affected modules, diff-aware change classification (signature vs body), stale-doc warnings, complexity, commit-message hint, suggested reviewers from git blame. |
### Documentation (7)
Persistent project documentation — index design docs, search them semantically, verify code matches the design, generate architecture docs from the code graph.
| Tool | What it does |
|------|-------------|
| `index_markdown` | Index a local `.md` file (ARCHITECTURE.md, API_DESIGN.md, etc.) into the persistent docs store. Heading-tree chunking with leaf-node embeddings. |
| `search_docs` | Semantic search over indexed docs — returns matching sections with heading-path breadcrumbs |
| `list_doc_sources` | List all indexed source files |
| `remove_doc_source` | Remove all indexed chunks from a source file |
| `verify_design` | Cross-reference doc claims vs code graph. `direction=forward` (doc→code), `reverse` (code→doc), or `both` |
| `design_gaps` | Find identifiers described in docs that don't exist in code yet — build TODO lists from specs |
| `generate_architecture_doc` | Auto-generate a structured ARCHITECTURE.md from the live code graph (modules, hot paths, complexity, circular deps) |
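
"Heading-tree chunking" with heading-path breadcrumbs can be sketched as follows: split a Markdown file at headings and label each chunk with the chain of headings above it. This is a simplified illustration, assuming ATX (`#`) headings only and ignoring headings inside code blocks; the `" > "` separator mirrors the breadcrumb style in the usage examples, and `chunk_by_headings` is a hypothetical name.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headings(markdown: str) -> List[Dict[str, str]]:
    """Split markdown into chunks, each tagged with its heading-path breadcrumb."""
    chunks: List[Dict[str, str]] = []
    heading_path: List[Tuple[int, str]] = []  # (level, title) from root to current
    body: List[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            breadcrumb = " > ".join(title for _, title in heading_path)
            chunks.append({"breadcrumb": breadcrumb, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belonged to the previous heading
            level = len(match.group(1))
            # Drop headings at the same or deeper level before descending again.
            while heading_path and heading_path[-1][0] >= level:
                heading_path.pop()
            heading_path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk's text would then be the unit that gets embedded, with the breadcrumb returned alongside search hits.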
All tool names are prefixed with `codegraph_` (e.g. `codegraph_get_ai_context`). Tools that target a specific symbol accept `uri` + `line` or `nodeId` from `symbol_search` results.
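
MCP clients normally issue tool calls for you, but it can help to see what one looks like on the wire. The sketch below builds a JSON-RPC 2.0 `tools/call` request as defined by the MCP specification, using the `codegraph_` prefix and the `uri` + `line` / `nodeId` targeting described above. The helper names are hypothetical, and a real session also needs the MCP initialization handshake first.

```python
import json
from typing import Any, Dict, Optional


def symbol_target(uri: Optional[str] = None,
                  line: Optional[int] = None,
                  node_id: Optional[str] = None) -> Dict[str, Any]:
    """Build the arguments that identify a symbol: either nodeId or uri + line."""
    if node_id is not None:
        return {"nodeId": node_id}
    if uri is not None and line is not None:
        return {"uri": uri, "line": line}
    raise ValueError("provide node_id, or both uri and line")


def tool_call_request(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request, adding the codegraph_ prefix if missing."""
    name = tool if tool.startswith("codegraph_") else "codegraph_" + tool
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)
```

For example, `tool_call_request(1, "get_callers", symbol_target(uri="file:///projects/myapp/src/auth.rs", line=42))` produces a request naming `codegraph_get_callers`.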
### Usage examples
**Index a design doc and search it:**

    codegraph_index_markdown(path: "/projects/myapp/docs/ARCHITECTURE.md")
    codegraph_search_docs(query: "how does the auth module handle JWT refresh?")

**Check if the code matches the design:**

    codegraph_verify_design(source: "/projects/myapp/docs/ARCHITECTURE.md", direction: "forward")
    // → "132/132 identifiers verified, 0 gaps"

**Find what's described in docs but not yet implemented:**

    codegraph_design_gaps(source: "/projects/myapp/docs/API_DESIGN.md")
    // → "4 of 12 identifiers not found in code: PaymentService, RefundHandler, ..."

**Generate architecture docs from the code graph:**

    codegraph_generate_architecture_doc(scope: "src/", topN: 5)
    // → Markdown with modules, complexity hotspots, hot paths, circular deps

**Save a debugging insight for future sessions:**

    codegraph_memory_store(kind: "debug_context", title: "Nginx body size limit",
      content: "The /upload endpoint fails on payloads > 1MB...",
      problem: "API returns 500 on large uploads",
      solution: "Increase nginx client_max_body_size to 10M",
      agentSource: "claude")

**Get AI context with graph compression stats + design doc augmentation:**

    codegraph_get_ai_context(uri: "file:///projects/myapp/src/auth.rs", line: 42, intent: "modify")
    // → Code context + graphStats: {entitiesInGraph: 13555, entitiesTraversed: 47, entitiesKept: 8}
    // → design_context section from indexed docs mentioning "auth"

**Review a PR: blast radius, test gaps, stale docs, reviewers in one call:**

    codegraph_pr_context(baseBranch: "main")
    // → "PR changes 4 files (+263/-77, 12 functions). 37 direct callers, 8 tests, 3 untested. Risk: medium."
    // → test_gaps: [refresh_token, revoke_session] — functions with 0 test callers
    // → stale_docs: ["auth.rs described in ARCHITECTURE.md > Authentication — doc may need updating"]
    // → suggested_reviewers: [{author: "anvanster", lines_owned: 3200}]
    // → commit_hint: "feat(mcp): "

**Narrow the tool surface for chatty sessions:**

    codegraph-server --mcp --profile=core  # Only 8 tools: search + symbol info + AI context

### CodeGraph Pro
Additional tools available in [CodeGraph Pro](https://codegraph.astudioplus.com/pro):
| Tool | What it does |
|------|-------------|
| `scan_security` | Security vulnerability scan: 40+ dangerous function patterns, source-to-sink taint tracing, auth coverage for HTTP endpoints (7 languages/frameworks), architectural layer violations, weak crypto, hardcoded secrets |
| `analyze_coupling` | Module coupling metrics and instability scores |
| `find_unused_code` | Dead code detection with confidence scoring |
| `find_duplicates` | Detect duplicate/near-duplicate functions |
| `find_similar` / `cluster_symbols` / `compare_symbols` | Embedding-based code similarity |
| `cross_project_search` | Search across all indexed projects |
| `mine_git_history` / `mine_git_history_for_file` / `search_git_history` | Git history mining and semantic search |
| `security_control_flow` | Map every execution path through a function — "can this return without hitting the auth check?" |
| `security_trace_data_flow` | Follow a variable from birth to death — "does user input reach this SQL query?" |
| `security_generate_sbom` | CycloneDX SBOM from 8 lockfile formats |
| `security_audit_deps` | OSV vulnerability check on dependencies |
| `security_check_unchecked_returns` / `_resource_leaks` / `_misconfig` / `_input_validation` / `_error_exposure` | 5 heuristic analyzers covering ~80% of CWE Top 25 |
| `security_scan_iac` | Docker / Kubernetes / Terraform misconfiguration scan |
| `security_check_licenses` | Lockfile license policy enforcement (copyleft detection) |
| `security_check_secrets_entropy` | Shannon-entropy hardcoded-secret detection |
| `security_detect_injection` | Focused SQL/XSS/cmd/path/deser/template injection detection (20 patterns) |
| `security_check_search_path` | Untrusted search-path / DLL-hijacking detection (CWE-426/CWE-427) |
| `security_check_crypto` | Cryptographic misuse: weak ciphers/hashes/PRNG/keys, static IVs, timing-leak comparisons (CWE-208/326-330/338/916, 35 patterns) |
| `security_export_sarif` | Aggregate findings as SARIF 2.1.0 (GitHub Code Scanning, GitLab SAST) |
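
As background for the entropy-based secret check listed above, the following sketch shows how Shannon entropy separates random-looking tokens from ordinary identifiers. It is a generic illustration and not the Pro tool's implementation; the length and entropy thresholds are arbitrary assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character of the character distribution in text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token: str, min_length: int = 20, threshold: float = 4.0) -> bool:
    """Flag long tokens whose characters are close to uniformly distributed.

    min_length and threshold are illustrative defaults, not values from CodeGraph.
    """
    return len(token) >= min_length and shannon_entropy(token) >= threshold
```

Repetitive strings score near zero while random keys approach the maximum for their alphabet, which is why entropy alone is usually combined with context (variable names, file types) to limit false positives.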
**Cross-cutting features (all `security_check_*` tools):**
- `include_tests` / `treat_as_production` — first-class skip for tests/samples/vendored
- `check_compile_gates` — C/C++ findings inside `#ifdef X` are marked DEFENSIVE_GATED_OFF when X isn't defined by CMake/Cargo/Makefile
- 25-marker suppression honoring (`# nosec`, `// NOLINT`, `// codeql[ignore]`, `# rubocop:disable`, etc.) at line and function level
- Telemetry blocks per scan: `path_filter` (examined/matched/skipped) + `compile_gate` (gated_off count)
## Languages
38 languages parsed via tree-sitter — functions, classes, imports, call graph, complexity metrics, dependency graphs, symbol search, and impact analysis:
| Category | Languages |
|---|---|
| **Systems** | C, C++, Rust, Zig, Objective-C |
| **JVM** | Java, Kotlin, Scala, Groovy, Clojure |
| **Web/Scripting** | TypeScript/JS, Python, Ruby, PHP, Perl, Lua, Elixir, Elm |
| **Web/Style** | CSS |
| **Mobile** | Swift, Dart |
| **Functional** | Haskell, OCaml, Julia, Erlang, Elm, Clojure |
| **Enterprise** | C#, COBOL, Fortran, Go |
| **Blockchain** | Solidity |
| **Shell/Config** | Bash, HCL/Terraform, TOML, YAML |
| **Hardware** | Verilog/SystemVerilog, Tcl |
| **Data Science** | R, Julia |
HTTP handler detection: Python (FastAPI/Flask/Django), TypeScript (NestJS), Java (Spring/JAX-RS), Go (stdlib/Gin/Echo/Fiber), C# (ASP.NET), Ruby (Rails), PHP (Laravel/Symfony).
## Architecture

    MCP Client (Claude, Cursor, ...)     VS Code Extension
                   |                             |
              MCP (stdio)                  LSP Protocol
                   |                             |
                   └───────────┐     ┌───────────┘
                               ▼     ▼
                   ┌─────────────────────────────┐
                   │      codegraph-server       │
                   ├─────────────────────────────┤
                   │  38 tree-sitter parsers     │
                   │  Semantic graph engine      │
                   │  AI query engine (BM25)     │
                   │  Memory layer (RocksDB)     │
                   │  Docs store (RocksDB+HNSW)  │
                   │  Full-body embeddings (BGE) │
                   │  HNSW vector index          │
                   └─────────────────────────────┘

A single Rust binary serves both MCP and LSP protocols.
- **Indexing**: ~60 files/sec. Incremental re-indexing on file changes via FNV-1a content hashing.
- **Persistence**: Graph and embeddings persist to `~/.codegraph/graph.db` (RocksDB). Instant startup on restart — no re-parsing, no re-embedding.
- **Queries**: Sub-100ms. Cross-file import and call resolution at index time.
- **Embeddings**: Full-body (function bodies captured at parse time, zero disk I/O). Vectors stored in RocksDB alongside the graph. Auto-downloads model on first run.
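
The incremental re-indexing idea (hash each file's content and re-parse only when the hash changes) can be sketched in a few lines. The source names FNV-1a but not the hash width, so the 64-bit variant used here is an assumption, as are the function names and the in-memory dictionaries standing in for the persisted index.

```python
from typing import Dict, List

FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    value = FNV64_OFFSET_BASIS
    for byte in data:
        value ^= byte
        value = (value * FNV64_PRIME) & MASK64
    return value


def changed_files(previous_hashes: Dict[str, int],
                  current_contents: Dict[str, bytes]) -> List[str]:
    """Return paths whose content hash is new or differs from the stored one."""
    return sorted(
        path for path, content in current_contents.items()
        if previous_hashes.get(path) != fnv1a_64(content)
    )
```

FNV-1a is not cryptographic, but it is cheap and adequate for detecting edits, which is all change detection needs.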
## Building from Source

    git clone https://github.com/codegraph-ai/codegraph
    cd codegraph
    cargo build --release -p codegraph-server       # Rust server
    cd vscode && npm install && npm run esbuild     # VS Code extension
    npx @vscode/vsce package                        # VSIX

Requires Rust stable, Node.js 18+, VS Code 1.90+.
## License
Apache-2.0