GitHub: ryanking3/nullcontext-runner
# NullContext
NullContext is a local-first secure inference environment for running LLM sessions with explicit lifecycle visibility, audit reporting, configurable persistence behavior, and local browser-based runtime inspection.
The project currently targets local inference workflows using:
- Rust
- llama.cpp
- Axum
- React
- local GGUF models
- CUDA acceleration (Windows)
- browser-based localhost UI
NullContext is designed around the idea that local inference systems should expose:
- what was stored
- what was retained
- what was deleted
- what cleanup operations occurred
- what residual risks remain
rather than treating local inference as an opaque black box.
## Current Architecture
Browser UI
↓
Local Axum API server
↓
NullContext runtime
↓
llama.cpp
↓
Local GGUF model
The entire stack runs locally.
No cloud inference is required.
## Current Features
### Local Inference Runtime
- llama.cpp backend integration
- local GGUF model support
- stdin-based prompt ingestion
- one-shot streaming inference
- one-shot corpus-grounded retrieval
- active chat sessions with runtime reuse
- active-chat corpus-grounded retrieval
- configurable inference modes
- persistent and ephemeral sessions
- configurable token limits
- configurable GPU offload
- Windows CUDA support
- local HTTP API server
### Security / Privacy Features
- explicit workspace lifecycle management
- recursive artifact scanning
- Rust-owned buffer zeroization
- RAM zeroization verification
- llama runtime exposure reporting
- live llama runtime RAM/VRAM usage snapshots
- post-shutdown llama runtime inspection
- macOS vmmap-based RAM inspection and resident-region delta analysis
- Windows PowerShell-based process memory observation
- NVIDIA `nvidia-smi` compute-apps and `pmon` fallback inspection paths
- audit operation tracking
- sanitization operation reporting
- structured privacy reports
- configurable retention behavior
- manual cleanup and reconcile actions for retained sessions
- scheduled retention expiry cleanup
- startup lifecycle reconciliation for orphaned sessions/workspaces
- lifecycle-aware privacy reporting
- corpus lifecycle cleanup and reconcile actions
- corpus retention policy controls
- corpus report syncing after lifecycle changes
- explicit End + Sanitize workflow for active chat
- residual risk reporting for long-lived runtimes
### Session Registry
Persistent sessions are indexed locally:
~/.nullcontext/index.json
The registry tracks:
- session IDs
- timestamps
- security mode
- selected model IDs and names
- workspace paths
- report paths
- cleanup state
- lifecycle state
- retention policies and deadlines
- artifact counts
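As a rough illustration of how a tool might consume this registry, the sketch below selects sessions whose retention deadline has passed. The entry keys (`session_id`, `retention_deadline`) and the ISO 8601 timestamp format are assumptions made for the example; the actual `index.json` schema is not documented here.

```python
from datetime import datetime, timezone


def expired_session_ids(entries, now):
    """Return IDs of entries whose (hypothetical) retention deadline has passed.

    `entries` is a list of dicts. The `session_id` and `retention_deadline`
    keys are assumed names, not the documented index.json schema.
    """
    expired = []
    for entry in entries:
        deadline = entry.get("retention_deadline")
        if not deadline:
            continue  # no deadline recorded: nothing to expire
        if datetime.fromisoformat(deadline) <= now:
            expired.append(entry["session_id"])
    return expired


# Made-up registry entries for illustration only.
sample = [
    {"session_id": "a1", "retention_deadline": "2024-01-01T00:00:00+00:00"},
    {"session_id": "b2", "retention_deadline": "2999-01-01T00:00:00+00:00"},
    {"session_id": "c3", "retention_deadline": None},
]
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(expired_session_ids(sample, now))
```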
### Model Registry
The local model registry supports:
- default model selection
- named model IDs
- per-model token, GPU, template, and context defaults
- model switching in the browser UI and API
- model file validation
- llama-server runtime readiness reporting
### Corpus Registry
The local corpus registry supports:
- txt, markdown, and pdf ingestion
- hybrid pdf extraction with OCR for sparse pages
- browser-native file upload ingestion with drag/drop and upload progress
- persistent and ephemeral corpora
- local chunking and embedding artifacts
- direct corpus querying through the API
- one-shot and active-chat grounding
- corpus lifecycle cleanup, reconcile, and retention controls
- startup lifecycle reconciliation for orphaned corpora
- retained ingestion reports with lifecycle metadata
- structured corpus report viewing with optional raw JSON inspection
### Local Web UI
The current browser UI supports:
- one-shot prompt execution
- one-shot corpus selection for grounded runs
- active chat session start, stream, stop, and end
- active chat corpus binding for grounded sessions
- dedicated model registry browser
- dedicated corpus registry browser
- path-based corpus ingestion
- browser-native corpus file upload ingestion
- chatbot-style composer uploads for txt/md/pdf grounding corpora
- structured corpus report viewing inside the corpus browser
- model selection for one-shot and active chat
- model-default versus manual-override controls
- selectable active chat prompt template
- configurable active chat context token budget and turn limit
- runtime lifecycle visualization
- audit operation inspection
- privacy report inspection
- runtime log inspection
- persistent session browsing
- dark/light terminal-style UI
- before-unload warning while active chat runtime is live
- local-only API interaction
- localhost-only execution
## Security Modes
### secure
Default mode.
Characteristics:
- ephemeral workspace
- automatic cleanup
- audit reporting
- artifact scanning
- buffer sanitization
- stdin prompt ingestion recommended
### standard
Allows persistent sessions.
Characteristics:
- retained workspace
- retained reports
- session registry indexing
### air-gapped
Reserved for stricter future runtime policies.
Currently behaves similarly to secure mode.
## Runtime Lifecycle
A typical session lifecycle:
1. Prompt ingestion
2. Runtime launch
3. Local inference
4. Artifact scan
5. Audit operation emission
6. Buffer sanitization
7. Workspace cleanup or retention
8. Privacy report generation
9. Session indexing (persistent only)
### One-shot mode
One prompt creates a full lifecycle:
create session
→ launch llama-server
→ stream completion
→ shutdown runtime
→ scan artifacts
→ sanitize Rust-owned buffers
→ cleanup or retain workspace
→ emit privacy report
### Active chat mode
A chat session creates a long-lived runtime:
start active session
→ launch llama-server once
→ send multiple messages through same runtime
→ keep chat context in memory until session end
→ end session explicitly
→ shutdown runtime
→ zeroize Rust-owned chat history
→ scan artifacts
→ cleanup or retain workspace
→ emit privacy report
Active chat uses:
- model-aware prompt templates
- bounded recent-context management
- audit visibility when older turns are dropped from the prompt window
- optional bound corpus retrieval on every turn
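Bounded recent-context management can be pictured with a minimal sketch like the one below. It keeps the newest turns that fit both a token budget and a turn limit, and reports how many older turns were dropped so the count can be surfaced in an audit event. The token estimate (`len(text) // 4 + 1`) and the function name are illustrative assumptions, not NullContext's implementation.

```python
def select_recent_turns(turns, token_budget, turn_limit):
    """Pick the most recent turns that fit both limits.

    Returns (kept_turns, dropped_count). Token cost is approximated as
    len(text) // 4 + 1, which is an illustrative assumption.
    """
    if token_budget <= 0 or turn_limit <= 0:
        raise ValueError("token_budget and turn_limit must both be greater than 0")

    kept = []
    used = 0
    # Walk backwards from the newest turn so older turns are dropped first.
    for turn in reversed(turns):
        cost = len(turn) // 4 + 1
        if len(kept) >= turn_limit or used + cost > token_budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept, len(turns) - len(kept)


history = ["a" * 40, "b" * 40, "c" * 40]
kept, dropped = select_recent_turns(history, token_budget=25, turn_limit=12)
```

The dropped count is what would feed the audit visibility mentioned above: a non-zero value means the prompt window no longer contains the full conversation.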
## Current API
The local API server currently exposes:
### Health
GET /api/health
### Run Session
POST /api/run
Runs a non-streaming one-shot session and returns collected stdout/stderr.
### Stream Run Session
POST /api/run/stream
Example body:
{
"prompt": "Explain secure local inference.",
"mode": "secure",
"persistent": false,
"model_id": "",
"corpus_id": "",
"chat_template": "auto",
"chat_context_token_budget": 2048,
"chat_context_turn_limit": 12
}
When `corpus_id` is present, `/api/run/stream` retrieves local corpus context first and injects a grounded prompt wrapper before inference.
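For callers that prefer a script over `curl`, a request for this endpoint could be assembled as in the sketch below. It only builds the request object and does not send it. The helper name and the corpus ID `my-corpus` are hypothetical; the URL, method, and body fields come from this README.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:3333"  # default address from this README


def build_stream_request(prompt, mode="secure", persistent=False, corpus_id=""):
    """Build (but do not send) a POST request for /api/run/stream."""
    body = {"prompt": prompt, "mode": mode, "persistent": persistent}
    if corpus_id:
        # Only include corpus_id when a grounded run is wanted.
        body["corpus_id"] = corpus_id
    return urllib.request.Request(
        BASE_URL + "/api/run/stream",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "my-corpus" is a made-up corpus ID used only for this example.
req = build_stream_request("Explain secure local inference.", corpus_id="my-corpus")
```

With the server running, passing `req` to `urllib.request.urlopen` and iterating over the response yields the streamed lines.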
### Corpus Registry
GET /api/corpora
POST /api/corpora
GET /api/corpora/:corpus_id/report
POST /api/corpora/:corpus_id/query
POST /api/corpora/:corpus_id/retention
POST /api/corpora/:corpus_id/cleanup
POST /api/corpora/:corpus_id/reconcile
Example ingest body:
{
"name": "incident-briefing",
"paths": [
"/Users/you/docs/briefing.pdf",
"/Users/you/docs/notes"
],
"persistent": true,
"ocr_enabled": true
}
### Model Registry
GET /api/models
### Start Active Chat Session
POST /api/chat/start
Example body:
{
"mode": "secure",
"persistent": false,
"model_id": "",
"corpus_id": "",
"chat_template": "auto",
"chat_context_token_budget": 2048,
"chat_context_turn_limit": 12
}
### Active Chat Status
GET /api/chat/:session_id/status
### Stream Active Chat Message
POST /api/chat/:session_id/message/stream
Example body:
{
"prompt": "Explain secure local inference in 2 short bullet points."
}
### Active Chat Template And Context Fields
- `model_id`: selects a registered model by ID
- `corpus_id`: binds a registered local corpus by ID for grounded retrieval
- `chat_template`: one of `auto`, `generic`, `chatml`, `llama3-instruct`
- `chat_context_token_budget`: approximate token budget for recent active-chat context selection
- `chat_context_turn_limit`: maximum number of recent prior turns to include in active-chat context
When `chat_template` is `auto`, NullContext resolves a template from the selected model path.
If the UI is using model defaults, it omits these override fields and lets the selected model drive the effective template and context settings.
If `corpus_id` is provided when starting active chat, NullContext binds that corpus for retrieval on every subsequent turn until the session ends.
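The `auto` resolution step might look roughly like the sketch below. The substring rules are assumptions chosen to match the example model names in this README; the actual detection logic may differ.

```python
import os

KNOWN_TEMPLATES = ("auto", "generic", "chatml", "llama3-instruct")


def resolve_template(chat_template, model_path):
    """Resolve the effective template; `auto` guesses from the model file name.

    The substring rules below are illustrative assumptions, not
    NullContext's actual detection logic.
    """
    if chat_template not in KNOWN_TEMPLATES:
        raise ValueError(f"unknown chat_template: {chat_template!r}")
    if chat_template != "auto":
        return chat_template

    # Normalize Windows separators so basename works on any host OS.
    name = os.path.basename(model_path.replace("\\", "/")).lower()
    if "llama-3" in name or "llama3" in name:
        return "llama3-instruct"
    if "qwen" in name:
        return "chatml"
    return "generic"
```

An explicit template always wins in this sketch; the file name is only consulted when the caller asks for `auto`.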
### End Active Chat Session
POST /api/chat/:session_id/end
### Cancel Active Chat Generation
POST /api/chat/:session_id/cancel
### List Sessions
GET /api/sessions
### Update Session Retention Policy
POST /api/sessions/:session_id/retention
### Cleanup Retained Session
POST /api/sessions/:session_id/cleanup
### Reconcile Session Lifecycle State
POST /api/sessions/:session_id/reconcile
### Show Report
GET /api/reports/:session_id
### Streaming Event Types
Streaming endpoints emit SSE-style `data:` blocks containing JSON events. Current event types include:
- `runtime`
- `audit`
- `model`
- `report`
- `stderr`
- `error`
- `complete`
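A client can decode these blocks with a small parser such as the sketch below. It follows the general SSE convention of joining consecutive `data:` lines and ending a block on a blank line. The JSON key names in the sample (`type`, `message`) are assumptions; the exact event schema is not specified here.

```python
import json


def parse_sse_events(lines):
    """Yield decoded JSON events from an iterable of SSE text lines.

    Consecutive `data:` lines are joined with newlines and decoded when a
    blank line ends the block, following the general SSE convention.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            payload = line[len("data:"):]
            if payload.startswith(" "):
                payload = payload[1:]  # drop the single optional space
            buffer.append(payload)
        elif line == "" and buffer:
            yield json.loads("\n".join(buffer))
            buffer = []
    if buffer:  # stream ended without a trailing blank line
        yield json.loads("\n".join(buffer))


# Made-up events; the "type" and "message" keys are assumed names.
sample = [
    'data: {"type": "runtime", "message": "launching"}\n',
    "\n",
    'data: {"type": "complete"}\n',
    "\n",
]
events = list(parse_sse_events(sample))
```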
## Current Limitations
NullContext does not currently guarantee:
- VRAM sanitization
- llama.cpp internal allocator sanitization
- OS swap sanitization
- shell history sanitization
- cross-process memory sanitization
- CUDA memory sanitization
- forensic memory clearing outside Rust-owned buffers
- perfect PDF layout reconstruction
- OCR accuracy for every scanned or image-only PDF
Active chat also keeps a long-lived llama.cpp runtime and in-memory context alive until the user explicitly ends the session.
Corpus ingestion can recover text from many PDFs, including scanned pages via OCR, but complex layouts, tables, and poor scans may still extract imperfectly.
NullContext now performs best-effort llama runtime inspection, including shutdown-path reporting, live RAM/VRAM observation, post-shutdown verification, macOS `vmmap` inspection when available, and Windows/NVIDIA fallback observation paths. These inspections improve visibility, but they are not proof of allocator zeroization or full RAM/VRAM sanitization.
On Windows in particular, PowerShell process metrics and NVIDIA tooling can still be incomplete or driver-mode dependent, especially for per-process VRAM visibility under WDDM.
The privacy reports intentionally expose these residual risks.
## Development Setup
### Requirements
### Windows
- Rust
- Node.js
- pnpm
- Visual Studio Build Tools
- CUDA Toolkit
- llama.cpp
- local GGUF model
### macOS
- Rust
- Node.js
- pnpm
- Xcode Command Line Tools
- llama.cpp
- local GGUF model
## llama.cpp Setup
Clone:
git clone https://github.com/ggml-org/llama.cpp
### Windows CUDA Build
From:
x64 Native Tools Command Prompt for VS
Run:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Expected binaries:
build/bin/Release/llama-server.exe
build/bin/Release/llama-cli.exe
### macOS Build
cmake -B build
cmake --build build --config Release
## Configuration
Configuration file:
~/.nullcontext/config.toml
Example model-registry config:
llama_path = "C:\\dev\\llama.cpp\\build\\bin\\Release\\llama-server.exe"
default_model = "qwen-small"
default_mode = "secure"
max_tokens = 128
gpu_layers = 999
chat_template = "auto"
chat_context_token_budget = 2048
chat_context_turn_limit = 12
[[models]]
id = "qwen-small"
name = "Qwen 2.5 0.5B Instruct"
model_path = "C:\\models\\qwen2.5-0.5b\\qwen2.5-0.5b-instruct-q4_k_m.gguf"
max_tokens = 128
gpu_layers = 999
chat_template = "chatml"
chat_context_token_budget = 2048
chat_context_turn_limit = 12
[[models]]
id = "llama3-8b"
name = "Llama 3 8B Instruct"
model_path = "C:\\models\\llama3-8b\\meta-llama-3-8b-instruct-q4_k_m.gguf"
max_tokens = 256
gpu_layers = 999
chat_template = "llama3-instruct"
chat_context_token_budget = 3072
chat_context_turn_limit = 16
### Notes
`gpu_layers = 999` means: offload as many layers as possible onto the GPU.
Additional active chat options:
chat_template = "auto"
chat_context_token_budget = 2048
chat_context_turn_limit = 12
Template options:
- `auto`
- `generic`
- `chatml`
- `llama3-instruct`
`chat_context_token_budget` and `chat_context_turn_limit` must both be greater than `0`.
Legacy single-model configs using only `model_path` are still supported. When no `[[models]]` array is present, NullContext synthesizes a default model entry automatically.
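The legacy fallback and the greater-than-zero rule can be sketched as follows, operating on an already parsed config dictionary (for example from `tomllib.loads` on Python 3.11 or later). The synthesized ID `default` and the inheritance of top-level settings by models that omit them are assumptions for illustration.

```python
def resolve_models(config):
    """Return the list of model entries for a parsed config dict.

    If no `models` array is present, synthesize one entry from the legacy
    top-level `model_path`. The synthesized id "default" is an assumption.
    """
    inherited = ("max_tokens", "gpu_layers", "chat_template",
                 "chat_context_token_budget", "chat_context_turn_limit")
    models = config.get("models")
    if not models:
        if "model_path" not in config:
            raise ValueError("config has neither [[models]] nor model_path")
        models = [{"id": "default", "name": "default",
                   "model_path": config["model_path"]}]

    resolved = []
    for model in models:
        entry = dict(model)
        for key in inherited:
            # Assumed behavior: fall back to the top-level value when omitted.
            if key not in entry and key in config:
                entry[key] = config[key]
        for key in ("chat_context_token_budget", "chat_context_turn_limit"):
            if key in entry and entry[key] <= 0:
                raise ValueError(f"{key} must be greater than 0")
        resolved.append(entry)
    return resolved
```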
### Workspace Paths
NullContext session workspaces are created under the system temporary directory, in a `nullcontext` subdirectory.
Typical examples:
macOS/Linux: $TMPDIR/nullcontext or /tmp/nullcontext
Windows: %TEMP%\nullcontext
The exact path is determined at runtime using Rust's `std::env::temp_dir()`.
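To locate the workspace root from a script, a close approximation is shown below. Python's `tempfile.gettempdir()` consults similar environment variables to Rust's `std::env::temp_dir()`, but the two can disagree in edge cases, so treat the result as a hint rather than the authoritative path.

```python
import tempfile
from pathlib import Path


def workspace_root():
    """Approximate the session workspace root: <system temp dir>/nullcontext."""
    return Path(tempfile.gettempdir()) / "nullcontext"


print(workspace_root())
```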
## Backend Runtime
Build the Rust runtime:
cargo build
Run directly:
echo "Explain secure local inference." | cargo run -- --stdin
Persistent session example:
echo "Explain persistent audit trails." | cargo run -- --mode standard --persistent --stdin
## Local API Server
Start the local API server:
cargo run -- serve
Default address:
http://127.0.0.1:3333
Health check:
http://127.0.0.1:3333/api/health
Streaming one-shot example:
curl -N -X POST http://127.0.0.1:3333/api/run/stream \
-H "Content-Type: application/json" \
-d '{"prompt":"Explain secure local inference in 2 short bullet points.","mode":"secure","persistent":false}'
Active chat example:
curl -X POST http://127.0.0.1:3333/api/chat/start \
-H "Content-Type: application/json" \
-d '{"mode":"secure","persistent":false,"model_id":"","corpus_id":""}'
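Once a session has started, the remaining active-chat calls are all scoped to its session ID. The sketch below maps each action to its method and URL as listed in the API section. The helper name is hypothetical, and the session ID has to come from the `/api/chat/start` response, whose exact shape is not documented here.

```python
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:3333"  # default address from this README

# Per-session actions and their HTTP methods, as listed in the API section.
CHAT_ACTIONS = {
    "status": ("GET", "status"),
    "message": ("POST", "message/stream"),
    "cancel": ("POST", "cancel"),
    "end": ("POST", "end"),
}


def chat_endpoint(session_id, action):
    """Return (method, url) for an active-chat action on one session."""
    method, suffix = CHAT_ACTIONS[action]
    return method, f"{BASE_URL}/api/chat/{quote(session_id, safe='')}/{suffix}"


# "abc123" is a made-up session ID used only for this example.
print(chat_endpoint("abc123", "message"))
```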
For Windows/NVIDIA validation, capture the live `llama-server` PID during an active chat session and compare NullContext's report with host tooling:
Get-Process -Id <PID>
Get-CimInstance Win32_Process -Filter "ProcessId = <PID>"
nvidia-smi
nvidia-smi --query-compute-apps=pid,used_gpu_memory --format=csv,noheader,nounits
nvidia-smi pmon -c 1
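When comparing against NullContext's report, the output of the `--query-compute-apps` command above can be parsed with a sketch like this one. It expects one `pid, used_gpu_memory` row per line, with the memory value assumed to be in MiB. Non-numeric values such as `[N/A]` are kept as `None` so missing per-process data stays visible. The sample rows are made up.

```python
def vram_by_pid(csv_text):
    """Parse `pid, used_gpu_memory` rows into {pid: MiB or None}.

    A value that is not an integer (for example `[N/A]`) is recorded as
    None so that missing per-process data stays visible.
    """
    usage = {}
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid = int(parts[0])
        usage[pid] = int(parts[1]) if parts[1].isdigit() else None
    return usage


# Made-up rows in the shape produced with --format=csv,noheader,nounits.
sample = "4242, 1536\n5150, [N/A]\n"
print(vram_by_pid(sample))
```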
## Web UI
From:
apps/web
Install dependencies:
pnpm install
Run development server:
pnpm dev
Default UI address:
http://localhost:5173
The active chat session config panel lets you:
- browse the registered model catalog
- browse the registered corpus catalog
- pick a model by ID/name before starting a session
- select a local corpus for grounded one-shot runs or for the next active chat session
- use per-model defaults or manual overrides for template/context settings
- choose a prompt template or auto-detect it from the model path
- set a bounded recent-context token budget
- set a bounded recent-context turn limit
The corpus browser also lets you:
- ingest corpora from absolute local file and directory paths
- ingest corpora from browser-selected local files with drag/drop
- ingest grounding files directly from the chat composer `+` menu
- inspect corpus lifecycle state and retained artifact paths
- load retained corpus reports
- run corpus reconcile, cleanup, and retention actions
The model browser also shows:
- whether each model file path is launchable
- whether the configured `llama-server` path is ready
- the exact model path, template default, token limit, GPU setting, and context defaults
After a session starts, the runtime banner shows the selected model, any bound corpus, the resolved template, and the active context policy.
## Session Commands
List persistent sessions:
cargo run -- --list-sessions
Show report:
cargo run -- --show-report <SESSION_ID>
## Current Development Focus
The current development focus is:
- structured runtime streaming
- Server-Sent Events
- local corpus ingestion and retrieval lifecycle management
- streaming token output
- streaming audit events
- stronger memory hygiene primitives
- VRAM inspection and analysis
- llama runtime RAM/VRAM inspection and evidence-driven cleanup reporting
- Windows/NVIDIA runtime inspection validation
- forensic artifact visibility
- Linux-native low-level memory work
## Project Status
NullContext is currently in active early-stage development.
The project is functional and supports:
- local inference
- local browser UI
- local API execution
- one-shot streaming
- one-shot grounded retrieval
- active chat sessions
- active-chat grounded retrieval
- generation stop control
- explicit active chat cancellation
- persistent sessions
- lifecycle policy engine
- structured model registry and model switching
- txt/md/pdf corpus ingestion with hybrid OCR extraction
- browser-native corpus file uploads
- chatbot-style composer uploads for grounding corpora
- corpus lifecycle controls
- artifact tracking
- cleanup reporting
- audit visualization
- llama runtime inspection reports
However, the project should not yet be considered a hardened secure inference environment.
The current focus is building transparent runtime visibility and explicit lifecycle controls before attempting stronger low-level memory guarantees.