brandy-savage/hot-potato

# hot-potato

Capability-safe agent orchestration: prevents untrusted content from causing capability escalation in AI agents.

## The problem

Hot-potato enforces it explicitly:

- Every untrusted artifact gets a **taint label** (source, trust level, lineage)
- **Detectors** scan for injection signals before any model sees the content
- A **capability firewall** intercepts all tool calls and evaluates them against a YAML policy
- A **trust graph** traces which external URL caused which tool execution
- A **replay engine** benchmarks coverage across all attack categories without Docker

## Architecture

```
Untrusted content (URL / file / RAG / tool output)
        │
        ▼
TaintedArtifact ─── TrustLevel: UNTRUSTED / SEMI_TRUSTED / TRUSTED / SYSTEM
        │            lineage, content_hash, taint_tags
        ▼
DetectorPipeline
  ├── StaticDetector     (regex, homoglyphs, encodings; fast, no Docker)
  └── BehavioralDetector (instruction-flow, authority-shift, priv-esc)
        │
        ▼  taint_tags annotated
CapabilityFirewall ─── PolicyEngine (YAML rules, first-match, dry-run mode)
        │
        │  Outcomes: allow / deny / redact / require_human_review /
        │            sandbox_only / shadow_execute
        ▼
Tool execution (or block)
        │
        ▼
TrustGraph ─── DAG: which source caused which tool call
TelemetrySession ─── structured audit log, exportable JSON/JSONL
```

## Quick start

```python
# Screening (backwards-compatible)
from hot_potato import safe_fetch

result = safe_fetch("https://example.com")
if result.clean:
    pass_to_real_ai(result.safe_content)
else:
    print(f"Injection detected: {result.artifact['taint']['taint_tags']}")
```

## Agent integration

```python
from hot_potato.core.taint import TaintedArtifact, TrustLevel
from hot_potato.core.capabilities import CapabilityFirewall, CapabilityRequest
from hot_potato.detectors import DetectorPipeline

# 1. Taint the artifact when it enters the pipeline
artifact = TaintedArtifact(content=content, source=url, trust_level=TrustLevel.UNTRUSTED)

# 2. Run detectors (annotates taint_tags)
artifact = DetectorPipeline.default().run(artifact)

# 3. Before ANY tool call, check the firewall
firewall = CapabilityFirewall()
request = CapabilityRequest(
    tool_name="send_http",
    args={"url": "https://api.example.com", "data": payload},
    tainted_inputs=[artifact],
)
decision = firewall.evaluate(request)
if decision.is_blocked:
    raise RuntimeError(f"Blocked: {decision.reason}")
```

## Batch screening with ArtifactSwarm

```python
from hot_potato import ArtifactSwarm
from hot_potato.core.taint import TaintedArtifact, TrustLevel

swarm = ArtifactSwarm(workers=8)
artifacts = [
    TaintedArtifact(content=c, source=url, trust_level=TrustLevel.UNTRUSTED)
    for url, c in urls_and_contents
]
jobs = swarm.submit_many(artifacts)
for result in swarm.as_completed():
    if result.blocked:
        print(f"Blocked: {result.job_id} - {result.severity}")
```

## Policy

Policies live in `policies/default.yaml`. Rules are declarative and evaluated top-down; first match wins.

```yaml
rules:
  - id: block_exfil_untrusted
    match:
      tools: ["send_http", "send_email", "send_crypto"]
      trust_levels: [UNTRUSTED]
    outcome: deny
    reason: "Outbound network from UNTRUSTED content is exfiltration"

  - id: sandbox_writes
    match:
      tools: ["write_file", "write_memory"]
      trust_levels: [UNTRUSTED]
    outcome: sandbox_only

  - id: human_review_crypto
    match:
      tools: ["get_private_key", "send_crypto", "sign_transaction"]
      trust_levels: ["*"]
    outcome: require_human_review
```

Six outcomes: `allow` · `deny` · `redact` · `require_human_review` · `sandbox_only` · `shadow_execute`

## Trust levels

| Level | Use case |
|---|---|
| `UNTRUSTED` | External URLs, user-supplied files, RAG results, tool outputs |
| `SEMI_TRUSTED` | Internal APIs, cached content, outputs from TRUSTED processes |
| `TRUSTED` | Operator's own codebase, verified configuration |
| `SYSTEM` | Runtime itself (no injection possible) |

Trust never increases through derivation. Content derived from UNTRUSTED input stays UNTRUSTED even if processed by a trusted system.
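The two rules stated above, first-match policy evaluation and trust that never increases through derivation, can be illustrated with a small standalone sketch. It does not use hot-potato's own classes: the `TrustLevel` ordering, the `Rule` dataclass, `derived_trust`, `evaluate`, the fail-closed default for empty inputs, and the `allow` fallback when no rule matches are all assumptions made for illustration. Only the rule ids, tool names, and outcomes are taken from the policy example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import FrozenSet, Optional, Sequence, Tuple


class TrustLevel(IntEnum):
    """Stand-in for hot-potato's TrustLevel; the numeric ordering is assumed."""
    UNTRUSTED = 0
    SEMI_TRUSTED = 1
    TRUSTED = 2
    SYSTEM = 3


def derived_trust(inputs: Sequence[TrustLevel]) -> TrustLevel:
    """Derived content is only as trusted as its least-trusted input.

    With no inputs this sketch fails closed to UNTRUSTED (an assumption).
    """
    return min(inputs, default=TrustLevel.UNTRUSTED)


@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory form of one YAML rule."""
    id: str
    tools: FrozenSet[str]
    trust_levels: FrozenSet[str]  # level names, or {"*"} for any level
    outcome: str


# Same order as the YAML example: order matters because first match wins.
RULES = [
    Rule("block_exfil_untrusted",
         frozenset({"send_http", "send_email", "send_crypto"}),
         frozenset({"UNTRUSTED"}), "deny"),
    Rule("sandbox_writes",
         frozenset({"write_file", "write_memory"}),
         frozenset({"UNTRUSTED"}), "sandbox_only"),
    Rule("human_review_crypto",
         frozenset({"get_private_key", "send_crypto", "sign_transaction"}),
         frozenset({"*"}), "require_human_review"),
]


def evaluate(tool: str, inputs: Sequence[TrustLevel],
             rules: Sequence[Rule] = RULES,
             default: str = "allow") -> Tuple[str, Optional[str]]:
    """Return (outcome, rule_id) for the first rule that matches."""
    level = derived_trust(inputs)
    for rule in rules:
        level_ok = "*" in rule.trust_levels or level.name in rule.trust_levels
        if tool in rule.tools and level_ok:
            return rule.outcome, rule.id
    return default, None  # assumed fallback; the real default may differ


if __name__ == "__main__":
    mixed = [TrustLevel.TRUSTED, TrustLevel.UNTRUSTED]
    print(derived_trust(mixed).name)
    print(evaluate("send_crypto", mixed))
    print(evaluate("send_crypto", [TrustLevel.TRUSTED]))
```

In this sketch `send_crypto` appears in two rules, so rule order decides the outcome: with any UNTRUSTED input the earlier `block_exfil_untrusted` rule matches first, and only fully trusted inputs fall through to `human_review_crypto`.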
## Framework integrations

```python
# OpenAI tool-call loop
from integrations.openai_compat import GuardedToolExecutor

executor = GuardedToolExecutor(tools=my_tools, model="gpt-4o")
for tool_call in response.choices[0].message.tool_calls:
    result = executor.execute(tool_call, tainted_inputs=[artifact])

# MCP server
from integrations.mcp_guard import MCPGuard

guard = MCPGuard()
decision = guard.evaluate_mcp_call("read_file", {"path": "/etc"}, tainted_sources=[artifact])

# LangChain
from integrations.langchain_guard import GuardedTool, set_taint_context

set_taint_context([artifact])
guarded_tool = GuardedTool.wrap(my_langchain_tool)
```

## Behavioral sandbox backends

Hot-potato ships two behavioral sandbox backends. The static + firewall layers work without either.

### Docker backend (default)

```bash
# One-time setup: pull model into named volume
hot-potato-setup

# Use (automatic when calling safe_fetch/scan_file with sandbox)
HP_BACKEND=docker hot-potato file:///path/to/file.txt
```

Requires Docker daemon. Uses `--network none`, 2 GB memory cap, overlay FS. Startup: ~3–8 s.

### Native backend (bwrap, no daemon required)

```bash
# One-time setup: install bubblewrap + AppArmor profile
apt install bubblewrap
sudo cp setup/apparmor_bwrap.profile /etc/apparmor.d/bwrap
sudo apparmor_parser -r /etc/apparmor.d/bwrap

# Use
HP_BACKEND=native hot-potato file:///path/to/file.txt
```

Requires: `bwrap` (bubblewrap), Linux kernel 4.18+, Ollama running on localhost. Startup: ~200 ms.
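Both backends are selected through the `HP_BACKEND` environment variable. For a wrapper script that has to decide which value to export, a minimal sketch might look like the following. `pick_backend` is a hypothetical helper, not part of hot-potato; it only checks whether the `docker` and `bwrap` executables are on `PATH`, which does not prove that the Docker daemon is running or that user namespaces are enabled.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def pick_backend(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return "docker", "native", or None (static + firewall layers only).

    Hypothetical helper: the probing order is an assumption that mirrors
    the README's statement that Docker is the default backend.
    """
    requested = env.get("HP_BACKEND")
    if requested in ("docker", "native"):
        return requested  # an explicit setting always wins
    if which("docker"):
        return "docker"   # CLI found; the daemon may still be down
    if which("bwrap"):
        return "native"   # bubblewrap found; userns is not checked here
    return None           # no sandbox backend detected


if __name__ == "__main__":
    print(pick_backend())
```

A caller could export the returned value as `HP_BACKEND` before invoking `hot-potato`, and skip sandboxed scans when the result is `None`.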
Isolation layers:

| Layer | Docker | Native |
|---|---|---|
| Disposable FS | overlay2 | tmpfs root |
| Process isolation | cgroup + namespace | PID + user namespace |
| Network isolation | `--network none` | App-layer (fake handlers) |
| Syscall filter | Docker default seccomp | Custom BPF (41 blocked + arch check) |
| Capability drop | Docker defaults | `CAP_DROP ALL` + NO_NEW_PRIVS |
| Root required | Yes (daemon) | No |

See `docs/native_sandbox.md` for the full escape vector analysis covering symlink traversal, ptrace, SUID, fork bombs, netlink, and kernel exploits.

```python
from hot_potato import native_sandbox_available

print(native_sandbox_available())  # True if bwrap is installed and userns enabled
```

## Benchmarking

```bash
# Fast (no Docker): static + behavioral + firewall layers
python3 benchmarks/run_benchmark.py

# Full sandbox (Docker)
python3 benchmarks/run_benchmark.py --sandbox --out results/bench.json

# Full sandbox (native)
HP_BACKEND=native python3 benchmarks/run_benchmark.py --sandbox --out results/bench_native.json
```

Results against 73 adversarial categories (SCANNER_VERSION 1.10.0):

| Layer | Detection rate | Notes |
|---|---|---|
| Static (regex) | 98.6% (72/73) | cat6 intentionally signal-free; requires sandbox |
| Behavioral | N/A (Phase 2) | |
| Capability firewall | policy-complete | All defined rules fire correctly; coverage depends on your policy |

## Head-to-head vs other tools

Same 73-category corpus, same known-good files ([full results](benchmarks/results/head_to_head.md)):

| Tool | Detection rate | Critical FNs | FPs (known-good docs) | Avg latency |
|---|---|---|---|---|
| **hot-potato-static** | **98.6% (72/73)** | **0** | 5/5 ¹ | 35 ms |
| llm-guard-v2 | 13.7% (10/73) | 62 | 0/5 | 73 ms |
| rebuff-heuristic | 0.0% (0/73) | 72 | 0/5 | 87,745 ms |

¹ hot-potato FPs on the 5 known-good files are security documents that naturally contain injection vocabulary (IR playbooks, API references, etc.).
This is expected and correct; assign `TrustLevel.TRUSTED` for first-party content. See [Known limitations](#known-limitations).

False positive rate on legitimate skill files: **~9%** (630 skills scanned from skills.sh; FPs are code examples with `