brandy-savage/hot-potato

GitHub: brandy-savage/hot-potato

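A prompt-injection defense framework for AI agents, built on taint tracking and a capability firewall, that prevents untrusted content from triggering unauthorized actions.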

Stars: 4 | Forks: 0

# hot-potato

Capability-safe agent orchestration: prevent untrusted content from causing privilege escalation in AI agents.

## The problem

Any AI agent that browses the web, reads files, or uses RAG is one malicious document away from leaking credentials, writing to the filesystem, or executing chained tool calls the operator never intended. Content reaches the model; the model calls tools; tools cause real-world consequences. This trust boundary is blurry by design.

Hot-potato enforces that boundary explicitly:

- Every untrusted artifact gets a **taint label** (source, trust level, lineage)
- **Detectors** scan for injection signals before any model sees the content
- A **capability firewall** intercepts every tool call and evaluates it against a YAML policy
- A **trust graph** tracks which external URL led to which tool execution
- A **replay engine** benchmarks coverage across all attack categories without Docker

## Architecture

```
Untrusted content (URL / file / RAG / tool output)
        │
        ▼
TaintedArtifact ─── TrustLevel: UNTRUSTED / SEMI_TRUSTED / TRUSTED / SYSTEM
        │           lineage, content_hash, taint_tags
        ▼
DetectorPipeline
  ├── StaticDetector      (regex, homoglyphs, encodings; fast, no Docker)
  └── BehavioralDetector  (instruction-flow, authority-shift, priv-esc)
        │
        ▼  taint_tags annotated
CapabilityFirewall ─── PolicyEngine (YAML rules, first-match, dry-run mode)
        │
        │  Outcomes: allow / deny / redact / require_human_review /
        │            sandbox_only / shadow_execute
        ▼
Tool execution (or block)
        │
        ▼
TrustGraph ─── DAG: which source caused which tool call
TelemetrySession ─── structured audit log, exportable JSON/JSONL
```

## Quick start

```
# Screening (backward compatible)
from hot_potato import safe_fetch

result = safe_fetch("https://example.com")
if result.clean:
    pass_to_real_ai(result.safe_content)
else:
    print(f"Injection detected: {result.artifact['taint']['taint_tags']}")
```

## Agent integration

```
from hot_potato.core.taint import TaintedArtifact, TrustLevel
from hot_potato.core.capabilities import CapabilityFirewall, CapabilityRequest
from hot_potato.detectors import DetectorPipeline

# 1. Taint the artifact as it enters the pipeline
artifact = TaintedArtifact(content=content, source=url, trust_level=TrustLevel.UNTRUSTED)

# 2. Run the detectors, which annotate taint_tags
artifact = DetectorPipeline.default().run(artifact)

# 3. Before ANY tool call, check the firewall
firewall = CapabilityFirewall()
request = CapabilityRequest(
    tool_name="send_http",
    args={"url": "https://api.example.com", "data": payload},
    tainted_inputs=[artifact],
)
decision = firewall.evaluate(request)
if decision.is_blocked:
    raise RuntimeError(f"Blocked: {decision.reason}")
```

## Batch screening with ArtifactSwarm

```
from hot_potato import ArtifactSwarm
from hot_potato.core.taint import TaintedArtifact, TrustLevel

swarm = ArtifactSwarm(workers=8)
artifacts = [
    TaintedArtifact(content=c, source=url, trust_level=TrustLevel.UNTRUSTED)
    for url, c in urls_and_contents
]
jobs = swarm.submit_many(artifacts)
for result in swarm.as_completed():
    if result.blocked:
        print(f"Blocked: {result.job_id} — {result.severity}")
```

## Policies

Policies live in `policies/default.yaml`. Rules are declarative and evaluated top-down; the first match wins.

```
rules:
  - id: block_exfil_untrusted
    match:
      tools: ["send_http", "send_email", "send_crypto"]
      trust_levels: [UNTRUSTED]
    outcome: deny
    reason: "Outbound network from UNTRUSTED content is exfiltration"

  - id: sandbox_writes
    match:
      tools: ["write_file", "write_memory"]
      trust_levels: [UNTRUSTED]
    outcome: sandbox_only

  - id: human_review_crypto
    match:
      tools: ["get_private_key", "send_crypto", "sign_transaction"]
      trust_levels: ["*"]
    outcome: require_human_review
```

Six outcomes: `allow` · `deny` · `redact` · `require_human_review` · `sandbox_only` · `shadow_execute`

## Trust levels

| Level | Use case |
|---|---|
| `UNTRUSTED` | External URLs, user-supplied files, RAG results, tool output |
| `SEMI_TRUSTED` | Internal APIs, cached content, output from TRUSTED processes |
| `TRUSTED` | The operator's own codebase, verified configuration |
| `SYSTEM` | The runtime itself; cannot be injected |

Trust never increases through derivation. Content derived from UNTRUSTED input stays UNTRUSTED, even after a trusted system has processed it.

## Framework integrations

```
# OpenAI tool-call loop
from integrations.openai_compat import GuardedToolExecutor

executor = GuardedToolExecutor(tools=my_tools, model="gpt-4o")
for tool_call in response.choices[0].message.tool_calls:
    result = executor.execute(tool_call, tainted_inputs=[artifact])

# MCP server
from integrations.mcp_guard import MCPGuard

guard = MCPGuard()
decision = guard.evaluate_mcp_call("read_file", {"path": "/etc"}, tainted_sources=[artifact])

# LangChain
from integrations.langchain_guard import GuardedTool, set_taint_context

set_taint_context([artifact])
guarded_tool = GuardedTool.wrap(my_langchain_tool)
```

## Behavioral sandbox backends

Hot-potato ships with two behavioral sandbox backends. The static and firewall layers work without either of them.

### Docker backend (default)

```
# One-time setup: pulls the model into a named volume
hot-potato-setup

# Usage (runs automatically on safe_fetch/scan_file and sandbox calls)
HP_BACKEND=docker hot-potato file:///path/to/file.txt
```

Requires the Docker daemon. Uses `--network none`, a 2 GB memory limit, and an overlay FS. Startup time: about 3–8 seconds.

### Native backend (bwrap, no daemon required)

```
# One-time setup: install bubblewrap and the AppArmor profile
apt install bubblewrap
sudo cp setup/apparmor_bwrap.profile /etc/apparmor.d/bwrap
sudo apparmor_parser -r /etc/apparmor.d/bwrap

# Usage
HP_BACKEND=native hot-potato file:///path/to/file.txt
```

Requires: `bwrap` (bubblewrap), Linux kernel 4.18+, and Ollama running on localhost. Startup time: about 200 ms.

Isolation layers:

| Layer | Docker | Native |
|---|---|---|
| Ephemeral FS | overlay2 | tmpfs root |
| Process isolation | cgroup + namespace | PID + user namespace |
| Network isolation | `--network none` | Application layer (fake handler) |
| Syscall filter | Docker default seccomp | Custom BPF (41 blocked + architecture check) |
| Capability dropping | Docker defaults | `CAP_DROP ALL` + NO_NEW_PRIVS |
| Requires root | Yes (daemon) | No |

For the full escape-vector analysis covering symlink traversal, ptrace, SUID, fork bombs, netlink, and kernel exploits, see `docs/native_sandbox.md`.

```
from hot_potato import native_sandbox_available

print(native_sandbox_available())  # True if bwrap is installed and userns enabled
```

## Benchmarks

```
# Fast (no Docker): static + behavioral + firewall layers
python3 benchmarks/run_benchmark.py

# Full sandbox (Docker)
python3 benchmarks/run_benchmark.py --sandbox --out results/bench.json

# Full sandbox (native)
HP_BACKEND=native python3 benchmarks/run_benchmark.py --sandbox --out results/bench_native.json
```

Results against 73 adversarial categories (SCANNER_VERSION 1.10.0):

| Layer | Detection rate | Notes |
|---|---|---|
| Static (regex) | 98.6% (72/73) | cat6 is intentionally signal-free; it requires the sandbox |
| Behavioral | – (Phase 2) | |
| Capability firewall | Policy-complete | All defined rules fire correctly; coverage depends on your policy |

## Head-to-head comparison with other tools

Same 73-category corpus, same known-good files ([full results](benchmarks/results/head_to_head.md)):

| Tool | Detection rate | Critical misses | False positives (known-good docs) | Mean latency |
|---|---|---|---|---|
| **hot-potato-static** | **98.6% (72/73)** | **0** | 5/5 ¹ | 35 ms |
| llm-guard-v2 | 13.7% (10/73) | 62 | 0/5 | 73 ms |
| rebuff-heuristic | 0.0% (0/73) | 72 | 0/5 | 87,745 ms |

¹ hot-potato's false positives on these 5 known-good files are security documents that naturally contain injection vocabulary (IR playbooks, API references, and so on). This is expected and correct behavior; for first-party content, assign `TrustLevel.TRUSTED`. See [Known limitations](#known-limitations).

False-positive rate on legitimate skill files: **~9%** (630 skills from skills.sh scanned; the false positives come mainly from those with `