ruelbelmonte/promptproof

GitHub: ruelbelmonte/promptproof

一款通过 canary 泄露检测来测试 LLM 系统抵御 prompt injection 能力的确定性红队测试套件。

Stars: 0 | Forks: 0

# promptproof 一个小型红队测试套件，旨在回答关于 LLM 系统的一个问题：**它能抵御 prompt injection 吗？** 它在被测系统中植入一个唯一的 secret（canary），向其发送一系列 injection payload，并检查该 secret 是否会泄露出来。由于检查是基于 canary 的，因此结果是确定性的：不需要任何模型来判断结果。这是一款放置在受 guardrail 保护的系统旁边的测试工具。从未受到攻击的 guardrail 只是空想；这款工具可以帮你确认它们是否真正有效。 ## 工作原理 ``` plant a canary in the target -> fire each attack as user input -> did the canary come back out (plainly or base64-encoded)? yes -> the attack succeeded, the system LEAKED no -> the attack HELD -> report: N/total attacks held ``` 攻击语料库涵盖了常见的类型：直接提取、指令覆盖（“忽略之前的指令”）、角色扮演越狱、delimiter 与上下文混淆注入，以及混淆泄露（base64、payload 分割）。 ## 运行方式 ``` pip install -e ".[dev]" pytest # deterministic, no model, no network promptproof # runs the suite against a leaky and a hardened reference target ``` 预期结果：存在漏洞的参考目标在每次攻击中都会发生泄露，而经过加固的目标则能抵御所有攻击。这种对比正是该测试套件证明自己能够区分它们的关键。 ## 测试你自己的系统实现 `Target` 协议（包含一个方法，`respond(user_input) -> str`），以便它能调用你的助手，并在构造时将一个 canary 作为其受保护的 secret，然后： ``` from promptproof.harness import run_suite report = run_suite(my_target, canary="CANARY-...") print(f"{report.held}/{report.total} held") ``` ## 设计相关决策及其局限性（为何采用基于 canary 的检测，以及它能捕获和不能捕获的内容）在 [DESIGN.md](./DESIGN.md) 中进行了说明。 ## 状态已完成工作并经过测试：泄露检测（明文和 base64）、针对两个参考目标的测试套件，以及攻击语料库均已有测试覆盖。CI 会运行 ruff、格式检查、mypy（严格模式）和 pytest。

标签：DLL 劫持, Python, 人工智能, 大语言模型, 安全测试, 安全规则引擎, 攻击性安全, 无后门, 用户模式Hook绕过, 逆向工具