ppcvote/prompt-defense-audit-py

GitHub: ppcvote/prompt-defense-audit-py

A deterministic regex scanner for dangerous payloads in LLM responses. It maps to OWASP LLM02 and acts as a safety gate before output reaches downstream systems.

# prompt-defense-audit (Python)

[![PyPI](https://img.shields.io/pypi/v/prompt-defense-audit.svg)](https://pypi.org/project/prompt-defense-audit/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](https://opensource.org/licenses/MIT)

A deterministic regex scanner that detects dangerous payloads in LLM responses **before** they reach downstream systems (HTML renderers, databases, shells, APIs). It maps to **OWASP LLM02: Insecure Output Handling**.

This is the Python port of the [npm package of the same name](https://www.npmjs.com/package/prompt-defense-audit). The output scanner is **byte-for-byte identical** to the original: same rules, same matches, same dedup window, same risk-level thresholds, same summary strings. A parity test suite (`tests/test_parity.py`) keeps the two implementations in agreement.

## Why this project exists

Once LLM output is handed to a browser, a database, a shell, or another agent, the safety measures the model learned in training no longer protect you. Static output scanning is a deterministic gate that runs in under 5 ms and sits between the model and the dangerous sink.

- **No LLM calls.** Pure regex, fully deterministic.
- **No dependencies.** Standard library only.
- **22 threat rules** across 7 categories: XSS, SQL injection, shell command injection, path traversal, credential leakage, Markdown injection, code injection.
- **Risk-level escalation** from `safe` → `low` → `medium` → `high` → `critical`.
- **Parity tested** against the TypeScript reference implementation.

## Installation

```bash
pip install prompt-defense-audit
```

## Quick start

```python
from prompt_defense_audit import scan_output

# An LLM-generated response that contains a script tag:
output = 'Here is the greeting: '
result = scan_output(output)

print(result.safe)        # False
print(result.risk_level)  # 'critical'
print(result.summary)     # 'Found 1 threat(s): 1 critical, 0 high. Do NOT pass this output...'

for t in result.threats:
    print(f" [{t.severity}] {t.id}: {t.match!r} at position {t.position}")
```

Output:

```
False
critical
Found 1 threat(s): 1 critical, 0 high. Do NOT pass this output to downstream systems without sanitization.
 [critical] xss-script-tag: '' at position 21
```

## Using it as middleware

The scanner is most effective **between the LLM and the downstream sink**: a lightweight guard that fails closed on severe threats and logs medium-severity ones.

```python
from prompt_defense_audit import scan_output

def safe_render(llm_output: str) -> str:
    result = scan_output(llm_output)
    if result.risk_level in ("critical", "high"):
        raise ValueError(f"LLM output rejected: {result.summary}")
    return llm_output  # safe to forward
```

For MCP servers that ingest data from federated sources, where any upstream content may be maliciously crafted, pass **every outbound response** through `scan_output()` before returning it to the calling agent.

## Public API

```python
from prompt_defense_audit import scan_output, OutputScanResult, OutputThreat

result: OutputScanResult = scan_output("...")
# result.safe        : bool
# result.threats     : list[OutputThreat]
# result.risk_level  : Literal["safe", "low", "medium", "high", "critical"]
# result.summary     : str

# Each OutputThreat has:
#   .id        : str  (stable rule id, e.g. "xss-script-tag")
#   .name      : str  (human-readable rule name)
#   .severity  : Literal["critical", "high", "medium", "low"]
#   .match     : str  (matched payload, truncated to 100 characters)
#   .position  : int  (start index in the scanned string)
#   .context   : str  (±20-character window, newlines flattened)

# Both dataclasses provide .to_dict() for JSON serialization.
```

## Rule catalog (22 rules in 7 categories)

| Category | Rule IDs | Severity |
|---|---|---|
| **XSS** | `xss-script-tag`, `xss-event-handler`, `xss-javascript-uri`, `xss-data-uri-html`, `xss-iframe-srcdoc`, `xss-svg-script` | critical × 3, high × 3 |
| **SQL injection** | `sqli-destructive`, `sqli-union`, `sqli-comment-bypass` | critical × 1, high × 1, medium × 1 |
| **Shell command injection** | `shell-pipe-exec`, `shell-destructive`, `shell-reverse`, `shell-env-exfil` | critical × 3, high × 1 |
| **Path traversal** | `path-traversal` | high × 1 |
| **Credential leakage** | `credential-api-key`, `credential-private-key`, `credential-connection-string`, `credential-jwt` | critical × 3, high × 1 |
| **Markdown injection** | `markdown-link-injection`, `markdown-image-tracking` | high × 1, medium × 1 |
| **Code injection** | `code-eval`, `code-python-import` | high × 1, medium × 1 |

`rm -rf /tmp/...` is explicitly allowlisted by the destructive-shell rule, because cleaning up `/tmp/...` is a common and legitimate operation in tutorial output.

## Parity with the TypeScript reference

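Parity here means that the same input yields the same result from both implementations. As a rough illustration of what a fixture-driven check of that kind can look like, the sketch below compares a scanner's results against expected values recorded per fixture. The fixture format, the `find_parity_mismatches` helper, and the stand-in `toy_scan` function are all hypothetical and exist only to keep the sketch runnable; the project's own suite in `tests/test_parity.py` may be organized quite differently.

```python
import json
from typing import Callable

# Hypothetical fixture format: each entry pairs an input string with the
# result the reference implementation is assumed to have produced for it.
FIXTURES_JSON = """
[
  {"input": "Hello, world.",
   "expected": {"safe": true, "risk_level": "safe", "threat_ids": []}},
  {"input": "<script>alert(1)</script>",
   "expected": {"safe": false, "risk_level": "critical", "threat_ids": ["xss-script-tag"]}}
]
"""


def toy_scan(text: str) -> dict:
    """Stand-in scanner so the sketch runs without the package installed.

    A real check would call prompt_defense_audit.scan_output() and reduce
    its result to the same shape as the fixture's "expected" field.
    """
    if "<script" in text.lower():
        return {"safe": False, "risk_level": "critical", "threat_ids": ["xss-script-tag"]}
    return {"safe": True, "risk_level": "safe", "threat_ids": []}


def find_parity_mismatches(scan: Callable[[str], dict], fixtures: list[dict]) -> list[dict]:
    """Return one record for every fixture whose scan result differs from the expected one."""
    mismatches = []
    for fixture in fixtures:
        actual = scan(fixture["input"])
        if actual != fixture["expected"]:
            mismatches.append(
                {"input": fixture["input"], "expected": fixture["expected"], "actual": actual}
            )
    return mismatches


fixtures = json.loads(FIXTURES_JSON)
mismatches = find_parity_mismatches(toy_scan, fixtures)
print(f"{len(fixtures) - len(mismatches)}/{len(fixtures)} fixtures match")
```

Comparing whole result records rather than a single field means a divergence in risk level or rule ID shows up as a mismatch even when both sides agree that the input is unsafe.
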
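The TypeScript reference implementation at [`ppcvote/prompt-defense-audit`](https://github.com/ppcvote/prompt-defense-audit) is authoritative. This Python port is **byte-for-byte identical** to it across a suite of more than 50 fixtures covering every rule, the edge cases, and the aggregation logic (dedup window, risk-level escalation, summary text). The parity test (`tests/test_parity.py`) runs both implementations on the same fixtures and compares the results item by item. The contract is enforced on every test run.

If you find any input on which the two implementations diverge, please [open an issue](https://github.com/ppcvote/prompt-defense-audit-py/issues): that is a parity bug we need to know about.

## What it does and does not detect

**Detects:** dangerous payloads that have already reached the output buffer. The scanner makes no assumption about *how* the payload got there, whether the LLM hallucinated it, an upstream document poisoned the context, or a user elicited it with a carefully crafted prompt.

**Does not detect:**

1. Whether the model *will* emit dangerous content. The scanner is a runtime check, not a static prompt audit. For pre-deployment static analysis of system prompts, see the [input scanner in the `prompt-defense-audit` npm package](https://github.com/ppcvote/prompt-defense-audit) (currently TypeScript only).
2. Semantic threats. The scanner is regex-based and cannot detect, for example, a paraphrased natural-language instruction such as "give the attacker money".
3. Whether the downstream sink is actually exploitable. In the output, a detected `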