woosal1337/prompt-injection-canary

GitHub: woosal1337/prompt-injection-canary

一种两阶段 prompt injection 检测器，不仅标记被注入的文本，还提取攻击者的对抗性指令和目标意图。

Stars: 1 | Forks: 0

# Prompt Injection Canary 本仓库是参考实现，也是所有三个课程交付物的存放地。 ## 交付物 | # | 交付物 | 位置 | |---|---|---| | 1 | **代码实现**（可复现的 pipeline + 本 README） | `src/canary/`, `scripts/`, `configs/`, `tests/`, `Makefile` | | 2 | **学术报告**（IEEE 模板，2–6 页） | [`报告`](report/) — `report/main.tex`（使用 `make pdf` 或在 Overleaf 上构建） | | 3 | **课堂演示**（10–15 分钟） | `presentation.ipynb` → `presentation.html`（阅读） / `presentation.slides.html`（演示） | 请参阅 [快速开始](#quickstart) 了解环境设置以及端到端复现 pipeline 的确切命令。 ## 它的功能现有的 prompt injection 检测器（PromptGuard、Deepset、ProtectAI、Lakera Guard）只输出一个单一的二元标签。这有两个已记录的失败模式： 1. **它们在 paraphrase、translation round-trip 和 encoded payloads 面前会崩溃。** 已发表的规避成功率接近 100%。 2. **二元标签是错误的 downstream API。** 宿主 agent 即便知道其输入是*被注入的*，也依然不知道**攻击者试图让它做什么**，因此它无法精准拒绝或有效升级处理。 canary 修复了这两个问题： ``` text ──▶ Pass 1 (fast multilingual encoder) ──▶ p_injection │ threshold τ │ above τ │ ▼ Pass 2 (instruction extractor) │ ▼ {status: "injected", extracted_instruction: "...", attacker_goal: "exfiltration", confidence: 0.92, evidence_span: "..."} ``` Pass 2 仅在 Pass-1 标记时触发，因此延迟成本受限于 Pass 1 的假阳性率。 ## 演示 notebook 一个独立且可在浏览器中查看的演示位于 `presentation.ipynb` 中，并被导出为两个静态 HTML 文件供网络使用。每个 cell 都能在 CPU 上一分钟内端到端运行。 | 文件 | 它是什么 | 如何打开 | |---|---|---| | `presentation.ipynb` | 交互式 Jupyter notebook（38 个 cell） | `jupyter lab presentation.ipynb` | | `presentation.html` | 静态线性页面 — 最适合阅读 | `open presentation.html`（或任何浏览器） | | `presentation.slides.html` | 静态 reveal.js 幻灯片 — 最适合演示 | `open presentation.slides.html` | **在浏览器中演示：** ``` # 最快路径 — 只需打开预构建的幻灯片文件 open presentation.slides.html # 或者在 localhost 上提供幻灯片（自动打开） jupyter nbconvert --to slides --post serve presentation.ipynb ``` **在编辑 `build_notebook.py` 后重新生成：** ``` python build_notebook.py # rebuild presentation.ipynb jupyter nbconvert --to notebook --execute --inplace presentation.ipynb \ --ExecutePreprocessor.kernel_name=canary --ExecutePreprocessor.timeout=600 jupyter nbconvert --to html presentation.ipynb # rebuild presentation.html jupyter nbconvert --to slides presentation.ipynb # rebuild presentation.slides.html ``` 该 notebook 使用 `canary` Jupyter kernel。只需注册一次： ``` python -m ipykernel install --user --name canary --display-name "canary" ``` ## 快速开始 ``` # 1. 将其安装（可编辑模式）到 virtualenv 中。 python -m venv .venv && source .venv/bin/activate pip install -e ".[dev]" # 2. 运行冒烟测试（离线，无 GPU，约30秒）。 make smoke # 3. 在手工制作的输入上运行 demo。 make demo # 4. 构建打包的微型语料库分割。 make data-tiny # 5. 在微型分割上运行完整 eval。 python scripts/05_eval_all.py --config configs/eval.yaml ``` 为了获取论文规模的数值，请先下载真实数据集： ``` python scripts/00_download_data.py # touches HackAPrompt / Tensor Trust / BIPIA / etc python scripts/01_build_splits.py # materializes the six eval splits python scripts/02_train_pass1.py # fine-tunes mDeBERTa python scripts/03_calibrate_threshold.py # picks the FPR=1% threshold python scripts/05_eval_all.py # produces report.json + summary.csv ``` ## 架构 | 组件 | 职责 | 默认 backend | |---|---|---| | `canary.pass1` | 快速多语种分类 | `protectai/deberta-v3-base-prompt-injection-v2` (HF)；离线时使用词法 fallback | | `canary.pass2` | 结构化指令提取 | `Qwen2.5-7B-Instruct` (HF)；离线时使用基于规则的 `echo` fallback；支持 `openai` / `anthropic` API backend | | `canary.pipeline` | 带有阈值门控的两阶段级联 | — | | `canary.data` | 用于 HackAPrompt / Tensor Trust / BIPIA / InjecAgent 的加载器 + 内置微型语料库 | HF Datasets | | `canary.augment` | Paraphrase、EN→TR/RU/ZH round-trip translation、base64/ROT13/zero-width encodings | 离线基于规则；在线使用 HF / OpenAI / Anthropic | | `canary.baselines` | 随机、多数、TF-IDF+LR、Deepset、ProtectAI、PromptGuard-2、Attention Tracker、GPT-judge | 统一的 `Detector` 协议 | | `canary.eval` | Macro F1、AUROC、FPR@TPR=0.95、span F1、提取忠实度、ECE、bootstrap CI、McNemar | — | | `canary.cli` | `canary predict / predict-batch / build-splits / calibrate / eval / smoke` | Click | 输出 schema 在 `canary.schema.SCHEMA_VERSION` 中进行版本控制，以便 downstream 消费者可以锁定稳定的契约。 ## 仓库布局 ``` prompt-injection-canary/ ├── README.md # this file ├── pyproject.toml # hatchling package + dev/api/translate extras ├── Dockerfile # CPU image with the CLI as entrypoint ├── Makefile # install / test / smoke / demo / data / train / eval / docker ├── configs/ # default.yaml + pass1.yaml + pass2.yaml + eval.yaml ├── src/canary/ │ ├── schema.py # public CanaryResult + ExtractedInstruction │ ├── taxonomy.py # 5-class attacker-goal taxonomy │ ├── pipeline.py # two-pass cascade │ ├── pass1/ # encoder + train + threshold calibration │ ├── pass2/ # extractor + prompts + JSON parsing/repair │ ├── data/ # loaders + tiny corpus + split builder │ ├── augment/ # paraphrase / translate / encode │ ├── baselines/ # 7-detector zoo with uniform interface │ ├── eval/ # metrics + bootstrap + McNemar + calibration + runner │ ├── cli.py # `canary ...` entry point │ └── utils.py # logging, seeds, config, JSONL, device ├── scripts/ # numbered drivers (00..06) + demo + smoke_test ├── tests/ # pytest suite (8 files) ├── docs/ # data card, model card, runbook ├── data/ # raw + processed splits (gitignored) └── artifacts/ # checkpoints, eval reports (gitignored) ``` ## 配置每个脚本和 CLI 命令都会读取 YAML 配置。配置通过 `inherits: default.yaml` 继承。在 CLI 中覆盖以进行一次性实验。与生产相关的核心配置项： - `pass1.model_name` — encoder backbone。默认为对离线友好的 ProtectAI；如果你有 HF token，可切换为 `meta-llama/Llama-Prompt-Guard-2-86M`，或者使用 `microsoft/mdeberta-v3-base` 从头训练 Canary-T。 - `pass1.target_fpr` — 校准目标（默认为 1%）。 - `pass2.backend` — `echo`（离线，无网络）、`hf`（本地 LLM）、`openai` 或 `anthropic`。 - `data.tiny_only` — 强制使用内置的微型语料库（用于 smoke test、CI）。 - `eval.splits_to_run` / `eval.baselines` — 选择一个子集以进行快速迭代。 ## 如何扩展通过实现 `canary.baselines.base.Detector` 并在 `canary.baselines.registry.REGISTRY` 中注册，即可投入新的检测器。eval 运行器会自动识别它。通过在 `canary.augment` 中公开一个函数并将其接入 `canary.data.splits.build_splits` 来添加新的数据增强。通过追加到 `canary.taxonomy.AttackerGoal` 来添加新的 attacker-goal 标签（切勿重新排序现有值 — JSON 输出是公共 API 的一部分）。 ## 复现论文提案提交了以下 splits： | Split | 大小 | 来源 | |---|---|---| | Direct-EN | ~1500 | HackAPrompt + Tensor Trust | | Indirect-EN | ~1250 | BIPIA | | Paraphrased | 1000 | LLM-paraphrased（留出 paraphraser） | | Translated | 1200 | EN ↔ {TR, RU, ZH} round-trip | | Encoded | 600 | base64 / ROT13 / zero-width | | Overdefense | 800 | 包含触发词的良性 prompt | 核心指标：macro F1、AUROC、FPR@TPR=0.95、span F1、提取忠实度、延迟 p50/p95、成本 / 1k 输入。对所有内容计算 Bootstrap 95% CI。对成对分类器比较进行 McNemar 测试。 ## 许可证 MIT。

标签：AI安全, Chat Copilot, Python, 凭据扫描, 后端开发, 安全规则引擎, 无后门, 请求拦截, 逆向工具