RootTwoOverTwelve/ipi_defense_eval_may29

GitHub: RootTwoOverTwelve/ipi_defense_eval_may29

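This project systematically benchmarks six off-the-shelf prompt-injection defenses on a unified dataset, both zero-shot and after LODO (leave-one-dataset-out) fine-tuning, to provide a fair quantitative comparison.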


# Off-the-shelf prompt-injection defenses on the May 29 unified dataset

Patrick (`prli` on Marlowe). A companion to Baihan's activation-probe work: using the same dataset and the same LODO (leave-one-dataset-out) protocol, it evaluates **text-based off-the-shelf defenses**, both zero-shot and after LODO fine-tuning, so that we have a fair baseline for comparison.

## TL;DR: headline numbers

**Zero-shot** (six off-the-shelf defenses, no training):

| Defense | TPR | FPR (alt + native benign combined) |
|---|---:|---:|
| PIGuard (`leolee99/PIGuard`) | 61.7% | 15.7% |
| PromptGuard 1 (`Prompt-Guard-86M`) | 67.8% | 43.4% |
| PromptGuard 2 (`Llama-Prompt-Guard-2-86M`) | 21.2% | 0.22% |
| ProtectAI v1 (`deberta-v3-base-prompt-injection`) | 13.5% | 0.29% |
| ProtectAI v2 (`-v2`) | 30.2% | 10.7% |
| TaskTracker (Llama-3-8B probe @ layer 31, thr 0.9) | 56.3% | 22.2% |

**LODO fine-tuned** (5 defenses × 5 held-out folds × seed 0 = 25 runs; AUC and accuracy are weighted across folds, at the deployable threshold):

| Defense | AUC | Acc | TPR@deploy | FPR@deploy |
|---|---:|---:|---:|---:|
| PIGuard | **0.926** | 74.1% | 48.4% | 0.25% |
| PromptGuard 1 | 0.907 | 70.3% | 42.5% | 2.02% |
| PromptGuard 2 | 0.856 | 63.4% | 27.2% | 0.41% |
| ProtectAI v1 | 0.722 | 56.7% | 56.7% | **43.2%** |
| ProtectAI v2 | **0.936** | 68.6% | 40.9% | 3.63% |
| Baihan R_AB (reference, for comparison) | **0.937** | **77.2%** | ~77% | ~6.4% |

Detailed per-dataset and per-fold breakdowns are in `outputs/zero_shot/README.md`, `headline_table.json`, and each defense's `*.summary.json` files.

## Repository layout

```
ipi_defense_eval_may29/
├── README.md                         # this file
├── METHODOLOGY.md                    # the methodology choices in detail (read this first)
├── scripts/
│   ├── eval_defenses_may29.py        # zero-shot eval of 5 HF-classifier defenses + TaskTracker wrapper
│   ├── eval_defenses_may29.sbatch
│   ├── eval_tasktracker_may29.py     # TaskTracker (Llama-3-8B activations + linear probe)
│   ├── eval_tasktracker_may29.sbatch
│   ├── finetune_defense_lodo.py      # LODO fine-tune for 5 HF defenses
│   ├── finetune_defense_lodo.sbatch
│   └── plot_defense_results.py       # 2-panel bar chart of all 6 defenses
├── outputs/
│   ├── zero_shot/                    # 16,550-row inference results, 6 defenses + PG1 all-scores variant
│   │   ├── README.md                 # detailed per-defense breakdown
│   │   ├── may29_.jsonl              # per-row predictions
│   │   ├── may29_.summary.json
│   │   ├── headline_table.json       # cross-defense aggregate
│   │   ├── defenses_bar_chart.png    # 6-defense comparison chart
│   │   └── defenses_bar_chart.pdf
│   └── lodo/                         # 25 (defense × fold) LODO fine-tune results
│       └── may29__lodo__seed0.{jsonl,summary.json}
```

## Quick start: cross-checking against Baihan's numbers

Every JSONL row has the following schema:

```
{"pair_id": "...", "dataset": "...", "label": "...", "origin": "...", "detect_flag": true/false, "score": 0.0-1.0}
```

To recompute the summary numbers from any file:

```
import json
from collections import defaultdict

path = "outputs/lodo/may29_piguard_lodo_AgentDojo_p_AgentDyn_seed0.jsonl"
with open(path) as f:
    rows = [json.loads(line) for line in f if line.strip()]

# Count total rows and flagged rows per (dataset, label, origin) cell.
cells = defaultdict(lambda: {"n": 0, "flagged": 0})
for r in rows:
    k = (r["dataset"], r["label"], r["origin"])
    cells[k]["n"] += 1
    if r["detect_flag"]:
        cells[k]["flagged"] += 1

# Fraction flagged per cell (a TPR for malicious cells, an FPR for benign ones).
for k, v in cells.items():
    print(k, v["flagged"] / v["n"])
```

The `summary.json` files already contain the precomputed deployable and oracle thresholds, along with the per-cell breakdowns.

## Dataset

All evaluations run on Baihan's May 29 unified file (16,550 rows, 5 datasets), located on Marlowe at:

`/scratch/m000243/baihan/Training data for May 29/Unified Training Data/unified_training_data.jsonl`

| Dataset | n_malicious | n_alt_benign | n_native_benign |
|---|---:|---:|---:|
| AgentDojo + AgentDyn | 2000 | 1645 | 355 |
| BIPIA | 2000 | 1998 | 2 |
| LLMail-Inject | 2000 | 1762 | 238 |
| SEP | 2000 | 1996 | 4 |
| CyberSecEval | 275 | 275 | 0 |

## Reproducing on Marlowe

```
# Zero-shot
sbatch --export=ALL,DEFENSE=,BATCH_SIZE=32 \
    --time=01:00:00 scripts/eval_defenses_may29.sbatch

# TaskTracker (requires Llama-3-8B + Microsoft's probe checkpoints in /scratch/m000243/prli/TaskTracker/)
sbatch --export=ALL,BATCH_SIZE=8 --time=02:00:00 scripts/eval_tasktracker_may29.sbatch

# LODO fine-tune (5 jobs per defense, one per held-out dataset)
for ds in "AgentDojo + AgentDyn" "BIPIA" "LLMail-Inject" "SEP" "CyberSecEval"; do
    sbatch --export=ALL,DEFENSE=piguard,HELD_OUT="$ds",EPOCHS=3,LR=1e-5,BATCH_SIZE=16 \
        --time=02:00:00 scripts/finetune_defense_lodo.sbatch
done
```

HF model cache: `/scratch/m000243/prli/hf_cache/`. The 5 classifier models are gated/private; PromptGuard 1/2 requires an HF_TOKEN with `meta-llama` access.

## Caveats to flag up front

1. **The HF classifier defenses ignore the user prompt.** We pass it in through the `target_insts` argument of each defense's batch function, but the classifier itself only sees the untrusted text. TaskTracker is the only defense in the set that uses both inputs (via the activation delta between two Llama-3-8B forward passes).
2. **Single seed (seed=0).** There are no variance estimates yet.
3. **PromptGuard fine-tuning uses max_seq_len=512** (rather than its native 2048) to match the context length of the other 4 defenses and to avoid OOM. Roughly 6% of the May 29 observations exceed 512 tokens.
4. **Threshold transfer hurts every defense.** At the deployable threshold (dev-set malicious TPR pinned at 0.95), held-out TPR collapses; at the oracle threshold (test-set malicious TPR forced to 0.95), FPR balloons to 30-95%. This is the same pattern Baihan sees with the activation probe: it is a property of LODO combined with saturated training-distribution confidence, not something specific to text classifiers.
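The gap between the deployable and oracle thresholds in caveat 4 can be illustrated with a small sketch. It is not the repository's implementation: the helper names `threshold_for_tpr` and `rate_flagged`, the rule "flag when `score >= threshold`", and the toy scores are all assumptions made for illustration. The only thing taken from the text is the idea of choosing a threshold that yields 0.95 TPR on malicious rows, either on the dev split (deployable) or on the held-out split (oracle).

```python
import math


def threshold_for_tpr(malicious_scores, target_tpr=0.95):
    """Largest threshold t such that flagging `score >= t` catches at least
    `target_tpr` of the given malicious scores (hypothetical helper)."""
    ranked = sorted(malicious_scores, reverse=True)
    k = math.ceil(target_tpr * len(ranked))  # number of malicious rows to catch
    return ranked[k - 1]


def rate_flagged(scores, threshold):
    """Fraction of rows whose score meets or exceeds the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Toy scores (made up): the model is saturated on in-distribution dev data,
# but much less confident on the held-out dataset.
dev_malicious = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.91, 0.90]
test_malicious = [0.95, 0.80, 0.60, 0.40, 0.20]
test_benign = [0.50, 0.30, 0.10, 0.05, 0.01]

deploy_thr = threshold_for_tpr(dev_malicious)   # fitted on dev data only
oracle_thr = threshold_for_tpr(test_malicious)  # peeks at held-out labels

deploy_tpr = rate_flagged(test_malicious, deploy_thr)
deploy_fpr = rate_flagged(test_benign, deploy_thr)
oracle_tpr = rate_flagged(test_malicious, oracle_thr)
oracle_fpr = rate_flagged(test_benign, oracle_thr)

print(f"deployable: thr={deploy_thr:.2f} TPR={deploy_tpr:.2f} FPR={deploy_fpr:.2f}")
print(f"oracle:     thr={oracle_thr:.2f} TPR={oracle_tpr:.2f} FPR={oracle_fpr:.2f}")
```

In this toy setup the dev-fitted threshold sits near the saturated dev scores, so held-out TPR drops while FPR stays low. Forcing 0.95 TPR on the held-out malicious rows pushes the threshold far down and raises FPR. That is the same qualitative trade-off the caveat describes, though the actual magnitudes should be read from the `summary.json` files.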
