jkemnitzer/mechinterp-injection-suite

GitHub: jkemnitzer/mechinterp-injection-suite

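CIRCE is a research tool that uses mechanistic interpretability methods to causally localize the minimal attention-head circuit behind prompt-injection behavior in instruction-tuned LLMs.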


# CIRCE

**Causal Injection-Response Circuit Explorer**: locates the attention circuit behind prompt injection in instruction-tuned LLMs. Built with [TransformerLens](https://github.com/TransformerLensOrg/TransformerLens).

[![Python](https://img.shields.io/badge/python-3.11-blue)](https://www.python.org/) [![TransformerLens](https://img.shields.io/badge/TransformerLens-v2-orange)](https://github.com/TransformerLensOrg/TransformerLens) [![License](https://img.shields.io/badge/license-MIT-green)](LICENSE) [![HuggingFace](https://img.shields.io/badge/🤗-HuggingFace-yellow)](https://huggingface.co/) [![Pydantic](https://img.shields.io/badge/Pydantic-v2-red)](https://docs.pydantic.dev/) [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![CUDA](https://img.shields.io/badge/CUDA-enabled-76b900?logo=nvidia)](https://developer.nvidia.com/cuda-toolkit)

## Setup

### 1. Environment

```
conda env create -f environment.yml
conda activate lab
pip install -e .
```

### 2. Configuration

Copy `.env.example` to `.env` and fill in your credentials:

```
cp .env.example .env
```

| Variable | Description |
|---|---|
| `HF_TOKEN` | HuggingFace access token |
| `HF_HOME` | Cache path for model weights (e.g. `/data/models`) |
| `ABLATION_DEVICE` | Device override: `cuda`, `mps`, or `cpu` (auto-detected if unset) |
| `JUDGE_API_KEY` | API key for the LLM judge |
| `JUDGE_API_BASE` | Base URL of the judge API (e.g. `https://.../api/v1`) |
| `JUDGE_MODEL` | Judge model ID (e.g. `lisa-flash-03-2026`) |

## Available Models

| Key | Model |
|---|---|
| `gemma2-2b-it` | google/gemma-2-2b-it |
| `gemma2-9b-it` | google/gemma-2-9b-it |
| `gemma-3-4b-it` | google/gemma-3-4b-it |
| `gemma-3-12b-it` | google/gemma-3-12b-it |
| `qwen2-7b-it` | Qwen/Qwen2-7B-Instruct |
| `qwen2.5-0.5b-instruct` | Qwen/Qwen2.5-0.5B-Instruct |
| `qwen2.5-1.5b-instruct` | Qwen/Qwen2.5-1.5B-Instruct |
| `qwen2.5-3b-instruct` | Qwen/Qwen2.5-3B-Instruct |
| `qwen2.5-7b-instruct` | Qwen/Qwen2.5-7B-Instruct |
| `qwen2.5-14b-instruct` | Qwen/Qwen2.5-14B-Instruct |
| `llama3-8b-it` | meta-llama/Meta-Llama-3-8B-Instruct |
| `llama3.1-8b-it` | meta-llama/Llama-3.1-8B-Instruct |
| `llama3.2-1b-instruct` | meta-llama/Llama-3.2-1B-Instruct |
| `llama3.2-3b-instruct` | meta-llama/Llama-3.2-3B-Instruct |
| `phi-4-it` | microsoft/phi-4 |
| `phi-3-mini-it` | microsoft/Phi-3-mini-4k-instruct |
| `mistral-7b-instruct` | mistralai/Mistral-7B-Instruct-v0.1 |
| `apertus-8b-instruct` | swiss-ai/Apertus-8B-Instruct-2509 |
| `yi-6b-chat` | 01-ai/Yi-6B-Chat |

`gemma-3-12b-it` and `qwen2.5-14b-instruct` cannot fit their bf16 weights and activations on a single 24 GB GPU. They are listed in `MULTI_GPU_MODELS` in `ablation/config.py` and are automatically sharded across 2 GPUs via TransformerLens's `n_devices`. Hooks still work transparently; unlike HF's `device_map="auto"`, the sharding is handled by TL. Launch them with both device indices visible, e.g. `CUDA_VISIBLE_DEVICES=0,1`. If fewer GPUs are visible than required, `ablation run` raises an explicit error.

## Pipeline

```
flowchart TD
    classDef setup fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef discovery fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef characts fill:#f3e8ff,stroke:#a855f7,color:#3b0764
    classDef io fill:#fff7ed,stroke:#f97316,color:#7c2d12

    INPUT(["`INPUT Instruction-tuned LLM Dataset: 100 injection pairs`"]):::io

    subgraph PHASE1["⚙️ PHASE 1 — Setup"]
        CAL["[1] CALIBRATE Forward pass on 50 clean prompts → mean_hook_z.pt"]:::setup
        BAS["[2] BASELINE Run all 100 pairs → working_pairs.json"]:::setup
        CAL --> BAS
    end

    subgraph PHASE2["🔍 PHASE 2 — Discovery"]
        SWE["[3] SWEEP Ablate every head individually → sweep.ndjson"]:::discovery
        ANA["[4] ANALYZE Binomial test per head → stats.json + heatmap"]:::discovery
        LR["[5] LOGIT-RANK Mean logit delta per head → logit_ranking.json"]:::discovery
        CAN["[6] CANDIDATES Intersect p-value ∩ logit rank → PRIMARY HEADS"]:::discovery
        SWE --> ANA --> LR --> CAN
    end

    subgraph PHASE3["🔬 PHASE 3 — Characterization"]
        DLA["[7] DLA Direct logit attribution"]:::characts
        ATT["[8] ATTN Attention pattern analysis"]:::characts
        AP["[9] ACT-PATCH Causal sufficiency · relay gap"]:::characts
        EP["[10] EARLIER-PATCH Relay position in span"]:::characts
        LC["[11] LEGIT-CTRL Benign generation control"]:::characts
        PRB["[12] PROBE Generalisation to new verbs"]:::characts
        CIR["[13] CIRCUIT Multi-head minimal circuit"]:::characts
        MAP["[14] ACT-PATCH --multi-head Full causal sufficiency of circuit"]:::characts
        CIR -->|ENSEMBLE| MAP
    end

    OUT(["`OUTPUT Primary head · Relay gap Circuit size · GENDISR = 0`"]):::io

    INPUT --> CAL
    BAS -->|working pairs| SWE
    CAN -->|primary head| DLA & ATT & AP & EP & LC & PRB & CIR
    DLA & ATT & AP & EP & LC & PRB & MAP --> OUT
```

### Interactive Launcher

The easiest way to run the pipeline or the defense evaluation across multiple models is the interactive launcher:

```
bash scripts/launcher.sh
```

It shows three fzf pickers in sequence:

1. **Command**: `run` (full pipeline) or `defense` (defense evaluation only)
2. **Models**: all registered models; press Tab to multi-select
3. **Devices**: available CUDA devices, shown with their free memory; press Tab to multi-select

Jobs are launched with `nohup`, distributed round-robin across the selected devices, and logged to `logs/_.log`. For `defense`, the circuit heads are resolved automatically from each model's `circuit.json`. Requires `fzf` (`conda install -c conda-forge fzf`). Run from the repository root.

### Manual Launch

Run the full pipeline end to end with a single command:

```
mkdir -p logs
CUDA_VISIBLE_DEVICES=0 nohup ablation run > logs/.log 2>&1 &
```

Add `--defense` to run the defense evaluation (judge mode) after circuit discovery:

```
CUDA_VISIBLE_DEVICES=0 nohup ablation run --defense > logs/.log 2>&1 &
```

Run two models in parallel on two GPUs:

```
mkdir -p logs
CUDA_VISIBLE_DEVICES=0 nohup ablation run > logs/.log 2>&1 &
CUDA_VISIBLE_DEVICES=1 nohup ablation run > logs/.log 2>&1 &
```

### Utilities

```
ablation similarity     # cross-layer head similarity matrix
ablation info           # model metadata
ablation data inspect   # dataset summary
ablation models check   # verify model loads correctly
```

Each step can also be run on its own, which is useful for re-running a single step or for debugging:

### Step 1: Calibrate

Computes the mean `hook_z` activations on clean prompts. These serve as the ablation baseline.

```
ablation calibrate
```

### Step 2: Baseline

Identifies the dataset pairs on which the model follows the injection, then filters them down to the `working_pairs` used for the sweep.

```
ablation baseline
```

### Step 3: Sweep

Runs a full head-ablation sweep over all working pairs × all attention heads.

```
ablation sweep
```

Each record is classified as one of:

- `INJECTION_SPECIFIC`: ablation blocked the injection and preserved the benign output
- `GENERAL_DISRUPTION`: ablation blocked the injection but also broke the benign output
- `INJECTED`: ablation failed to block the injection
- `JUDGE_ERROR`: the judge API was unavailable

### Step 4: Analyze

Computes the specific rate and binomial p-value for each head, and generates a heatmap.

```
ablation analyze
```

### Step 5: Logit Rank

Ranks all heads by their mean logit delta across the working pairs. This provides a signal complementary to the binomial specific rate.

```
ablation logit-rank
```

### Step 6: Candidates

Selects candidate heads from the intersection of the specific-rate ranking and the logit-delta ranking. Identifies the primary head and the circuit pool used for downstream analysis.

```
ablation candidates
```

### Step 7: DLA

Runs direct logit attribution (DLA) on the primary head. If `--layer`/`--head` are omitted, the head is selected automatically from `candidates.json`.

```
ablation dla
```

### Step 8: Attention Patterns

Produces attention-weight heatmaps and a last-token bar chart for the primary head.

```
ablation attn
```

### Step 9: Activation Patching

Runs a three-condition causal sufficiency test (final token / injection span / all positions).

```
# Single-head (auto-selects primary head)
ablation act-patch
# Multi-head (ENSEMBLE circuits)
ablation act-patch --heads L14H23,L12H12,L11H27,L0H3
```

### Step 10: Earlier-Position Patching

Runs a per-token sweep within the injection span to identify which positions drive the causal signal.

```
ablation earlier-patch
```

### Step 11: Legit Control

Measures logit retention and response match rate when the ablation is applied on benign prompts.

```
ablation legit-ctrl
```

### Step 12: Probe

Evaluates the injection blocking rate across all 6 override verbs (IGNORE, OVERRIDE, FORGET, DISREGARD, CANCEL, STOP).

```
# Single primary head (default)
ablation probe
# Top-N circuit heads ablated simultaneously
ablation probe --n-heads 4
# Explicit head list
ablation probe --heads L14H23,L12H12,L11H27,L0H3
```

### Step 13: Circuit

Tests multi-head circuit combinations (1→4 heads) built from the candidate pool. Reports the blocking rate of each configuration and classifies the circuit as HUB_SPOKE, ENSEMBLE, or PLATEAU.

```
# Fast logit mode
ablation circuit
# Paper-quality judge mode (same metric as sweep)
ablation circuit --use-judge
```

### Defense Evaluation (optional)

Runs all working pairs through the fully ablated circuit, scored by the judge. The baseline and post-ablation model outputs are recorded for every pair, so you can manually inspect what the model produced before and after ablation.

```
# Default: top 4 candidate heads
ablation defense
# Explicit head list
ablation defense --heads L14H23,L12H12,L11H27,L0H3
```

## Output Files

Results are organized per model under `results//`:

| File | Description |
|---|---|
| `calibration/mean_hook_z.pt` | Mean hook_z baseline (torch dict) |
| `sweep/baseline.json` | Baseline validation results + working pairs |
| `sweep/sweep.ndjson` | Full sweep results (NDJSON, one record per head×pair) |
| `analysis/stats.json` | Per-head specific_rate and binomial p-value |
| `analysis/stats_heatmap.png` | Injection blocking rate heatmap |
| `analysis/similarity/similarity_matrix.npy` | Cross-layer head similarity matrix |
| `analysis/similarity/similarity_plot.png` | Similarity heatmap |
| `analysis/logit_ranking.json` | Per-head mean logit delta ranking |
| `analysis/candidates.json` | Candidate heads + primary head selection |
| `analysis/dla/dla_L{l}H{h}.json` | DLA results per offset position |
| `analysis/dla/dla_L{l}H{h}.png` | DLA plot |
| `analysis/attn/attn_heatmap_L{l}H{h}_*.png` | Attention heatmap per pair |
| `analysis/attn/attn_last_token_L{l}H{h}.png` | Aggregated last-token attention bar chart |
| `analysis/act_patch/act_patch_L{l}H{h}.json` | Activation patching classification results |
| `analysis/act_patch/act_patch_L{l}H{h}.png` | Patching summary plot |
| `analysis/earlier_patch/earlier_patch_L{l}H{h}.json` | Per-offset causal drop |
| `analysis/earlier_patch/earlier_patch_L{l}H{h}.png` | Earlier-patch bar chart |
| `analysis/legit_ctrl/legit_ctrl_L{l}H{h}.json` | Retention rate and response match |
| `analysis/legit_ctrl/legit_ctrl_L{l}H{h}.png` | Logit delta distribution plot |
| `analysis/probe/probe_L{l}H{h}.json` | Per-verb blocking rate |
| `analysis/probe/probe_L{l}H{h}.png` | Probe bar chart |
| `analysis/circuit/circuit.json` | Multi-head circuit blocking rates + circuit type |
| `analysis/defense/defense_{label}.json` | Per-pair defense results with baseline and post-ablation outputs |

Cross-model comparison outputs are written to `results/compare/`:

| File | Description |
|---|---|
| `compare/circuit_scaling.png` | Best blocking rate by circuit size, one curve per model |
| `compare/head_positions.png` | Scatter plot of the relative layer depth of candidate heads across models |
| `compare/defense_summary.png` | Judge-verified blocking rate + GENDISR rate per model |
| `compare/summary.csv` | One row per model: primary head, circuit type, blocking rate, defense results |

## Cross-Model Comparison

After running the pipeline (and optionally the defense evaluation) on multiple models, generate the comparison plots with:

```
python scripts/compare.py
```

No model weights are loaded; the script only reads the JSON result files already on disk. Outputs are written to `results/compare/`.

## Citation

If you use this codebase in your research, please cite:

```
@misc{kemnitzer2026mechinterp,
  title  = {The Injection Circuit: A Mechanistic Interpretability Study of Prompt Injection in Instruction-Tuned Language Models},
  author = {Kemnitzer, Jonas},
  year   = {2026},
  note   = {Preprint}
}
```

### Acknowledgments

This project builds on [TransformerLens](https://github.com/TransformerLensOrg/TransformerLens) for access to model internals (hooks, activation caching, and patching).

```
@misc{nanda2022transformerlens,
  title        = {TransformerLens},
  author       = {Nanda, Neel and Bloom, Joseph},
  year         = {2022},
  howpublished = {\url{https://github.com/TransformerLensOrg/TransformerLens}}
}
```
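As an illustration of the per-head binomial test mentioned in the analysis step, the following is a minimal sketch rather than the actual CIRCE implementation. It assumes a one-sided test of the `INJECTION_SPECIFIC` count against a hypothetical null rate `p0`, and it assumes `JUDGE_ERROR` records are excluded from the denominator. The function names, the value of `p0`, and the example labels are all illustrative and do not come from the repository.

```python
from math import comb


def binomial_p_value(k: int, n: int, p0: float) -> float:
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p0)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def specific_counts(labels: list[str]) -> tuple[int, int]:
    """Return (INJECTION_SPECIFIC hits, usable records) for one head.

    Assumption: JUDGE_ERROR records carry no information and are dropped.
    """
    usable = [label for label in labels if label != "JUDGE_ERROR"]
    hits = sum(label == "INJECTION_SPECIFIC" for label in usable)
    return hits, len(usable)


# Hypothetical sweep labels for a single head across 20 working pairs.
labels = ["INJECTION_SPECIFIC"] * 9 + ["INJECTED"] * 10 + ["JUDGE_ERROR"]
hits, total = specific_counts(labels)

# p0 = 0.05 is an assumed null rate, not a value taken from the project.
p_value = binomial_p_value(hits, total, p0=0.05)
print(f"specific rate = {hits}/{total}, p = {p_value:.3g}")
```

A small p-value here would suggest that the head blocks injections more often than the assumed null rate predicts. With many heads tested, a multiple-comparison correction would normally be applied on top of this.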
