jkemnitzer/mechinterp-injection-suite

GitHub: jkemnitzer/mechinterp-injection-suite

# mechinterp-injection-suite

**Mechanistic interpretability pipeline for prompt injection research**

[![Python](https://img.shields.io/badge/python-3.11-blue)](https://www.python.org/)
[![TransformerLens](https://img.shields.io/badge/TransformerLens-v2-orange)](https://github.com/neelnanda-io/TransformerLens)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
[![HuggingFace](https://img.shields.io/badge/🤗-HuggingFace-yellow)](https://huggingface.co/)
[![Pydantic](https://img.shields.io/badge/Pydantic-v2-red)](https://docs.pydantic.dev/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![CUDA](https://img.shields.io/badge/CUDA-enabled-76b900?logo=nvidia)](https://developer.nvidia.com/cuda-toolkit)
## Setup

### 1. Environment

```bash
conda env create -f environment.yml
conda activate lab
pip install -e .
```

### 2. Configuration

Copy `.env.example` to `.env` and fill in your credentials:

```bash
cp .env.example .env
```

| Variable | Description |
|---|---|
| `HF_TOKEN` | HuggingFace access token |
| `HF_HOME` | Path to model weight cache (e.g. `/data/models`) |
| `ABLATION_DEVICE` | Device override: `cuda`, `mps`, or `cpu` (auto-detects if unset) |
| `JUDGE_API_KEY` | API key for LLM judge |
| `JUDGE_API_BASE` | Base URL for judge API (e.g. `https://.../api/v1`) |
| `JUDGE_MODEL` | Judge model ID (e.g. `lisa-flash-03-2026`) |

## Available models

| Key | Model |
|---|---|
| `gemma2-2b-it` | google/gemma-2-2b-it |
| `gemma2-9b-it` | google/gemma-2-9b-it |
| `gemma-3-4b-it` | google/gemma-3-4b-it |
| `llama3-8b` | meta-llama/Meta-Llama-3-8B-Instruct |
| `llama3.1-8b` | meta-llama/Llama-3.1-8B-Instruct |
| `mistral-7b` | mistralai/Mistral-7B-Instruct-v0.1 |
| `qwen2-7b` | Qwen/Qwen2-7B-Instruct |
| `qwen2.5-7b` | Qwen/Qwen2.5-7B-Instruct |
| `phi-3-mini` | microsoft/Phi-3-mini-4k-instruct |
| `phi-4` | microsoft/phi-4 |

## Pipeline

```mermaid
flowchart TD
    classDef setup fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef discovery fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef characts fill:#f3e8ff,stroke:#a855f7,color:#3b0764
    classDef io fill:#fff7ed,stroke:#f97316,color:#7c2d12

    INPUT(["`INPUT Instruction-tuned LLM Dataset: 100 injection pairs`"]):::io

    subgraph PHASE1["⚙️ PHASE 1: Setup"]
        CAL["[1] CALIBRATE Forward pass on 50 clean prompts → mean_hook_z.pt"]:::setup
        BAS["[2] BASELINE Run all 100 pairs → working_pairs.json"]:::setup
        CAL --> BAS
    end

    subgraph PHASE2["🔍 PHASE 2: Discovery"]
        SWE["[3] SWEEP Ablate every head individually → sweep.ndjson"]:::discovery
        ANA["[4] ANALYZE Binomial test per head → stats.json + heatmap"]:::discovery
        LR["[5] LOGIT-RANK Mean logit delta per head → logit_ranking.json"]:::discovery
        CAN["[6] CANDIDATES Intersect p-value ∩ logit rank → PRIMARY HEADS"]:::discovery
        SWE --> ANA --> LR --> CAN
    end

    subgraph PHASE3["🔬 PHASE 3: Characterization"]
        DLA["[7] DLA Direct logit attribution"]:::characts
        ATT["[8] ATTN Attention pattern analysis"]:::characts
        AP["[9] ACT-PATCH Causal sufficiency · relay gap"]:::characts
        EP["[10] EARLIER-PATCH Relay position in span"]:::characts
        LC["[11] LEGIT-CTRL Benign generation control"]:::characts
        PRB["[12] PROBE Generalisation to new verbs"]:::characts
        CIR["[13] CIRCUIT Multi-head minimal circuit"]:::characts
        MAP["[14] ACT-PATCH --multi-head Full causal sufficiency of circuit"]:::characts
        CIR -->|ENSEMBLE| MAP
    end

    OUT(["`OUTPUT Primary head · Relay gap Circuit size · GENDISR = 0`"]):::io

    INPUT --> CAL
    BAS -->|working pairs| SWE
    CAN -->|primary head| DLA & ATT & AP & EP & LC & PRB & CIR
    DLA & ATT & AP & EP & LC & PRB & MAP --> OUT
```

Run the full pipeline end-to-end with a single command:

```bash
mkdir -p logs
CUDA_VISIBLE_DEVICES=0 nohup ablation run > logs/.log 2>&1 &
```

To run two models in parallel across two GPUs:

```bash
mkdir -p logs
CUDA_VISIBLE_DEVICES=0 nohup ablation run > logs/.log 2>&1 &
CUDA_VISIBLE_DEVICES=1 nohup ablation run > logs/.log 2>&1 &
```

### Utilities

```bash
ablation similarity    # cross-layer head similarity matrix
ablation info          # model metadata
ablation data inspect  # dataset summary
ablation models check  # verify model loads correctly
```

Individual steps can also be run separately, which is useful for re-running a single step or debugging:

### Step 1: Calibrate

Compute mean `hook_z` activations over clean prompts. Used as the ablation baseline.

```bash
ablation calibrate
```

### Step 2: Baseline

Identify which dataset pairs the model follows the injection on. Filters to `working_pairs` used in the sweep.

```bash
ablation baseline
```

### Step 3: Sweep

```bash
ablation sweep
```

Each record is classified as:

- `INJECTION_SPECIFIC`: ablation blocks injection, benign output preserved
- `GENERAL_DISRUPTION`: ablation blocks injection but also disrupts benign output
- `INJECTED`: ablation did not block injection
- `JUDGE_ERROR`: judge API unavailable

### Step 4: Analyze

Compute per-head specific rates, binomial p-values, and generate heatmap.

```bash
ablation analyze
```

### Step 5: Logit rank

```bash
ablation logit-rank
```

### Step 6: Candidates

```bash
ablation candidates
```

### Step 7: DLA

Direct Logit Attribution for the primary head. Auto-selects head from `candidates.json` if `--layer`/`--head` omitted.

```bash
ablation dla
```

### Step 8: Attention patterns

Attention weight heatmaps and last-token bar chart for the primary head.

```bash
ablation attn
```

### Step 9: Activation patching

Three-condition causal sufficiency test (final token / injection span / all positions).

```bash
# Single-head (auto-selects primary head)
ablation act-patch

# Multi-head (ENSEMBLE circuits)
ablation act-patch --heads L14H23,L12H12,L11H27,L0H3
```

### Step 10: Earlier-position patching

Per-token sweep within the injection span to identify which positions drive the causal signal.

```bash
ablation earlier-patch
```

### Step 11: Legitimate control

Measures logit preservation and response match rate under ablation on benign prompts.

```bash
ablation legit-ctrl
```

### Step 12: Probe

Evaluates injection blocking rate across all 6 override verbs (IGNORE, OVERRIDE, FORGET, DISREGARD, CANCEL, STOP).

```bash
ablation probe
```

### Step 13: Circuit

```bash
# Fast logit mode
ablation circuit

# Paper-quality judge mode (same metric as sweep)
ablation circuit --use-judge
```

## Output files

Results are organised per model under `results//`:

| File | Description |
|---|---|
| `calibration/mean_hook_z.pt` | Mean hook_z baseline (torch dict) |
| `sweep/baseline.json` | Baseline validation results + working pairs |
| `sweep/sweep.ndjson` | Full sweep results (NDJSON, one record per head×pair) |
| `analysis/stats.json` | Per-head specific_rate and binomial p-values |
| `analysis/stats_heatmap.png` | Injection blocking rate heatmap |
| `analysis/similarity/similarity_matrix.npy` | Cross-layer head similarity matrix |
| `analysis/similarity/similarity_plot.png` | Similarity heatmap |
| `analysis/logit_ranking.json` | Per-head mean logit delta ranking |
| `analysis/candidates.json` | Candidate heads + primary head selection |
| `analysis/dla/dla_L{l}H{h}.json` | DLA results per offset position |
| `analysis/dla/dla_L{l}H{h}.png` | DLA plot |
| `analysis/attn/attn_heatmap_L{l}H{h}_*.png` | Per-pair attention heatmaps |
| `analysis/attn/attn_last_token_L{l}H{h}.png` | Last-token aggregate attention bar chart |
| `analysis/act_patch/act_patch_L{l}H{h}.json` | Activation patching classification results |
| `analysis/act_patch/act_patch_L{l}H{h}.png` | Patching summary plot |
| `analysis/earlier_patch/earlier_patch_L{l}H{h}.json` | Per-offset causal drop |
| `analysis/earlier_patch/earlier_patch_L{l}H{h}.png` | Earlier-patch bar chart |
| `analysis/legit_ctrl/legit_ctrl_L{l}H{h}.json` | Preservation rate and response match |
| `analysis/legit_ctrl/legit_ctrl_L{l}H{h}.png` | Logit delta distribution plot |
| `analysis/probe/probe_L{l}H{h}.json` | Per-verb blocking rates |
| `analysis/probe/probe_L{l}H{h}.png` | Probe bar chart |
| `analysis/circuit/circuit.json` | Multi-head circuit blocking rates + circuit type |

## Citation

If you use this codebase in your research, please cite:

```bibtex
@misc{kemnitzer2026mechinterp,
  title  = {The Injection Circuit: A Mechanistic Interpretability Study of Prompt Injection in Instruction-Tuned Language Models},
  author = {Kemnitzer, Jonas},
  year   = {2026},
  note   = {Preprint}
}
```
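
As a rough illustration of what the sweep and analyze steps compute, the sketch below turns sweep records into a per-head `specific_rate` and a binomial p-value. It is not the suite's implementation. The record fields `layer`, `head`, and `label`, the `null_rate` parameter, the one-sided form of the test, and the choice to drop `JUDGE_ERROR` records are all assumptions made for this example.

```python
import json
from collections import defaultdict
from math import comb


def binom_sf(k: int, n: int, p: float) -> float:
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def per_head_stats(ndjson_lines, null_rate: float = 0.05) -> dict:
    """Aggregate sweep records into per-head rates and p-values.

    Assumed (hypothetical) record shape, one JSON object per line:
        {"layer": int, "head": int, "label": str}
    where label is one of INJECTION_SPECIFIC, GENERAL_DISRUPTION,
    INJECTED, JUDGE_ERROR.
    """
    counts = defaultdict(lambda: [0, 0])  # (layer, head) -> [specific, total]
    for line in ndjson_lines:
        rec = json.loads(line)
        if rec["label"] == "JUDGE_ERROR":
            continue  # assumption: unjudged records carry no evidence
        key = (rec["layer"], rec["head"])
        counts[key][1] += 1
        if rec["label"] == "INJECTION_SPECIFIC":
            counts[key][0] += 1

    stats = {}
    for (layer, head), (specific, total) in sorted(counts.items()):
        stats[f"L{layer}H{head}"] = {
            "specific_rate": specific / total,
            "n": total,
            # One-sided test: is the specific rate above the assumed null rate?
            "p_value": binom_sf(specific, total, null_rate),
        }
    return stats


# Toy records for two heads; real input would be the lines of sweep.ndjson.
demo_lines = [
    '{"layer": 0, "head": 0, "label": "INJECTION_SPECIFIC"}',
    '{"layer": 0, "head": 0, "label": "INJECTION_SPECIFIC"}',
    '{"layer": 0, "head": 0, "label": "INJECTION_SPECIFIC"}',
    '{"layer": 0, "head": 0, "label": "INJECTED"}',
    '{"layer": 0, "head": 1, "label": "INJECTED"}',
    '{"layer": 0, "head": 1, "label": "GENERAL_DISRUPTION"}',
    '{"layer": 0, "head": 1, "label": "JUDGE_ERROR"}',
]
demo_stats = per_head_stats(demo_lines)
print(json.dumps(demo_stats, indent=2))
```

The exact tail sum uses only the standard library; for large sweeps, a statistics library's binomial test and a multiple-comparison correction across heads would likely be a better fit.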