
# mechinterp-injection-suite
**Mechanistic interpretability pipeline for prompt injection research**
## Setup
### 1. Environment

    conda env create -f environment.yml
    conda activate lab
    pip install -e .

### 2. Configuration
Copy `.env.example` to `.env` and fill in your credentials:

    cp .env.example .env

| Variable | Description |
|---|---|
| `HF_TOKEN` | HuggingFace access token |
| `HF_HOME` | Path to model weight cache (e.g. `/data/models`) |
| `ABLATION_DEVICE` | Device override: `cuda`, `mps`, or `cpu` (auto-detects if unset) |
| `JUDGE_API_KEY` | API key for LLM judge |
| `JUDGE_API_BASE` | Base URL for judge API (e.g. `https://.../api/v1`) |
| `JUDGE_MODEL` | Judge model ID (e.g. `lisa-flash-03-2026`) |
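
The `ABLATION_DEVICE` override with auto-detection could be resolved along the following lines. This is a minimal sketch, not the suite's actual code: the function name `resolve_device` and the detection order (CUDA, then MPS, then CPU) are assumptions.

```python
import os


def resolve_device(env=None):
    """Pick a device string: explicit override first, then auto-detect.

    Hypothetical helper; the detection order is an assumption.
    """
    env = os.environ if env is None else env
    override = env.get("ABLATION_DEVICE", "").strip().lower()
    if override:
        if override not in {"cuda", "mps", "cpu"}:
            raise ValueError(f"unsupported ABLATION_DEVICE: {override!r}")
        return override

    try:
        import torch  # only needed for auto-detection
    except ImportError:
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing in older PyTorch releases
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

Passing a plain dict instead of `os.environ` makes the override logic easy to check in isolation, for example `resolve_device({"ABLATION_DEVICE": "cpu"})`.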
## Available models
| Key | Model |
|---|---|
| `gemma2-2b-it` | google/gemma-2-2b-it |
| `gemma2-9b-it` | google/gemma-2-9b-it |
| `gemma-3-4b-it` | google/gemma-3-4b-it |
| `llama3-8b` | meta-llama/Meta-Llama-3-8B-Instruct |
| `llama3.1-8b` | meta-llama/Llama-3.1-8B-Instruct |
| `mistral-7b` | mistralai/Mistral-7B-Instruct-v0.1 |
| `qwen2-7b` | Qwen/Qwen2-7B-Instruct |
| `qwen2.5-7b` | Qwen/Qwen2.5-7B-Instruct |
| `phi-3-mini` | microsoft/Phi-3-mini-4k-instruct |
| `phi-4` | microsoft/phi-4 |
## Pipeline

    flowchart TD
    classDef setup fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef discovery fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef characts fill:#f3e8ff,stroke:#a855f7,color:#3b0764
    classDef io fill:#fff7ed,stroke:#f97316,color:#7c2d12
    INPUT(["`INPUT
    Instruction-tuned LLM
    Dataset: 100 injection pairs`"]):::io
    subgraph PHASE1["⚙️ PHASE 1 — Setup"]
    CAL["[1] CALIBRATE
    Forward pass on 50 clean prompts
    → mean_hook_z.pt"]:::setup
    BAS["[2] BASELINE
    Run all 100 pairs
    → working_pairs.json"]:::setup
    CAL --> BAS
    end
    subgraph PHASE2["🔍 PHASE 2 — Discovery"]
    SWE["[3] SWEEP
    Ablate every head individually
    → sweep.ndjson"]:::discovery
    ANA["[4] ANALYZE
    Binomial test per head
    → stats.json + heatmap"]:::discovery
    LR["[5] LOGIT-RANK
    Mean logit delta per head
    → logit_ranking.json"]:::discovery
    CAN["[6] CANDIDATES
    Intersect p-value ∩ logit rank
    → PRIMARY HEADS"]:::discovery
    SWE --> ANA --> LR --> CAN
    end
    subgraph PHASE3["🔬 PHASE 3 — Characterization"]
    DLA["[7] DLA
    Direct logit attribution"]:::characts
    ATT["[8] ATTN
    Attention pattern analysis"]:::characts
    AP["[9] ACT-PATCH
    Causal sufficiency · relay gap"]:::characts
    EP["[10] EARLIER-PATCH
    Relay position in span"]:::characts
    LC["[11] LEGIT-CTRL
    Benign generation control"]:::characts
    PRB["[12] PROBE
    Generalisation to new verbs"]:::characts
    CIR["[13] CIRCUIT
    Multi-head minimal circuit"]:::characts
    MAP["[14] ACT-PATCH --multi-head
    Full causal sufficiency of circuit"]:::characts
    CIR -->|ENSEMBLE| MAP
    end
    OUT(["`OUTPUT
    Primary head · Relay gap
    Circuit size · GENDISR = 0`"]):::io
    INPUT --> CAL
    BAS -->|working pairs| SWE
    CAN -->|primary head| DLA & ATT & AP & EP & LC & PRB & CIR
    DLA & ATT & AP & EP & LC & PRB & MAP --> OUT

Run the full pipeline end-to-end with a single command:

    mkdir -p logs
    CUDA_VISIBLE_DEVICES=0 nohup ablation run > logs/.log 2>&1 &

To run two models in parallel across two GPUs:

    mkdir -p logs
    CUDA_VISIBLE_DEVICES=0 nohup ablation run > logs/.log 2>&1 &
    CUDA_VISIBLE_DEVICES=1 nohup ablation run > logs/.log 2>&1 &

### Utilities

    ablation similarity      # cross-layer head similarity matrix
    ablation info            # model metadata
    ablation data inspect    # dataset summary
    ablation models check    # verify model loads correctly

Individual steps can also be run separately, which is useful for re-running a single step or for debugging:
### Step 1 — Calibrate
Compute the mean `hook_z` activations over clean prompts. These means serve as the ablation baseline.

    ablation calibrate

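
As a rough illustration of what a mean baseline is and how it can stand in for a head's output during ablation, the toy sketch below uses plain Python lists. The names `mean_activation` and `mean_ablate` are hypothetical and not part of the `ablation` CLI; the real pipeline presumably works on `hook_z` tensors rather than lists.

```python
from statistics import fmean


def mean_activation(samples):
    """Element-wise mean over equal-length activation vectors (one per prompt)."""
    if not samples:
        raise ValueError("need at least one sample")
    return [fmean(column) for column in zip(*samples)]


def mean_ablate(head_outputs, head, baseline):
    """Return a copy of the per-head outputs with one head replaced by the baseline."""
    patched = [list(vector) for vector in head_outputs]
    patched[head] = list(baseline)
    return patched


# Toy data: activations of one head on two clean prompts
clean_runs = [[1.0, 2.0], [3.0, 4.0]]
baseline = mean_activation(clean_runs)

# Toy forward pass with three heads; ablate head 1
outputs = [[0.5, 0.5], [9.0, 9.0], [0.1, 0.2]]
patched = mean_ablate(outputs, head=1, baseline=baseline)
```

Replacing a head's output with its mean, rather than with zeros, keeps the ablated activation closer to the range the downstream layers normally see.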
### Step 2 — Baseline
Identify the dataset pairs on which the model follows the injection. Only these `working_pairs` are used in the sweep.

    ablation baseline

### Step 3 — Sweep

    ablation sweep

Each record is classified as:
- `INJECTION_SPECIFIC` — ablation blocks injection, benign output preserved
- `GENERAL_DISRUPTION` — ablation blocks injection but also disrupts benign output
- `INJECTED` — ablation did not block injection
- `JUDGE_ERROR` — judge API unavailable
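
The four labels above can be read as a small decision rule. The sketch below is one way to express it; the function name `classify_record` and its boolean inputs are assumptions, since the README does not say how the judge's verdicts are represented.

```python
def classify_record(injection_blocked, benign_preserved, judge_ok=True):
    """Map the outcome of ablating one head on one pair to a sweep label.

    Hypothetical helper: the argument names are illustrative only.
    """
    if not judge_ok:
        return "JUDGE_ERROR"          # judge API unavailable
    if not injection_blocked:
        return "INJECTED"             # ablation did not block the injection
    if benign_preserved:
        return "INJECTION_SPECIFIC"   # blocked, benign output intact
    return "GENERAL_DISRUPTION"       # blocked, but benign output also disrupted
```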
### Step 4 — Analyze
Compute per-head specific rates and binomial p-values, and generate the heatmap.

    ablation analyze

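
For intuition, a per-head binomial p-value can be computed with the standard library alone. The sketch below assumes a one-sided test against a null rate `p0`; the README does not state which null rate or sidedness `ablation analyze` uses, so both are assumptions, as is the function name.

```python
from math import comb


def binomial_p_value(successes, trials, p0):
    """One-sided p-value P(X >= successes) for X ~ Binomial(trials, p0)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must lie in [0, 1]")
    return sum(
        comb(trials, k) * p0**k * (1.0 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Example: a head is INJECTION_SPECIFIC on 12 of 40 working pairs,
# tested against an assumed null rate of 5 %.
p = binomial_p_value(12, 40, 0.05)
```

With many heads tested at once, some correction for multiple comparisons would normally be applied on top of the raw p-values.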
### Step 5 — Logit rank

    ablation logit-rank

### Step 6 — Candidates

    ablation candidates

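
The pipeline diagram describes this step as intersecting the p-value results with the logit ranking. A minimal sketch of such an intersection follows; the significance threshold `alpha`, the cut-off `top_k`, the assumption that a larger logit delta means a more important head, and the name `select_candidates` are all illustrative and not taken from the codebase.

```python
def select_candidates(p_values, logit_deltas, alpha=0.05, top_k=10):
    """Heads that are both significant and among the top-k by logit delta.

    Both inputs map (layer, head) tuples to floats. The result is sorted
    by descending logit delta, so its first entry would be the primary head.
    """
    significant = {head for head, p in p_values.items() if p < alpha}
    ranked = sorted(logit_deltas, key=lambda h: logit_deltas[h], reverse=True)
    top = set(ranked[:top_k])
    return sorted(significant & top, key=lambda h: logit_deltas[h], reverse=True)


p_values = {(14, 23): 0.001, (12, 12): 0.2, (0, 3): 0.01}
logit_deltas = {(14, 23): 2.5, (12, 12): 3.0, (0, 3): 0.1}
candidates = select_candidates(p_values, logit_deltas, top_k=2)
```

In this toy example only L14H23 passes both filters: L12H12 ranks highly but is not significant, and L0H3 is significant but falls outside the top two.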
### Step 7 — DLA
Direct logit attribution (DLA) for the primary head. The head is auto-selected from `candidates.json` if `--layer`/`--head` are omitted.

    ablation dla

### Step 8 — Attention patterns
Attention weight heatmaps and last-token bar chart for the primary head.

    ablation attn

### Step 9 — Activation patching
Three-condition causal sufficiency test (final token / injection span / all positions).

    # Single-head (auto-selects primary head)
    ablation act-patch

    # Multi-head (ENSEMBLE circuits)
    ablation act-patch --heads L14H23,L12H12,L11H27,L0H3

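
The `--heads` value is a comma-separated list of `L<layer>H<head>` labels. A parser for that notation might look like the sketch below; `parse_heads` is a hypothetical helper and the CLI's own parsing may differ.

```python
import re

# One label such as "L14H23": layer index, then head index
_HEAD_LABEL = re.compile(r"L(\d+)H(\d+)")


def parse_heads(spec):
    """Turn 'L14H23,L0H3' into [(14, 23), (0, 3)], rejecting malformed labels."""
    heads = []
    for token in spec.split(","):
        match = _HEAD_LABEL.fullmatch(token.strip())
        if match is None:
            raise ValueError(f"invalid head label: {token!r}")
        heads.append((int(match.group(1)), int(match.group(2))))
    return heads


heads = parse_heads("L14H23,L12H12,L11H27,L0H3")
```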
### Step 10 — Earlier-position patching
Per-token sweep within the injection span to identify which positions drive the causal signal.

    ablation earlier-patch

### Step 11 — Legitimate control
Measures logit preservation and response match rate under ablation on benign prompts.

    ablation legit-ctrl

### Step 12 — Probe
Evaluates the injection blocking rate across all six override verbs (IGNORE, OVERRIDE, FORGET, DISREGARD, CANCEL, STOP).

    ablation probe

### Step 13 — Circuit

    # Fast logit mode
    ablation circuit

    # Paper-quality judge mode (same metric as sweep)
    ablation circuit --use-judge

## Output files
Results are organised per model under `results//`:
| File | Description |
|---|---|
| `calibration/mean_hook_z.pt` | Mean hook_z baseline (torch dict) |
| `sweep/baseline.json` | Baseline validation results + working pairs |
| `sweep/sweep.ndjson` | Full sweep results (NDJSON, one record per head×pair) |
| `analysis/stats.json` | Per-head specific_rate and binomial p-values |
| `analysis/stats_heatmap.png` | Injection blocking rate heatmap |
| `analysis/similarity/similarity_matrix.npy` | Cross-layer head similarity matrix |
| `analysis/similarity/similarity_plot.png` | Similarity heatmap |
| `analysis/logit_ranking.json` | Per-head mean logit delta ranking |
| `analysis/candidates.json` | Candidate heads + primary head selection |
| `analysis/dla/dla_L{l}H{h}.json` | DLA results per offset position |
| `analysis/dla/dla_L{l}H{h}.png` | DLA plot |
| `analysis/attn/attn_heatmap_L{l}H{h}_*.png` | Per-pair attention heatmaps |
| `analysis/attn/attn_last_token_L{l}H{h}.png` | Last-token aggregate attention bar chart |
| `analysis/act_patch/act_patch_L{l}H{h}.json` | Activation patching classification results |
| `analysis/act_patch/act_patch_L{l}H{h}.png` | Patching summary plot |
| `analysis/earlier_patch/earlier_patch_L{l}H{h}.json` | Per-offset causal drop |
| `analysis/earlier_patch/earlier_patch_L{l}H{h}.png` | Earlier-patch bar chart |
| `analysis/legit_ctrl/legit_ctrl_L{l}H{h}.json` | Preservation rate and response match |
| `analysis/legit_ctrl/legit_ctrl_L{l}H{h}.png` | Logit delta distribution plot |
| `analysis/probe/probe_L{l}H{h}.json` | Per-verb blocking rates |
| `analysis/probe/probe_L{l}H{h}.png` | Probe bar chart |
| `analysis/circuit/circuit.json` | Multi-head circuit blocking rates + circuit type |
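
Because `sweep.ndjson` holds one JSON record per head×pair, it can be streamed line by line. The sketch below aggregates a per-head specific rate from such records. The field names `layer`, `head`, `pair_id` and `label`, and the choice to exclude `JUDGE_ERROR` records from the denominator, are assumptions; check them against an actual record before relying on this.

```python
import io
import json
from collections import Counter


def specific_rate_per_head(lines):
    """Fraction of INJECTION_SPECIFIC records per (layer, head)."""
    totals = Counter()
    specific = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["label"] == "JUDGE_ERROR":
            continue  # assumption: judge failures are not counted
        key = (record["layer"], record["head"])
        totals[key] += 1
        if record["label"] == "INJECTION_SPECIFIC":
            specific[key] += 1
    return {key: specific[key] / totals[key] for key in totals}


# In-memory stand-in for a sweep file, using the assumed field names
sample = io.StringIO(
    '{"layer": 14, "head": 23, "pair_id": 0, "label": "INJECTION_SPECIFIC"}\n'
    '{"layer": 14, "head": 23, "pair_id": 1, "label": "INJECTED"}\n'
    '{"layer": 14, "head": 23, "pair_id": 2, "label": "JUDGE_ERROR"}\n'
    '{"layer": 0, "head": 3, "pair_id": 0, "label": "GENERAL_DISRUPTION"}\n'
)
rates = specific_rate_per_head(sample)
```

For a real file, the same function accepts an open file handle in place of the `StringIO` object.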
## Citation
If you use this codebase in your research, please cite:

    @misc{kemnitzer2026mechinterp,
      title  = {The Injection Circuit: A Mechanistic Interpretability Study of Prompt Injection in Instruction-Tuned Language Models},
      author = {Kemnitzer, Jonas},
      year   = {2026},
      note   = {Preprint}
    }