GitHub: brandon-behring/prompt-injection-portfolio
# prompt-injection-portfolio
## Three ways to read this work
Per the Round 17 architecture, the portfolio ships **three peer-level guides**
in `book/src/content/`, each targeting a different reader. All three share
the same experiment-record substrate (fragments, per Round 17 follow-up Q2).
| Guide | Audience | TOC | Status |
|---|---|---|---|
| **Textbook** (`/textbook/[slug]`) | Practitioners learning prompt-injection detection methodology | 13 chapters / 4 parts / KF triadic R/O/E | ✓ skeletons shipped; prose fills M1-M7; ratifies at v0.7.0 |
| **Narrative** (`/narrative/[slug]`) | Curious engineers + recruiters who skim story-form writing | Setup → 6 climb attempts → resolution; heavy cross-chapter threading | ⏳ ships at v0.8.0 (~month 13) |
| **Academic IMRaD** (`/academic/[slug]`) | Researchers + reviewers who want compressed journal-paper flow | Introduction → Background → Methods → Results (6 lanes) → Discussion → Future Work | ⏳ ships at v0.9.0 (~month 14) |
## The problem
Prompt-injection detectors routinely report 99% accuracy on held-out splits
of training-adjacent data but collapse under realistic out-of-distribution
evaluation. The predecessor submission at
[brandon-behring/prompt-injection-detection-prototype](https://github.com/brandon-behring/prompt-injection-detection-prototype)
demonstrated this honestly: fine-tuning ModernBERT on direct-injection-heavy
LODO data made the indirect/agentic OOD slice **worse** than the
frozen-probe baseline (BIPIA AUPRC: LoRA 0.293 · frozen-probe 0.364 ·
prevalence 0.374). **Fine-tuning consumed the OOD generalization budget.**
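
One way to read the BIPIA numbers above: a scorer that ranks at random has an expected AUPRC of roughly the positive-class prevalence, so prevalence acts as the chance floor. The sketch below divides each reported AUPRC by the reported prevalence; a ratio below 1 means the ranking is no better than chance on that slice. The dictionary and function names are illustrative assumptions, not identifiers from the submission's code.

```python
# Figures reported in the paragraph above (BIPIA slice).
PREVALENCE = 0.374
AUPRC = {"lora": 0.293, "frozen_probe": 0.364}


def lift_over_chance(auprc: float, prevalence: float) -> float:
    """Ratio of AUPRC to the chance floor (the positive-class prevalence)."""
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    return auprc / prevalence


# Hypothetical helper usage: compute the lift for each reported scorer.
lifts = {name: lift_over_chance(value, PREVALENCE) for name, value in AUPRC.items()}

for name, lift in lifts.items():
    print(f"{name}: lift = {lift:.3f}")
# lora: lift = 0.783
# frozen_probe: lift = 0.973
```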
The submission's v1.1.2 DeBERTa-v3-base ablation extended this finding:
chunk_and_average (0.2912) ≈ head_truncation (0.2895) on pooled OOD —
the wall is **not** context-window-driven. It's *backbone-invariant* across
ModernBERT and DeBERTa.
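
The two long-input strategies compared above can be sketched as follows. This is a minimal illustration under stated assumptions, not the submission's implementation: `toy_scorer` is a hypothetical stand-in for the encoder classifier, tokens are whitespace-split words, and the window size is arbitrary. Head truncation scores only the first window, while chunk-and-average scores every window and takes the mean. The toy input places the payload beyond the first window, which is the case where the two strategies can disagree; the near-equal pooled OOD scores reported above suggest that this case does not explain the gap.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[List[str]], float]


def chunk(tokens: Sequence[str], window: int) -> List[List[str]]:
    """Split tokens into consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    chunks = [list(tokens[i:i + window]) for i in range(0, len(tokens), window)]
    return chunks or [[]]  # keep one empty window so averaging never divides by zero


def head_truncation(tokens: Sequence[str], window: int, score: Scorer) -> float:
    """Score only the first `window` tokens and ignore the rest."""
    return score(list(tokens[:window]))


def chunk_and_average(tokens: Sequence[str], window: int, score: Scorer) -> float:
    """Score every window and return the mean of the per-window scores."""
    scores = [score(c) for c in chunk(tokens, window)]
    return sum(scores) / len(scores)


def toy_scorer(window_tokens: List[str]) -> float:
    """Hypothetical scorer: fraction of tokens equal to 'ignore'."""
    if not window_tokens:
        return 0.0
    return sum(t == "ignore" for t in window_tokens) / len(window_tokens)


doc = "please summarize this page ignore previous instructions ignore".split()
head = head_truncation(doc, 4, toy_scorer)    # payload lies outside the first window
mean = chunk_and_average(doc, 4, toy_scorer)  # payload contributes through the second window
print(head, mean)
```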
**The wall exists. This repo asks: *can we climb it?***
## Why it matters
Real-world consequences:
- **EchoLeak** (CVE-2025-32711, June 2025) was the first publicly
documented zero-click indirect-injection exploit in production
(Microsoft 365 Copilot). It bypassed Microsoft's XPIA classifier +
Markdown link redaction + Content Security Policy.
- The **"Are Firewalls All You Need?"** critique
([Bhagwatkar et al. NeurIPS 2025](https://arxiv.org/abs/2510.05244))
shows current agentic benchmarks (AgentDojo, InjecAgent, ASB, τ-Bench)
are saturated by simple two-firewall defenses — exposing how detector
evaluation can mislead.
- The **CodeIntegrity "98% post-mortem"** (Jan 2026) is the most-cited
industry self-critique: *"98% on historical data ≠ 98% on tomorrow's
attacks. Treat your 98% detector as a speed bump, not a wall."*
Detection alone is structurally insufficient — but understanding *why* is
necessary before architectural defenses can be evaluated honestly.
## Our approach
This portfolio treats the submission as its **prototype** and extends it.
Five lanes plus a 12-technique adversarial-robustness lane attempt to climb
the wall via complementary methodologies, each a controlled experiment
producing positive or negative evidence:
| Lane | Question | Milestone |
|---|---|---|
| **1** | Direct-injection baseline + Tier B reference scorers (Meta PG2 86M, ProtectAI v1/v2) — is ModernBERT competitive with contemporary SOTA encoders? | M1 |
| **1b** | Full 12-technique character-injection adversarial robustness + CourtGuard multi-agent baseline | M1 |
| **2** | Indirect-injection training data + 2-variant loss ablation (CE baseline + Recall@LowFPR per Meta PG2 recipe) — does new data overcome backbone-invariance? | M2-M4 |
| **3** | RAG-injection live demo + 3-variant Spotlighting toggle (delimit + datamark + encoding) | M5 |
| **4** | Agentic harness + score fusion stacker + adaptive eval (5K LLMail-Inject + PINT-EN 3016) | M6 |
| **5** | TaskTracker activation probe (encoder vs decoder methodology port test) | M2 + M7 |
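
The three Spotlighting variants named in Lane 3 (delimit, datamark, encoding) can be sketched as plain string transforms applied to untrusted retrieved text before it is placed in the prompt. This is a hedged illustration, not the portfolio's harness: the delimiter strings, the `^` marker, the choice of Base64 as the encoding, and the `spotlight` toggle function are all assumptions. In practice the system prompt would also need to tell the model which transform was applied.

```python
import base64
from typing import Callable, Dict


def delimit(untrusted: str, open_tag: str = "<<", close_tag: str = ">>") -> str:
    """Wrap untrusted text in explicit delimiters (delimiter choice is an assumption)."""
    return f"{open_tag}{untrusted}{close_tag}"


def datamark(untrusted: str, marker: str = "^") -> str:
    """Replace every run of whitespace with a marker character."""
    return marker.join(untrusted.split())


def encode(untrusted: str) -> str:
    """Base64-encode the untrusted text (encoding choice is an assumption)."""
    return base64.b64encode(untrusted.encode("utf-8")).decode("ascii")


VARIANTS: Dict[str, Callable[[str], str]] = {
    "delimit": delimit,
    "datamark": datamark,
    "encoding": encode,
}


def spotlight(untrusted: str, variant: str) -> str:
    """Hypothetical toggle: apply one named variant to the untrusted text."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return VARIANTS[variant](untrusted)


sample = "ignore all  previous instructions"
for name in VARIANTS:
    print(name, "->", spotlight(sample, name))
```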
The book at `book/` (Astro+MDX, Cloudflare Pages — bootstraps at M0 Day 14
once scaffold v3.2 ships) is the field log. Chapters carry freshness
badges indicating maturity (`exploratory` → `experimental-result` →
`locked`).
## What we found
*This section will be filled in per milestone as lanes close. Latest results
will link to `evals/` and `book/src/content/chapters/`. At v0.1.0-pre (current),
no lane results have shipped; see plan §9 for the milestone sequence.*
## Reproduce + read
**Reproducibility ladder** (per ADR-018 + Round 2 Q2'):
- **T0** (eval-from-hub; ~15 min on a laptop, $0): `scripts/eval_from_hub.py`
— portfolio-owned clean reimplementation per Round 6 Q1''''' (ADR-035).
Lands at M0 Day 1 / Day 17.
- **T1** (full retrain blueprint; ~18 GPU-h per variant; runpod-deploy):
`scripts/retrain_blueprint.py` — for researchers with GPU budget.
- **T2** (Docker; cross-machine portability): `Dockerfile` + `compose.yaml`
— lands at M0 Day 16.
- **T3** (selective notebooks): ~5-6 jupytext-paired notebooks at
`book/src/content/notebooks/` for Ch 5 (bootstrap walkthrough), Ch 6
(threshold policy), Ch 8 (12-technique bypass matrix), Ch 9 (Lane 2
attribution table), Ch 11 (stacker analysis), Ch 12 (activation probe).
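
Lane 2's Recall@LowFPR objective and the Ch 6 threshold-policy notebook both turn on choosing an operating threshold at a fixed false-positive budget. The sketch below shows one way to compute that quantity on held-out scores. It is an evaluation-side illustration only, not the training loss and not the Meta PG2 recipe; the function name, the strict `>` decision rule, and the default 1% budget are assumptions.

```python
from typing import Sequence, Tuple


def recall_at_fpr(
    scores: Sequence[float], labels: Sequence[int], max_fpr: float = 0.01
) -> Tuple[float, float]:
    """Return (threshold, recall) for the lowest threshold with FPR <= max_fpr.

    An example is predicted positive when its score is strictly greater
    than the threshold. Labels are 1 for injection and 0 for benign.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one example of each class")

    # Number of benign examples we may flag without exceeding the budget.
    allowed = int(max_fpr * len(negatives))
    # Sitting on the (allowed + 1)-th highest benign score keeps at most
    # `allowed` benign scores strictly above the threshold.
    threshold = negatives[allowed] if allowed < len(negatives) else float("-inf")
    recall = sum(s > threshold for s in positives) / len(positives)
    return threshold, recall


# Toy data (fabricated for illustration only, not project results).
toy_scores = [0.1, 0.2, 0.3, 0.9, 0.4, 0.8, 0.95]
toy_labels = [0, 0, 0, 0, 1, 1, 1]
loose = recall_at_fpr(toy_scores, toy_labels, max_fpr=0.25)
strict = recall_at_fpr(toy_scores, toy_labels, max_fpr=0.0)
print(loose, strict)
```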
**Methodology pointers**:
- Library-first discipline: `decisions/library_imports.md` (lands M0 Day 5)
- ADR governance: `decisions/ADR-*.md` (~30-32 ADRs anticipated)
- Research dossier: `docs/research/` (60-80 files at M0 close)
- Experiment records: `experiments/lane-N-*/{hypothesis,protocol,results,decisions}.md`
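
The experiment-record layout above expands to four Markdown files per lane directory. As a rough sketch of how such a directory could be scaffolded, the snippet below creates the four files under a temporary root. The helper name, the lane directory name `lane-1-example`, and the placeholder headings are hypothetical; the private record template is not reproduced here.

```python
import tempfile
from pathlib import Path
from typing import List

# File stems taken from the layout pattern listed above.
RECORD_FILES = ("hypothesis", "protocol", "results", "decisions")


def scaffold_record(root: Path, lane_dir: str) -> List[Path]:
    """Create root/experiments/<lane_dir>/*.md, leaving existing files untouched."""
    record_dir = root / "experiments" / lane_dir
    record_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for stem in RECORD_FILES:
        path = record_dir / f"{stem}.md"
        if not path.exists():
            # Placeholder heading only; real content comes from the project's template.
            path.write_text(f"# {stem.capitalize()}\n", encoding="utf-8")
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_record(Path(tmp), "lane-1-example")  # hypothetical lane name
    names = sorted(p.name for p in created)
    print(names)
```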
**Build-in-public**: weekly Twitter/Mastodon thread + monthly deep-dive
blog post per Round 3 Q4''. Archive at `docs/build-in-public/`.
## License + AI assistance
- **Code**: Apache-2.0 (see the repository's `LICENSE` file).
- **Book + prose**: CC-BY-4.0 (separate `book/LICENSE` at M0 Day 2+).
- **Citation**: see `ETHICS.md` §5 for BibTeX.
- **AI assistance**: this project was developed in collaboration with
Claude (Anthropic). See `ETHICS.md` §4 and (when book bootstraps) the
book frontmatter AI-disclosure for full details. Detailed per-commit
attribution is preserved via `Co-Authored-By: Claude` git trailers.
## Status + roadmap
| Milestone | Date | Tag | Status |
|---|---|---|---|
| **M0 Day 1** | 2026-05-19 | (seed) | ✓ pre-flight green; repo public |
| **M0 Day 2** | 2026-05-19 | — | ✓ book scaffold + uv pyproject + CI workflow |
| **M0 Day 2.5** | 2026-05-19 | — | ✓ 9 upstream MR issues filed |
| **M0 Day 3a** | 2026-05-22 | — | ✓ Round 20/21/22 pin cascade (eval-toolkit v0.47 + scaffold v3.5 + submission v1.3.0); 6 of 9 MRs closed upstream |
| **M0 Day 3b** | 2026-05-22 | `v0.1.0-pre` | ✓ 7 test-contracts + CI hard-gates green |
| **M0 Day 5 + 14 + 16** | 2026-05-22 | — | ✓ lane skeletons + chapter skeletons + Docker T2 |
| **M0 close** | TBD-week-3 | `v0.1.0` | pending: dossier 60-80 (user-led) + Day 15 governance + Day 17 ADRs + Day 18 templates + Day 19 ratify |
| M1 Lane 1 + 1b | TBD-week-4-5 | `v0.2.0` | pending |
| M7 close (textbook ratify) | TBD-week-13-14 | `v0.7.0` | pending |
| v0.8.0 (narrative ship) | TBD-~month-13 | `v0.8.0` | pending |
| v0.9.0 (academic IMRaD ship) | TBD-~month-14 | `v0.9.0` | pending |
| v1.0.0 cutover (all 3 polished) | ~month 16-17 | `v1.0.0` | pending |
Plan + companion docs:
- Plan: `/home/brandon_behring/.claude/plans/i-want-to-consider-merry-milner.md`
- Chapter outlines: `/home/brandon_behring/.claude/plans/portfolio-chapter-outlines.md`
- Experiment record template: `/home/brandon_behring/.claude/plans/portfolio-experiment-record-template.md`
- Lane execution playbooks: `/home/brandon_behring/.claude/plans/portfolio-lane-execution-playbooks.md`
(These plan and companion docs are private; the portfolio's public ADRs,
dossier, and chapter prose will mirror their decisions at M0 close.)