rosscyking1115/llm-redteam-harness
GitHub: rosscyking1115/llm-redteam-harness
Stars: 0 | Forks: 0
# llm-redteam-harness
[](https://github.com/rosscyking1115/llm-redteam-harness/actions/workflows/ci.yml)
[](./LICENSE)
[](https://www.python.org/downloads/release/python-3130/)
## Headline finding
Across **2 target models**, **2 benchmark families**, and up to **4 composable
defence configurations** — 12 evaluation cells in total — published adversarial
prompts succeed between **0% and 4%** of the time, and a paranoid prompt-only
defence stack does **not** measurably move that number.
Every attack-success verdict is scored by an LLM judge and then independently
re-scored by a second judge model. Judge agreement on attack success is
**perfect — Cohen's κ = +1.00 in all 12 cells.**
The honest reading: 2026-era instruction tuning already neutralises these
published, *static* attacks on both a frontier model (Claude Sonnet 4.6) and a
small local model (Llama 3.1 8B). Prompt-only defences change the *style* of a
refusal, not the safety outcome. The open risk these static benchmarks
under-measure is the **full agentic loop** — interactive, multi-step tool use —
which is named explicitly as future work, not quietly omitted.

### AdvBench — direct attacks (n = 100 per cell)
| Target | Baseline ASR | Full-stack ASR |
| --- | ---: | ---: |
| Claude Sonnet 4.6 | 0% [0, 0] | 0% [0, 0] |
| Llama 3.1 8B (local) | 1% [0, 3] | 0% [0, 0] |
### AgentDojo — static indirect injection (n = 50 per cell)
| Target | Baseline | + Spotlighting | + SecAlign | Full prompt stack |
| --- | ---: | ---: | ---: | ---: |
| Claude Sonnet 4.6 | 0% | 0% | 0% | 0% |
| Llama 3.1 8B (local) | 4% [0, 10] | 0% | 0% | 0% |
All figures are attack success rate (ASR) from the LLM judge; brackets are 95%
percentile-bootstrap confidence intervals. Cross-judge κ on ASR = +1.00 for
every cell. Full numbers, validation, and limits in
[`METHODOLOGY.md`](./METHODOLOGY.md).
## Why this reports ASR and not refusal rate
The harness was built to report how trustworthy its own metrics are. The
cross-judge layer found that **ASR is well-posed and `refusal_rate` is not**:
the two judges agree perfectly on whether an attack succeeded, but disagree —
sometimes worse than chance — on whether a response was a "refusal", because an
indirect-injection task has two things that can be refused (the user's request
and the injected instruction). `refusal_rate` is therefore reported as a
*descriptive* signal of response style only. This is documented, not hidden —
see [`METHODOLOGY.md`](./METHODOLOGY.md) §7.
## What this is
A reproducible benchmark that:
## Status
The v1 evaluation matrix is complete and the harness is feature-complete for
v1 scope. The next tracks — the full AgentDojo agent loop, multi-turn attacks,
and Inspect AI export — are listed in [`METHODOLOGY.md`](./METHODOLOGY.md) §12.
## Getting started
git clone https://github.com/rosscyking1115/llm-redteam-harness.git
cd llm-redteam-harness
uv venv --python 3.13
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows PowerShell
uv pip install -e ".[dev]"
cp .env.example .env # fill in ANTHROPIC_API_KEY
pre-commit install
pytest tests/unit # should pass green
python -m redteam version # prints 0.1.0
The CLI is invoked as `python -m redteam ...`. Each run enforces a hard USD
budget cap (set per config in `configs/`); set a matching console cap before
the first run.
python -m redteam corpora download # fetch + pin corpora
python -m redteam run --config configs/run_anthropic_baseline.yaml
python -m redteam score --run results/.json
python -m redteam cross-judge --run results/.judged.json
## Inspect AI compatibility
Any run exports to a [UK AI Security Institute Inspect](https://inspect.aisi.org.uk/)
eval log, so results open in `inspect view` or load with `read_eval_log()`:
uv pip install -e ".[inspect]" # optional extra
python -m redteam export-inspect --run results/.json
Each case maps to an Inspect sample scored by the LLM-judge ASR verdict;
cross-judge agreement, confidence intervals, and cost travel in the metadata.
## Running CI locally before pushing
`scripts/ci_local.ps1` (Windows) and `scripts/ci_local.sh` (Linux/macOS) run
the **exact** same checks as `.github/workflows/ci.yml` — ruff lint, ruff
format check, mypy, pytest. Activate the venv, then:
scripts\ci_local.ps1
If it exits green, CI on the PR will be green too.
## Repository layout
See `PROJECT-1-KIT.md` §6 for the target layout. `src/redteam/` holds the
schemas, corpus loaders, target adapters, defences, orchestrator, scorers, and
CLI; `configs/` holds run configs and pinned dataset versions; `results/` holds
run artifacts (gitignored — re-creatable from configs).
## Ethics
This project uses **only** published adversarial prompts. Excluded categories
(CSAM, weapons-of-mass-destruction synthesis, detailed self-harm methods) are
filtered at corpus-load time and verified by a CI test. Results are aggregate —
no raw harmful outputs are committed to this repo.
If you are a model provider whose model is included and want example
transcripts removed, email **rosscyking@gmail.com** — **24-hour removal
commitment**. See [`ETHICS.md`](./ETHICS.md).
## Citation
@software{llm_redteam_harness_2026,
title = {llm-redteam-harness: Reproducible LLM defence evaluation},
author = {Ross},
year = {2026},
url = {https://github.com/rosscyking1115/llm-redteam-harness}
}
## Licence
MIT — see [`LICENSE`](./LICENSE).