GitHub: TheAuditorTool/BenchProctor
# BenchProctor
**Ground truth for SAST.** An open, machine-verifiable benchmark corpus for measuring how
accurately a static analysis tool finds real vulnerabilities — and how often it flags safe code.
**[benchproctor.com](https://benchproctor.com)** · [blog](https://blog.benchproctor.com) · Apache-2.0
A SAST tool is only as trustworthy as its accuracy, and accuracy is unmeasurable without ground
truth. BenchProctor gives you labeled corpora — programs marked `vulnerable` or `safe` — so you
can score any tool that emits SARIF 2.1.0 and get a real number: true-positive rate,
false-positive rate, and overall detection accuracy (Youden's J).
## Quick start
```bash
# 1. run your scanner against a corpus, export SARIF 2.1.0
your-tool scan ./corpus --format sarif -o results.sarif

# 2. score against the answer key (standard-library Python, zero dependencies)
python scripts/score_sarif.py results.sarif corpus/expectedresults-*.csv

# 3. read TPR, FPR, and your Youden's J (category-averaged and flat aggregate)
```
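For orientation, here is a minimal sketch of the join that step 2 performs: collect the files a SARIF log flags, then compare them with the labels. It is not the code in `scripts/score_sarif.py`. It assumes matching at file granularity only (the real scorer may also compare rule or CWE identifiers), and the file names, the `demo-tool` driver name, and the in-memory `labels` dict are hypothetical stand-ins for a real corpus and answer key.

```python
import json

def flagged_uris(sarif_text):
    """Collect every artifact URI that appears in at least one SARIF result."""
    log = json.loads(sarif_text)
    uris = set()
    for run in log.get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                artifact = location.get("physicalLocation", {}).get("artifactLocation", {})
                if "uri" in artifact:
                    uris.add(artifact["uri"])
    return uris

def confusion(labels, flagged):
    """Tally TP/FN/FP/TN given {file: label} and the set of flagged files."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for path, label in labels.items():
        if label == "vulnerable":
            counts["TP" if path in flagged else "FN"] += 1
        else:
            counts["FP" if path in flagged else "TN"] += 1
    return counts

def result_for(uri):
    """Build a minimal SARIF result object pointing at one file."""
    return {
        "message": {"text": "demo finding"},
        "locations": [{"physicalLocation": {"artifactLocation": {"uri": uri}}}],
    }

# Hypothetical tool output: flags a.py (vulnerable) and c.py (safe).
sarif_text = json.dumps({
    "version": "2.1.0",
    "runs": [{"tool": {"driver": {"name": "demo-tool"}},
              "results": [result_for("a.py"), result_for("c.py")]}],
})

# Hypothetical answer key, already loaded from the CSV into a dict.
labels = {"a.py": "vulnerable", "b.py": "vulnerable", "c.py": "safe", "d.py": "safe"}

counts = confusion(labels, flagged_uris(sarif_text))
print(counts)  # {'TP': 1, 'FN': 1, 'FP': 1, 'TN': 1}
```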
## Why another benchmark
Existing public SAST benchmarks share three structural weaknesses:
- **Hand-authored and frozen.** A fixed set of human-written cases gets published once and never
changes, so tools — and the models behind them — overfit to it. A high score stops meaning
real-world accuracy.
- **The filename leaks the answer.** When a test lives at `sqli/Test01729_true_positive.java`, a
scanner can score well by matching the path, not by analyzing code.
- **One language, one file, no defenses.** Real findings cross files, services, and languages and
sit next to sanitizers that almost work. Single-file, single-language suites never exercise that.
BenchProctor is built to remove all three.
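
To make the filename leak concrete, the sketch below scores a fake "scanner" that never opens a file. The first path comes from the example above; the sibling `Test01730_false_positive.java` and the opaque `case_*.java` names are hypothetical, chosen only for illustration.

```python
def path_only_scanner(path):
    """A fake 'scanner' that never reads code: it flags any path containing 'true_positive'."""
    return "true_positive" in path

def youden_j(labels, flag):
    """Score a flagging function against {path: label}; returns TPR - FPR."""
    vulnerable = [p for p, label in labels.items() if label == "vulnerable"]
    safe = [p for p, label in labels.items() if label == "safe"]
    tpr = sum(flag(p) for p in vulnerable) / len(vulnerable)
    fpr = sum(flag(p) for p in safe) / len(safe)
    return tpr - fpr

# Answer-leaking layout: the label is readable from the path.
leaky = {
    "sqli/Test01729_true_positive.java": "vulnerable",
    "sqli/Test01730_false_positive.java": "safe",  # hypothetical sibling file
}

# Opaque layout: hypothetical shuffled IDs that say nothing about the label.
opaque = {
    "case_48213.java": "vulnerable",
    "case_07756.java": "safe",
}

print(youden_j(leaky, path_only_scanner))   # 1.0
print(youden_j(opaque, path_only_scanner))  # 0.0
```

The same path-matching trick that earns a perfect score on the leaky layout earns nothing once the names are opaque.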
## What's in the corpus
| | |
|---|---|
| **Languages** | 9: Python, Go, Java, JavaScript, TypeScript, PHP, Ruby, Bash, Rust |
| **Frameworks** | 18: at least two per language except Bash (standalone), with real idioms (DTOs, Pydantic models, ORM calls) |
| **Categories** | 234 |
| **Unique CWEs** | 219 |
| **OWASP Top 10 2025** | 213 / 249 mapped CWEs (85.5%) |
| **Balance** | ~50 / 50 vulnerable / safe |
- **Combinatorial, not hand-written.** Each category is a vulnerability class expressed as a taint
flow over four axes — where untrusted input enters (**source**), how it travels (**propagator**),
what would neutralize it (**sanitizer**), and the dangerous call it reaches (**sink**). The corpus
is assembled by combining those building blocks (42 sources × 40 propagators × 65 sanitizers ×
58 sinks): a vulnerable case omits an effective sanitizer; its safe twin applies one. Every
emitted combination is constrained to a realistic flow.
- **Anti-leakage by construction.** Emitted files carry no comments, no CWE tags, no category names,
and no hints in identifiers. File IDs are shuffled, so a filename reveals nothing about a file's
category or label. The CSV answer key is the only ground truth.
- **Quarterly rotation.** Each release is generated from a fixed seed that changes *which*
combinations are emitted — so the actual code differs every quarter — while holding every
scoring-relevant invariant constant (CWE identity, difficulty distribution, 50/50 balance,
language/framework coverage). Same seed reproduces the corpus byte-for-byte; a new seed yields
fresh variants you can't have pre-trained against. Last quarter's score stays comparable.
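
The three properties above (combinatorial assembly, opaque file IDs, seeded rotation) can be sketched together. This is an illustration under stated assumptions, not the BenchProctor generator: the block names are hypothetical, the lists are tiny stand-ins for the real 42 × 40 × 65 × 58 axes, and the realism constraints are omitted, so the raw product is only an upper bound on what is actually emitted.

```python
import itertools
import random

# Upper bound on raw combinations from the stated block counts.
RAW_COMBINATIONS = 42 * 40 * 65 * 58  # 6,333,600 before realism constraints

# Hypothetical, tiny stand-ins for the real building blocks.
SOURCES = ["query_param", "http_header", "json_body"]
PROPAGATORS = ["direct_assign", "string_concat"]
SANITIZERS = ["parameterized_query", "allowlist_check"]
SINKS = ["sql_execute", "shell_exec"]

def emit_corpus(seed, pairs):
    """Pick `pairs` flows with a seeded RNG and emit a vulnerable/safe twin for each."""
    rng = random.Random(seed)
    flows = list(itertools.product(SOURCES, PROPAGATORS, SANITIZERS, SINKS))
    cases = []
    for source, propagator, sanitizer, sink in rng.sample(flows, pairs):
        base = {"source": source, "propagator": propagator, "sink": sink}
        # The vulnerable case omits the sanitizer; its safe twin applies it.
        cases.append({**base, "sanitizer": None, "label": "vulnerable"})
        cases.append({**base, "sanitizer": sanitizer, "label": "safe"})
    # Shuffle before assigning IDs so a file ID reveals neither flow nor label.
    rng.shuffle(cases)
    for index, case in enumerate(cases):
        case["file_id"] = f"case_{index:05d}"
    return cases

corpus = emit_corpus(seed=2025, pairs=5)
for case in corpus[:3]:
    print(case["file_id"], case["label"], case["sink"])
```

Calling `emit_corpus` twice with the same seed returns the same list, and a different seed will generally select different flows, while the vulnerable/safe balance stays exactly 50/50 by construction.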
## Beyond single files — the test shapes
Detecting a bare `eval(input)` is table stakes. The corpus is weighted toward the findings that
separate a real analyzer from a pattern matcher.
## Polyglot coverage
| Language | Frameworks |
|---|---|
| Python | Flask, Django, FastAPI |
| Go | net/http, Gin |
| Java | Spring, Jakarta EE |
| JavaScript | Express, Koa |
| TypeScript | NestJS, Express |
| PHP | Laravel, Symfony |
| Ruby | Rails, Sinatra |
| Bash | standalone |
| Rust | Actix-web, Axum |
Adding a language changes nothing about the categories, so coverage stays uniform across the matrix.
## OWASP Top 10 2025
| Category | Covered / Mapped | Coverage |
|---|---|---|
| A01 Broken Access Control | 37 / 40 | 92% |
| A02 Security Misconfiguration | 11 / 16 | 69% |
| A03 Software Supply Chain | 0 / 6 | composition analysis, not code-pattern SAST |
| A04 Cryptographic Failures | 30 / 32 | 94% |
| A05 Injection | 31 / 37 | 84% |
| A06 Insecure Design | 27 / 39 | 69% |
| A07 Authentication Failures | 34 / 36 | 94% |
| A08 Software & Data Integrity | 8 / 14 | 57% |
| A09 Logging & Alerting Failures | 5 / 5 | 100% |
| A10 Exceptional Conditions | 22 / 24 | 92% |
213 of 249 mapped CWEs (85.5%). The remainder is config-level, supply-chain, or runtime-only — not
expressible as a static code pattern.
## Scoring
Every test case carries a ground-truth label (`vulnerable` or `safe`) in a CSV answer key. After a
tool runs, scoring computes a confusion matrix and one subtraction:
```text
              detected   ignored
vulnerable       TP         FN
safe             FP         TN

TPR = TP / (TP + FN)    detection rate
FPR = FP / (FP + TN)    false-alarm rate
J   = TPR - FPR         Youden's J (the score)
```
| Score | Meaning |
|------:|---------|
| +100% | Perfect — catches everything, zero false alarms |
| 0% | No better than guessing (a flag-everything or flag-nothing tool lands here, because its TPR and FPR are equal) |
| -100% | Inverted — flags safe code, misses real bugs |
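
As a worked example, the three rows of the table fall straight out of the formulas. The function below is a sketch, not the bundled scorer; the counts are made up, and returning 0.0 for an empty denominator is an assumption of this sketch.

```python
def score(tp, fn, fp, tn):
    """Return (TPR, FPR, J) from confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # assumption: empty class scores 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr, tpr - fpr

print(score(tp=50, fn=0, fp=0, tn=50))    # (1.0, 0.0, 1.0)   perfect
print(score(tp=50, fn=0, fp=50, tn=0))    # (1.0, 1.0, 0.0)   flags everything
print(score(tp=0, fn=50, fp=50, tn=0))    # (0.0, 1.0, -1.0)  inverted
print(score(tp=30, fn=10, fp=10, tn=30))  # (0.75, 0.25, 0.5) a middling tool
```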
Scores are reported two ways: **category-averaged** (each category weighted equally so large
categories can't dominate — the headline number) and **flat aggregate**. Any tool that emits SARIF
2.1.0 can be scored; the scorer is a single standard-library Python file with no dependencies.
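
The difference between the two reported numbers is easiest to see on a lopsided example. The sketch below assumes each case is available as a `(category, label, flagged)` tuple; the category names and counts are hypothetical, and the real scorer's internals may differ.

```python
from collections import defaultdict

def youden_j(rows):
    """rows: iterable of (label, flagged) pairs; returns TPR - FPR."""
    rows = list(rows)
    vulnerable = [flagged for label, flagged in rows if label == "vulnerable"]
    safe = [flagged for label, flagged in rows if label == "safe"]
    tpr = sum(vulnerable) / len(vulnerable) if vulnerable else 0.0
    fpr = sum(safe) / len(safe) if safe else 0.0
    return tpr - fpr

def both_scores(cases):
    """Return (category-averaged J, flat aggregate J)."""
    by_category = defaultdict(list)
    for category, label, flagged in cases:
        by_category[category].append((label, flagged))
    flat = youden_j((label, flagged) for _, label, flagged in cases)
    averaged = sum(youden_j(rows) for rows in by_category.values()) / len(by_category)
    return averaged, flat

# A large category the tool handles perfectly, and a small one it misses entirely.
cases = (
    [("sql_injection", "vulnerable", True)] * 4
    + [("sql_injection", "safe", False)] * 4
    + [("weak_hash", "vulnerable", False), ("weak_hash", "safe", False)]
)

averaged, flat = both_scores(cases)
print(averaged, flat)  # 0.5 0.8
```

The flat aggregate (0.8) is dominated by the large category, while the category-averaged score (0.5) gives the missed category equal weight.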
## Machine-verifiable ground truth
Every corpus ships a **proof manifest** — one record per file naming its exact source, propagator,
sanitizer, sink, difficulty, the sink's line number, and a SHA-256 of the file — so any label can be
independently audited from metadata alone. A bundled **self-test SARIF** scores a perfect Youden's J
against the answer key, proving the labels and the scorer agree before the benchmark can mislead you.
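
A manifest record can be checked against a file with nothing but the standard library. The sketch below is illustrative only: the record layout and every field name (`file`, `sha256`, `sink_line`, and so on) are assumptions, since the manifest schema is not spelled out here, and the two-line file body is invented.

```python
import hashlib

def verify_record(record, content):
    """Check a file's bytes against a (hypothetical) manifest record."""
    digest_ok = hashlib.sha256(content).hexdigest() == record["sha256"]
    # The recorded sink line must exist in the file.
    line_ok = 1 <= record["sink_line"] <= len(content.splitlines())
    return digest_ok and line_ok

# Invented two-line file; the sink call sits on line 2.
content = b"value = request_arg()\nrun_query(value)\n"

# Hypothetical record; the digest is computed here so the example is self-consistent.
record = {
    "file": "case_00042.py",
    "source": "query_param",
    "propagator": "direct_assign",
    "sanitizer": None,
    "sink": "sql_execute",
    "difficulty": "easy",
    "sink_line": 2,
    "sha256": hashlib.sha256(content).hexdigest(),
}

print(verify_record(record, content))         # True
print(verify_record(record, content + b"#"))  # False: any edit changes the digest
```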
## Releases
Corpora are versioned and released quarterly. The scorer in `scripts/score_sarif.py` is
standard-library Python only — clone, point it at a corpus and your SARIF, and read your number.
## License
Apache License 2.0 — see [LICENSE](LICENSE). Created and maintained by the author of BenchProctor.