gapilongo/pentest-copilot

# Pentest Copilot

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/d2dce17641025007.svg)](https://github.com/gapilongo/pentest-copilot/actions/workflows/ci.yml) [![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE) [![Release](https://img.shields.io/github/v/release/gapilongo/pentest-copilot)](https://github.com/gapilongo/pentest-copilot/releases) [![Good first issues](https://img.shields.io/github/issues/gapilongo/pentest-copilot/good%20first%20issue?label=good%20first%20issues)](https://github.com/gapilongo/pentest-copilot/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) [![Code of Conduct](https://img.shields.io/badge/code%20of%20conduct-Contributor%20Covenant-blueviolet)](CODE_OF_CONDUCT.md)

A self-hosted, single-operator pentest copilot. Answers come from curated **operator substrate** — a structured technique catalog, sequenced playbooks, freshness-aware retrieved sources, and deterministic tool calls — **not from model recall**. Every claim is either cited to a real source, drawn from a real catalog/playbook entry, or explicitly tagged as model knowledge. A verifier rejects answers that don't comply and regenerates.

Authorized engagements only. No external API. Runs on one box (default: 4×A10 / 92 GB VRAM).

## Why this exists

A senior pentester using a generic LLM (ChatGPT, Claude, vendor "AI assistants") on a live engagement hits four reliable failure modes:

1. **Confident hallucination** — invented CVE IDs ("ProxyLogoff"), fake tool flags, wrong CVE→product attribution
2. **Stale guidance** — 2019-era Potato-on-Win11 recommendations against modern Falcon
3. **OPSEC overstatements** — "this is undetectable" / "indistinguishable from admin work" claims with no evidence
4. **Generic textbook prose** — listing technique families without the operational specifics (exact flag, exact registry key, the `-p` on SUID bash) that distinguish "I've done this" from "I've read about this"

This system addresses each structurally:

- **Operator-curated substrate** beats model recall. 38 catalog entries + 16 playbooks contain hand-written commands, EKU OIDs, msPKI flags, GTFOBins exploits — the things a model hallucinates around.
- **Deterministic tools** for verifiable facts (CISA KEV lookup, MITRE ID lookup, NVD CVE lookup). Tool output ships as ground truth.
- **10+ regression-pattern rules** in the verifier physically block known-bad outputs (PwnKit-as-kernel-exploit, "indistinguishable from admin", SUID-bash-without-`-p`) at composition time, so a fixed bug stays fixed across regenerations and model swaps.
- **Asset-shape scope filter** ignores `github.com`, `os.system`, `mysql.func`, `169.254.169.254` and similar non-target tokens that produce alarm-fatigue scope warnings on every other "AI security tool."
- **LLM-as-judge eval framework** calibrated to 97% catches regressions on every code/content change; current senior-grade quality is **8.7/10** on a hand-graded 5-case oracle set.

It's not magic — it's a copilot. Roughly 1 in 4 answers needs an operator edit before shipping to a junior teammate. But the bugs that get edited *don't recur*, because each one becomes a verifier rule.

## Contents

1. [What it does (concretely)](#what-it-does-concretely)
2. [System architecture](#system-architecture)
3. [Request lifecycle](#request-lifecycle)
4. [Knowability routing](#knowability-routing)
5. [Verifier + regenerate loop](#verifier--regenerate-loop)
6. [Installation](#installation)
7. [Running it](#running-it)
8. [Your first session — three example questions](#your-first-session--three-example-questions)
9. [Adding content (catalog + playbook + RAG pack)](#adding-content)
10. [Supported models + GPU requirements](#supported-models--gpu-requirements)
11. [Configuration reference](#configuration-reference)
12. [Substrate layout](#substrate-layout)
13. [Structural verifier rules](#structural-verifier-rules)
14. [Quality eval framework](#quality-eval-framework)
15. [Repository layout](#repository-layout)
16. [Troubleshooting](#troubleshooting)
17. [Honest limitations](#honest-limitations)
18. [Authorized use](#authorized-use)

## What it does (concretely)

- **Pulls structured technique records** when a prompt matches one (38 catalog entries: ESC1-15, Kerberoasting, RBCD, SUID abuse, web SQLi/XSS/SSRF/XXE, GTFOBins primitives, …). Renders the entry's defining conditions, exploitation steps, detection signals, and operator-prereqs verbatim into the prompt.
- **Fires multi-phase playbooks** for engagement-scenario prompts (16 playbooks: low-priv Linux/Win privesc, AD enumeration, ADCS triage, post-DCSync, external recon, web app pentest, etc.).
- **Calls deterministic tools** for verifiable facts: `lookup_cve` (CISA KEV + NVD), `check_kev`, `lookup_attack_id` (MITRE), `fetch_research_blog`. Tool output ships as ground truth.
- **Runs hybrid RAG** over 6 freshness-aware packs (operator_hacktricks, cve_kev, attack_mitre, nvd_cve, github_releases, research_blogs).
- **Classifies knowability** of each question (`verifiable_fact` / `time_decaying_documented` / `pure_reasoning` / `unverifiable_prediction`) and applies a class-specific stance: tool-grounded for facts, source-cited for documented, honest-stop for predictions.
- **Verifies every answer** against 17 structural rules before shipping; regenerates on hard failures; never returns "system rejected" — graceful fail-open with banner.
- **Filters operator-visible output** against the engagement scope: redacts out-of-scope assets, queues unknown ones for review, ignores tool-repo / DB-system / cloud-metadata false positives.
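The knowability classification listed above is described elsewhere in this README as regex-first with an LLM fallback. The sketch below illustrates only that first stage. The pattern list and the function name `classify_knowability` are assumptions made for illustration (the real patterns live in `src/agent/knowability_router.py` and are not reproduced here); only the four class names and the example trigger phrases come from this document.

```python
import re
from typing import Optional

# Hypothetical patterns, loosely based on the trigger phrases quoted in this
# README. They are checked in order; the first match wins.
_PATTERNS = [
    ("verifiable_fact",
     re.compile(r"CVE-\d{4}-\d+|\bATT&CK ID\b|\bin KEV\b", re.I)),
    ("time_decaying_documented",
     re.compile(r"\bwalk me through\b|\bhow does\b.*\bwork\b|\bbest practice\b", re.I)),
    ("pure_reasoning",
     re.compile(r"\brank\b|\bwhich would you try first\b|\bcompare\b", re.I)),
    ("unverifiable_prediction",
     re.compile(r"\bwill\b.*\bdetect\b|\bstealthy\b|\bhow likely\b", re.I)),
]


def classify_knowability(prompt: str) -> Optional[str]:
    """Return the first matching class name, or None when no regex matches.

    A None result is where a caller would hand the prompt to the LLM
    fallback classifier; that step is deliberately left out of this sketch.
    """
    for class_name, pattern in _PATTERNS:
        if pattern.search(prompt):
            return class_name
    return None
```

Ordering matters in a first-match design like this: a prompt that contains both a CVE ID and the word "rank" is treated as a fact lookup, which may or may not match what the real router does.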
Current quality scores (verifiable in `docs/`):

| Metric | Value | Source |
|---|---|---|
| Senior-graded quality (5-case oracle) | **8.7 / 10** | `docs/quality_v13_7_*.md` |
| Substrate-hit % (151-prompt coverage corpus) | **92.7 %** | `docs/v13_6_qwen36_COVERAGE_MATRIX.md` |
| Hard failures across 151 prompts | **1** | same |
| Fake citations across 151 prompts | **0** | same |
| Judge calibration pass rate | **97 %** | `docs/quality_judge_calibration.md` |

## System architecture

```mermaid
flowchart TD
    U[User prompt + engagement context] --> INT[Intent classifier<br/>src/agent/intent.py]
    INT --> SC[Input scope filter<br/>asset-shape gate<br/>src/engagement/input_scope_filter.py]
    SC --> KR[Knowability router<br/>fact / doc / reasoning / prediction<br/>src/agent/knowability_router.py]
    KR --> SK[Skill: answer_freeform<br/>src/agent/skills.py]
    SK --> CAT[Technique catalog<br/>match by alias + keyword<br/>38 entries · data/technique_catalog/]
    SK --> PB[Playbook matcher<br/>intent + context patterns<br/>16 playbooks · data/playbooks/]
    SK --> RAG[RAG packs<br/>6 freshness-aware corpora<br/>data/rag/packs/]
    SK --> TOOL[Deterministic tools<br/>lookup_cve · check_kev · lookup_attack_id<br/>src/agent/deterministic_tools.py]
    CAT --> PROMPT[Compose system prompt<br/>catalog block + playbook block +<br/>RAG hits + tool definitions]
    PB --> PROMPT
    RAG --> PROMPT
    TOOL --> PROMPT
    PROMPT --> LLM[vLLM backend<br/>Qwen3-Next-80B-A3B-FP8<br/>4×A10 · 4K-8K ctx · hermes tool parser]
    LLM --> DRAFT[Model draft]
    DRAFT --> PV[Provenance verifier<br/>src/agent/provenance_verifier.py<br/><br/>source/tool tag validity ·<br/>10 regression patterns ·<br/>phase monotonicity ·<br/>truncation detection]
    PV -- hard_fail<br/>3 attempts<br/>+1.5× budget --> REGEN[Regenerate<br/>with constraint message]
    REGEN --> LLM
    PV -- pass / best attempt --> OSF[Output scope filter<br/>src/engagement/output_filter.py<br/>strip recommendations on flagged unknown assets]
    OSF --> RW[Answer rewriter<br/>src/agent/answer_rewriter.py<br/>banner · sectioning · graceful fail-open]
    RW --> OUT[Operator-visible answer<br/>+ AuditPanel JSON]
    style PROMPT fill:#e8f4f8,stroke:#1a73e8
    style PV fill:#fce8e6,stroke:#d93025
    style OSF fill:#fef7e0,stroke:#f9ab00
    style OUT fill:#e6f4ea,stroke:#137333
```

Module → file table for engineers navigating the code:

| Concern | File | Role |
|---|---|---|
| HTTP API | `src/serving/case_server.py` | FastAPI; `POST /api/cases/{id}/chat` is the entry point |
| Skill dispatcher | `src/agent/skills.py` | `_skill_answer_freeform` orchestrates substrate fetch + LLM + verifier + regen |
| Intent classification | `src/agent/intent.py` | Maps user message → skill |
| Knowability routing | `src/agent/knowability_router.py` + `_llm_fallback.py` | 4-class classifier (regex first, LLM fallback) |
| Technique catalog | `src/agent/technique_catalog.py` + `data/technique_catalog/*.yaml` | Records + render-to-prompt |
| Playbooks | `src/agent/playbook_matcher.py` + `data/playbooks/*.yaml` | Multi-phase canonical sequences |
| RAG | `src/agent/rag_loop.py` + `src/agent/rag_router.py` + `src/rag/*.py` | Per-pack retrieval, freshness ranking, tool-call integration |
| Deterministic tools | `src/agent/deterministic_tools.py` + `src/rag/ingest_*.py` | CVE/MITRE/KEV lookups + ingest pipelines |
| Provenance verifier | `src/agent/provenance_verifier.py` | Tag validity + regression patterns + phase monotonicity + truncation |
| Prereq verifier | `src/agent/verifier.py` | Demote unsupported claims to "recon candidates" |
| Scope filters | `src/engagement/scope.py`, `output_filter.py`, `input_scope_filter.py` | Asset-shape gate, in/out-of-scope policy |
| Answer rewriter | `src/agent/answer_rewriter.py` | Section composition, banners, graceful fail-open |
| Jailbreak detector | `src/agent/jailbreak_detector.py` | Reframe attempts that bypass prereq-checking |
| Frontend | `frontend/src/` (React 18 + Vite + Tailwind + zustand + reactflow) | Chat UI + AuditPanel surfacing per-turn signals |

## Request lifecycle

Sequence of what actually happens when you hit `POST
/api/cases/{id}/chat`:

```mermaid
sequenceDiagram
    autonumber
    participant Op as Operator
    participant API as case_server
    participant Sk as Skill: answer_freeform
    participant Sub as Substrate<br/>(catalog · playbook · RAG · tools)
    participant LLM as vLLM (Qwen3-Next)
    participant Ver as Provenance Verifier
    participant Rw as Answer Rewriter
    Op->>API: POST /chat {message}
    API->>Sk: dispatch with engagement context
    Sk->>Sub: knowability route + match catalog + match playbook
    Sub-->>Sk: catalog hits + playbook + freshness-ranked RAG hits + tool defs
    Sk->>LLM: system_prompt + tools, max_tokens 1200
    LLM-->>Sk: draft #1
    loop up to 3 attempts
        Sk->>Ver: verify(draft, valid_hit_ids, tools_called)
        alt verifier passes
            Ver-->>Sk: clean
        else hard_failures
            Ver-->>Sk: findings (PwnKit-as-kernel? truncation? fabricated cite?)
            Sk->>LLM: regenerate with constraint message (+1.5× budget if truncation)
            LLM-->>Sk: draft #N+1
        end
    end
    Sk->>Rw: model_draft + RewriteFindings
    Rw->>Rw: strip OOS-asset recommendations + add banners + section
    Rw-->>API: operator-visible answer
    API-->>Op: {assistant: {content, skill_calls: [...]}}
    Op->>Op: read content + inspect AuditPanel (catalog/playbook/<br/>verifier findings/scope warnings)
```

Per-turn latency is **3-30 s** on the v13.7 default config (median ~5 s warm). Long-tail latencies are typically a playbook firing + 3-attempt regen (the eval framework records this in `regen_attempts`).

## Knowability routing

Not every question deserves the same answer shape. The knowability router classifies the prompt into one of four classes and the skill applies a class-specific policy:

```mermaid
flowchart TD
    Q[User prompt] --> R1{Matches verifiable_fact<br/>patterns?<br/>'CVE-XXXX', 'what is the ATT&CK ID for X',<br/>'is X in KEV', exact lookups}
    R1 -- yes --> VF[verifiable_fact<br/>→ FORCE deterministic tool call<br/>→ ship tool output as ground truth<br/>→ short answer, citation required]
    R1 -- no --> R2{Matches time_decaying_documented?<br/>'walk me through X', 'how does X work',<br/>'best practice for Y', vendor behavior}
    R2 -- yes --> TD[time_decaying_documented<br/>→ rag_required mode<br/>→ catalog+playbook surface ALL specifics<br/>→ freshness warning if cited chunk stale]
    R2 -- no --> R3{Matches pure_reasoning?<br/>'rank these', 'which would you try first',<br/>'compare X and Y', operator judgment}
    R3 -- yes --> PR[pure_reasoning<br/>→ rag_optional<br/>→ honest stop if recon required<br/>→ may decline to rank without evidence]
    R3 -- no --> R4{Matches unverifiable_prediction?<br/>'will Falcon detect this', 'is this stealthy',<br/>'how likely is X'}
    R4 -- yes --> UP[unverifiable_prediction<br/>→ MUST hedge with 'depends on…'<br/>→ name what would let operator confirm<br/>→ stance_violation hard-fail otherwise]
    R4 -- no --> FB[LLM fallback classifier<br/>knowability_llm_fallback.py<br/>→ Qwen call with class definitions<br/>→ 80-token budget, low temperature]
    style VF fill:#e6f4ea,stroke:#137333
    style TD fill:#e8f4f8,stroke:#1a73e8
    style PR fill:#fef7e0,stroke:#f9ab00
    style UP fill:#fce8e6,stroke:#d93025
```

The class is recorded in the audit (`knowability_routing.class_name`) so you can see, after the fact, which lens was applied to your question.

## Verifier + regenerate loop

When the model produces a draft, the verifier runs **before** the rewriter. Hard failures trigger regeneration with the verifier's findings injected as a constraint message:

```mermaid
flowchart TD
    D[Model draft] --> V[Provenance verifier]
    V --> C1{source/tool tag<br/>validity}
    V --> C2{regression patterns<br/>'indistinguishable from admin',<br/>PwnKit-as-kernel, etc.}
    V --> C3{phase monotonicity<br/>1→2→3→4→5→6→7}
    V --> C4{truncation detection<br/>5 variants:<br/>bullet stub, dangling colon,<br/>no terminal punct, codey mid-token,<br/>unbalanced brackets}
    V --> C5{stance for<br/>knowability class}
    V --> C6{freshness warning<br/>for stale cites}
    C1 & C2 & C3 & C4 & C5 & C6 --> AGG[Aggregate findings]
    AGG --> Q{hard_failures > 0<br/>AND attempt < 3?}
    Q -- yes --> CON[Build constraint message<br/>'fix the following before regenerating:<br/>- PwnKit miscategorized — move to Phase 4<br/>- truncation — answer ended mid-sentence']
    CON --> BUMP{any truncation finding?<br/>bump max_tokens 1.5×}
    BUMP --> LLM[Regenerate]
    LLM --> D
    Q -- no --> SHIP[Ship best attempt<br/>'best' = fewest hard_failures]
    SHIP --> R[Answer rewriter:<br/>add banners · strip OOS recommendations]
    style V fill:#fce8e6,stroke:#d93025
    style SHIP fill:#e6f4ea,stroke:#137333
```

If all 3 attempts truncate, the answer ships **with a visible `⚠ Answer truncated` banner** instructing the operator to ask "continue from where the answer stopped" — never a silent cutoff, never a refusal.

## Installation

### Hardware

| Component | Minimum | Recommended (v13.7 default) |
|---|---|---|
| GPU(s) | 1× 24 GB (single-GPU experimental) | **4× NVIDIA A10 (23 GB each, 92 GB total)** |
| GPU compute capability | 7.5+ (RTX 20-series and newer) | 8.6 (A10) or 9.0 (H100) |
| CPU | 8 cores | 16+ cores |
| RAM | 32 GB | 64+ GB |
| Disk | 200 GB (model + RAG packs) | 500 GB SSD |
| Network | Outbound to HuggingFace for model + RAG ingest | Same |

Tested production config: 4× NVIDIA A10 on a single host, 92 GB total VRAM, Ubuntu 22.04, CUDA 12.x.

### OS + CUDA

```bash
# Ubuntu 22.04 reference; adjust for your distro
sudo apt update && sudo apt install -y python3.10 python3.10-venv git build-essential

# Verify CUDA + driver
nvidia-smi       # confirm driver >= 535 and CUDA >= 12.0
nvcc --version   # confirm CUDA toolkit installed
```

### Path conventions used in this guide

Pick two paths up front, set them as env vars, and use them everywhere. The launcher scripts read these env vars (or fall back to the dev defaults). **Adjust to your environment** — no path is hardcoded into the application logic.

```bash
# Where your Python venv lives (any path you want)
export VENV=$HOME/pentest-venv

# Where you keep large model weights (must be on a disk with 100+ GB free)
export MODELS_DIR=$HOME/models
```

If you skip this step the launchers expect `/home/ubuntu/vllm-env` and `/home/ubuntu/models/` (the original developer's paths) — edit the `scripts/serve_*.sh` files to point at yours, or just export the env vars and the launchers honor them via the `MODEL=` / `PYTHON_BIN=` overrides shown below.
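Before downloading ~77 GB of weights it can help to confirm which paths will actually be used and whether the target disk has the 100+ GB this guide asks for. The following is a minimal preflight sketch, not part of the repository: the helper names `resolve_paths` and `free_gb` are hypothetical, and the only facts taken from this guide are the two env var names, the developer default paths, and the 100 GB figure.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping

# Developer defaults quoted in this guide; used when the env vars are unset.
DEV_DEFAULTS = {
    "VENV": "/home/ubuntu/vllm-env",
    "MODELS_DIR": "/home/ubuntu/models",
}


def resolve_paths(env: Mapping[str, str] = os.environ) -> dict[str, Path]:
    """Return the path for each variable, falling back to the dev default."""
    return {name: Path(env.get(name) or default)
            for name, default in DEV_DEFAULTS.items()}


def free_gb(path: Path) -> float:
    """Free space (decimal GB) on the filesystem that would hold `path`.

    The directory may not exist yet, so walk up to the nearest existing
    ancestor before asking the OS.
    """
    probe = path
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


if __name__ == "__main__":
    paths = resolve_paths()
    for name, value in paths.items():
        print(f"{name} -> {value}")
    enough = free_gb(paths["MODELS_DIR"]) >= 100
    print("MODELS_DIR has 100+ GB free:", enough)
```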
### Python environment + vLLM

```bash
# Create the venv at $VENV (or wherever you prefer)
python3.10 -m venv $VENV
source $VENV/bin/activate

# Core LLM serving
pip install --upgrade pip
pip install vllm                            # tested with vLLM 0.20.1
pip install transformers torch accelerate

# Project deps (small — the heavy stuff comes from the vllm install above)
pip install -r requirements.txt

# FastAPI + supporting
pip install fastapi 'uvicorn[standard]' httpx pyyaml pydantic huggingface_hub
```

To run the launcher scripts under your venv, either activate it before launching (`source $VENV/bin/activate && ./scripts/...`) or pass the python explicitly. Two of the launchers reference `/home/ubuntu/vllm-env/bin/...` directly — replace those with `$VENV/bin/...` once, or use `sed`:

```bash
# One-shot: rewrite all launcher scripts to use your venv
find scripts -name 'serve_*.sh' -exec \
  sed -i "s|/home/ubuntu/vllm-env|$VENV|g" {} +
```

### Model weights

Default model: **Qwen3-Next-80B-A3B-Instruct-FP8** (~77 GB on disk).

```bash
mkdir -p $MODELS_DIR && cd $MODELS_DIR
hf download Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --local-dir $MODELS_DIR/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --max-workers 8

# Alternatives (see "Supported models" section for trade-offs):
# hf download QuantTrio/Qwen3.6-35B-A3B-AWQ --local-dir $MODELS_DIR/Qwen3.6-35B-A3B-AWQ
# hf download Qwen/Qwen2-72B-Instruct-AWQ --local-dir $MODELS_DIR/Qwen2-72B-Instruct-AWQ
```

Download takes 10-30 minutes depending on bandwidth (~77 GB for the default model).

If you put weights at a non-default path, pass `MODEL=` when launching:

```bash
MODEL=$MODELS_DIR/Qwen3-Next-80B-A3B-Instruct-FP8 \
  ./scripts/serve_qwen3_next_80b_fp8.sh
```

Or one-shot rewrite the launchers as you did for the venv:

```bash
find scripts -name 'serve_*.sh' -exec \
  sed -i "s|/home/ubuntu/models|$MODELS_DIR|g" {} +
```

### RAG packs

The RAG packs (`data/rag/packs/*/`) are not in git (too large). Re-ingest from source:

```bash
PYTHONPATH=. python -m src.rag.ingest_cisa_kev          # CISA KEV (~5 MB, daily refresh)
PYTHONPATH=. python -m src.rag.ingest_mitre_attack      # MITRE ATT&CK STIX (~10 MB)
PYTHONPATH=. python -m src.rag.ingest_nvd_cve           # NVD CVE feed (~hours, ~500 MB)
PYTHONPATH=. python -m src.rag.ingest_github_releases   # Tool release notes
PYTHONPATH=. python -m src.rag.ingest_research_blogs    # SpecterOps + others RSS
# operator_hacktricks pack: pull HackTricks book from upstream, run src/rag/ingest_hacktricks.py
```

Each ingest writes `data/rag/packs//chunks.jsonl` and `manifest.json` (freshness metadata, trust_tier).

### Frontend (optional — API works headless)

```bash
cd frontend
npm install
npm run dev   # http://localhost:5173, proxies API to localhost:8000
# Production build: npm run build → frontend/dist/
```

### Sanity check

```bash
# 1. Bring up vLLM
./scripts/serve_qwen3_next_80b_fp8.sh   # ~3-5 min to load
# Watch the log until "Application startup complete":
tail -F /tmp/vllm_qwen3next.log

# 2. Bring up case_server (separate terminal)
./scripts/serve_case_workspace_qwen3_next.sh

# 3. Test the API
curl http://127.0.0.1:9000/v1/models   # confirms vLLM up
curl http://127.0.0.1:8000/health      # confirms case_server up
```

## Running it

### Daily startup (after install)

```bash
# Terminal 1: vLLM backend (v13.7 default: 8K context)
MAX_LEN=8192 GPU_MEM=0.95 TOOL_CALL_PARSER=hermes \
  ./scripts/serve_qwen3_next_80b_fp8.sh

# Terminal 2: case workspace + scope policy + verifier pipeline
./scripts/serve_case_workspace_qwen3_next.sh

# Terminal 3 (optional): frontend
cd frontend && npm run dev
```

The case_server defaults to `0.0.0.0:8000` — bind it to `127.0.0.1` if you don't want LAN access (`PORT=8000 HOST=127.0.0.1` override).

### Direct API usage

```bash
# Create a case
CASE=$(curl -s -X POST http://127.0.0.1:8000/api/cases \
  -H "Content-Type: application/json" \
  -d '{"title":"Engagement Acme — internal pentest"}' | jq -r .case_id)

# Ask a question
curl -s -X POST http://127.0.0.1:8000/api/cases/$CASE/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"I have a normal domain user. What is the cheapest path to DA?"}' \
  | jq '{
      content: .assistant.content,
      playbook: .assistant.skill_calls[0].playbook.playbook_id,
      catalog: [.assistant.skill_calls[0].technique_catalog.matches[].technique_id],
      hard_failures: .assistant.skill_calls[0].provenance_verifier.n_hard_failures,
      scope: .assistant.skill_calls[0].scope_filter.status
    }'
```

### Attaching an engagement scope (optional but recommended)

Drop a JSON file at `data/engagements/.json`:

```json
{
  "name": "Acme internal Q2-2026",
  "mode": "ad_v1",
  "assets": [
    { "id": "a1", "value": "dc01.corp.acme.local", "type": "hostname", "scope_status": "known_in_scope", "role": "domain controller" },
    { "id": "a2", "value": "10.10.20.0/24", "type": "cidr", "scope_status": "known_in_scope", "role": "internal subnet" },
    { "id": "a3", "value": "billing-prod.acme.com", "type": "hostname", "scope_status": "known_out_of_scope", "role": "explicitly excluded" }
  ],
  "roe": {
    "allowed_actions": ["passive recon", "authenticated AD enumeration"],
    "forbidden_actions": ["destructive testing", "social engineering"],
    "approval_required": ["active scanning", "credential cracking"]
  },
  "engagement_id": "case-"
}
```

The output scope filter then:

- Strips recommendations referencing `billing-prod.acme.com` (known out of scope)
- Queues unknown assets for operator review (`scope_status: "warn"`)
- Allows in-scope mentions through silently

## Your first session — three example questions

Three example operator turns showing what comes back, what the audit signals mean, and how to read the AuditPanel.

### Q1 (verifiable_fact path)

**What happens internally**:

1. Knowability router classifies as `verifiable_fact` (regex match on `CVE-\d{4}-\d+`)
2. Skill forces `check_kev` tool call (tool_choice=required)
3. Tool returns: `{vendor: Palo Alto, product: PAN-OS GlobalProtect, date_added: 2024-04-12, due_date: 2024-04-19, known_ransomware_use: true}`
4. Model renders the tool output with `[tool: check_kev]` tags

**You get**:

**AuditPanel shows**:

- `knowability_routing.class_name: verifiable_fact`
- `tools_called: [check_kev]`
- `hard_failures: 0`
- `fake_citations: 0`

### Q2 (time_decaying_documented + playbook fires)

**What happens internally**:

1. Knowability router classifies as `time_decaying_documented` (matches "want to get root")
2. Playbook matcher hits `linux_privesc_ranking` (8 phases: Context check → Enum → Cred hunt → Sudo → SUID → Capabilities → Cron → Kernel)
3. Catalog matcher hits `linux_sudo_misconfig`, `linux_capability_abuse`, `linux_kernel_privesc`
4. RAG retrieves 2 hits from `operator_hacktricks`
5. Skill composes prompt with playbook + catalog + RAG; model renders all 8 phases verbatim from the playbook

**You get**: Full 8-phase walkthrough with exact commands (`wget linpeas`, `getcap -r /`, `sudo find /etc -exec /bin/sh \; -quit`), modern cred-hunt paths (`.env`, `.aws/credentials`, `.kube/config`), honest sudo OPSEC ("logged but often unmonitored", not "indistinguishable"), `-p` flag in cron exploit, `/dev/shm` fallback for `/tmp noexec`.

**AuditPanel shows**:

- `knowability_routing.class_name: time_decaying_documented`
- `playbook.playbook_id: linux_privesc_ranking`
- `technique_catalog.matches: [linux_sudo_misconfig, linux_capability_abuse, linux_kernel_privesc]`
- `rag.hit_ids: [operator_hacktricks:..., operator_hacktricks:...]`
- `regenerate_loop.total_attempts: 1` (clean on first attempt)

### Q3 (pure_reasoning + honest stopping)

**What happens internally**:

1. Knowability router classifies as `pure_reasoning` (matches "rank top 3")
2. Prereq verifier flags: "Potato family requires SeImpersonate; not held by interactive low-priv by default"
3. Skill produces a draft, verifier finds the model ranked confidently without enum
4. Stance check fails for `pure_reasoning` class; regen with constraint "no confident rank without observed `whoami /priv` output"

**You get**: A refusal-to-rank with the honest senior framing — "I can't give a confident ranking without `whoami /priv` and `whoami /groups` from the target. Interactive low-priv users on Win11 typically do NOT hold SeImpersonate, so the Potato family is OFF the table by default. Run first: `whoami /priv`, `whoami /groups`, … Paste back."

**AuditPanel shows**:

- `knowability_routing.class_name: pure_reasoning`
- `prereq_verifier.downgrades: [{technique: 'JuicyPotatoNG', why: 'requires SeImpersonate'}]`
- `provenance_verifier.findings: []` after regen (started with stance_violation)
- `regenerate_loop.total_attempts: 2`

This is the "senior pentester wouldn't rank without recon" pattern, structurally enforced.

## Adding content

The substrate is the primary quality lever. Three add patterns:

### Adding a catalog entry

A catalog entry is a hand-curated technique record. Example walkthrough — adding a hypothetical "AWS IMDS credential exfil" entry:

```bash
# 1. Create the YAML
cat > data/technique_catalog/aws_imds_exfil.yaml <<'YAML'
technique_id: aws_imds_exfil
canonical_name: "AWS IMDSv1 credential exfiltration"
attack_ids:
  - T1552.005   # Unsecured Credentials: Cloud Instance Metadata API
aliases:
  - "IMDS"
  - "IMDSv1 exfil"
  - "AWS instance metadata creds"
last_reviewed: 2026-05-27
reviewed_by: "operator"
summary: >
  EC2 instances using IMDSv1 (not IMDSv2) expose temporary IAM credentials
  at http://169.254.169.254/latest/meta-data/iam/security-credentials/.
  Any process with network access to the IMDS link-local address can read
  these credentials without authentication. Common SSRF + IMDS chains turn
  a webapp SSRF into AWS account takeover.
defining_conditions:
  - id: imds_v1_enabled
    label: "Instance metadata service set to allow IMDSv1 (HttpTokens=optional)"
    verification: "curl -s http://169.254.169.254/latest/meta-data/ # returns content WITHOUT a token header"
  - id: instance_has_role
    label: "EC2 instance has an attached IAM role (instance profile)"
    verification: "curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/"
prerequisites:
  - "Network reachability to 169.254.169.254 (on-instance, container with host network, or via SSRF chain)"
canonical_tooling:
  preferred:
    - "curl + jq (minimal footprint)"
  acceptable:
    - "AWS SDK (boto3) directly to STS once creds extracted"
exploitation_steps:
  - "Enumerate the role name: `curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/`"
  - "Fetch credentials: `curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/`"
  - "Configure AWS CLI / boto3 with returned AccessKeyId / SecretAccessKey / Token"
  - "Verify: `aws sts get-caller-identity` — returns the assumed role ARN"
  - "Enumerate permissions: `aws iam simulate-principal-policy` or test specific service APIs"
detection_signals:
  - "CloudTrail event: STS sts:AssumeRole or first-use of the role from unusual source IP"
  - "GuardDuty: Recon:IAMUser/MaliciousIPCaller or UnauthorizedAccess:IAMUser/InstanceCredentialExfiltrationOutsideAWS"
  - "VPC Flow Logs: outbound traffic to 169.254.169.254 from unexpected processes"
hardened_mitigations:
  - "Enforce IMDSv2 (HttpTokens=required) at the AWS Organization level via SCP"
  - "Set hop limit to 1 (HttpPutResponseHopLimit=1) so containers can't reach IMDS"
  - "Restrict IAM role permissions to least-privilege; never attach AdministratorAccess to runtime roles"
references:
  - source: "AWS docs — Configure the instance metadata service"
    url: "https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html"
  - source: "MITRE ATT&CK T1552.005"
    url: "https://attack.mitre.org/techniques/T1552/005/"
YAML

# 2. Register the keyword mapping in retrieval_coverage so the substrate
#    coverage check knows what concept keywords to look for in retrieved
#    chunks for this technique.
#    Edit src/agent/retrieval_coverage.py, add an entry to
#    CONCEPT_KEYWORDS_BY_CONDITION_ID:
#      "imds_v1_enabled": ["HttpTokens", "IMDSv1", "instance metadata", "169.254.169.254"],
#      "instance_has_role": ["instance profile", "iam role", "security-credentials"],

# 3. Restart case_server to pick up the new catalog entry
pkill -f 'uvicorn.*case_server'
./scripts/serve_case_workspace_qwen3_next.sh
```

Now an operator asking "I got IMDS access on an EC2 instance, what next?" will get the entry rendered into the prompt.

### Adding a playbook

A playbook is a multi-phase sequence triggered by intent + context patterns. Same idea, different YAML schema:

```yaml
playbook_id: aws_post_imds_exfil
canonical_name: "Post-IMDS exfil: AWS lateral movement and persistence"
last_reviewed: 2026-05-27
applies_when:
  intent_patterns:
    - '\b(?:got|obtained|extracted)\s+(?:aws|ec2)\s+(?:creds|credentials|keys)\b'
    - '\baws\s+(?:lateral|pivot|escalate)\b'
  context_patterns:
    - '\bAWS\b'
    - '\bsts\s+get-caller-identity\b'
canonical_sequence:
  - phase: "Phase 1 — Identify scope of stolen credentials"
    commands:
      - "aws sts get-caller-identity"
      - "aws iam list-attached-role-policies --role-name "
      - "aws iam list-role-policies --role-name "
    triage: "..."
  - phase: "Phase 2 — Map permissions to attack surface"
    commands: ["..."]
  # ... etc
```

See `data/playbooks/linux_privesc_ranking.yaml` for a full multi-phase reference.

### Adding a RAG pack

For a new corpus (e.g. "AWS security blog feeds"):

1. Write `src/rag/ingest_aws_blogs.py` following the pattern in `src/rag/ingest_research_blogs.py`
2. Add the pack name to `src/rag/packs.py` registry
3. Add a routing keyword in `src/agent/rag_router.py` so AWS-related questions retrieve from this pack
4. Run the ingester: `python -m src.rag.ingest_aws_blogs`
5. Verify: `ls data/rag/packs/aws_blogs/` shows `chunks.jsonl` + `manifest.json`

## Supported models + GPU requirements

The system is model-agnostic — swap the `serve_*.sh` launcher and the matching `serve_case_workspace_*.sh` to switch models. Tested models:

| Model | Quant | Weights | TP | VRAM total | Context | Tool parser | Notes |
|---|---|---|---|---|---|---|---|
| **Qwen3-Next-80B-A3B-Instruct-FP8** | FP8 (native) | 77 GB | 4 | 92 GB | 8 K | `hermes` | **v13.7 default.** Best substrate grounding |
| Qwen3.6-35B-A3B-AWQ (`QuantTrio/`) | AWQ 4-bit | 24 GB | 4 | 32 GB+ | 32 K | `qwen3_coder` | Thinking-mode; v13.6 experiment |
| Qwen3-32B | BF16 | 64 GB | 4 | 92 GB | 16 K | `hermes` | Dense; weaker substrate use |
| Qwen2-72B-Instruct-AWQ | AWQ 4-bit | 38 GB | 4 | 60 GB+ | 16 K | `hermes` | Dense; v13.4 experiment, regressed |
| Llama-3.3-70B-Instruct-AWQ | AWQ 4-bit | 40 GB | 4 | 60 GB+ | 32 K | `llama3_json` | Untested in pipeline |
| DeepSeek-R1-Distill-Llama-70B-AWQ | AWQ 4-bit | 40 GB | 4 | 60 GB+ | 8 K | `deepseek_v3` | Reasoning-heavy outputs |

GPU compatibility:

| GPU | VRAM | Per-GPU | Can run default model? | Can run 35B AWQ? |
|---|---|---|---|---|
| NVIDIA H100 80GB | 80 GB | 80 GB | yes (1×) | yes (1×) |
| NVIDIA A100 80GB | 80 GB | 80 GB | yes (1×) | yes (1×) |
| NVIDIA A100 40GB | 40 GB | 40 GB | yes (2×) | yes (1×) |
| **NVIDIA A10 23GB (default)** | 23 GB | 23 GB | yes (4×) | yes (2×) |
| NVIDIA A6000 48GB | 48 GB | 48 GB | yes (2×) | yes (1×) |
| NVIDIA RTX 4090 24GB | 24 GB | 24 GB | yes (4×) | yes (2×) |
| NVIDIA RTX 3090 24GB | 24 GB | 24 GB | yes (4×, ≥8.6 compute cap) | yes (2×) |

Single-GPU experimentation is possible with the smaller AWQ variants (Qwen3.6-35B-A3B-AWQ on a 24 GB card with `TP=1 MAX_LEN=8192 GPU_MEM=0.9`).
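When switching models, the values in the tested-models table map onto the launcher environment variables (`MODEL`, `TP`, `MAX_LEN`, `TOOL_CALL_PARSER`). The sketch below shows one way to keep that mapping in a single place. It is an illustration only: `TESTED_MODELS` and `launcher_env` are hypothetical names, only three table rows are transcribed, and reading "8 K" as 8192 tokens (and so on) is an assumption.

```python
from typing import Dict

# Transcribed from the tested-models table above. Context sizes assume
# "8 K" = 8192, "16 K" = 16384, "32 K" = 32768 tokens.
TESTED_MODELS: Dict[str, Dict[str, str]] = {
    "Qwen3-Next-80B-A3B-Instruct-FP8": {
        "TP": "4", "MAX_LEN": "8192", "TOOL_CALL_PARSER": "hermes",
    },
    "Qwen3.6-35B-A3B-AWQ": {
        "TP": "4", "MAX_LEN": "32768", "TOOL_CALL_PARSER": "qwen3_coder",
    },
    "Qwen2-72B-Instruct-AWQ": {
        "TP": "4", "MAX_LEN": "16384", "TOOL_CALL_PARSER": "hermes",
    },
}


def launcher_env(model_name: str, models_dir: str) -> Dict[str, str]:
    """Build the env var overrides for one tested model.

    Raises KeyError for models that are not in the table, so an untested
    model is never launched with guessed settings.
    """
    if model_name not in TESTED_MODELS:
        raise KeyError(f"no tested settings recorded for {model_name!r}")
    env = dict(TESTED_MODELS[model_name])  # copy so callers can mutate freely
    env["MODEL"] = f"{models_dir.rstrip('/')}/{model_name}"
    return env
```

The returned dict could be merged into `os.environ` before invoking the matching `serve_*.sh` launcher; which launcher script corresponds to which model is not covered by this sketch.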
## Configuration reference

### Environment variables

| Var | Default | Effect |
|---|---|---|
| `MODEL` | `$MODELS_DIR/Qwen3-Next-80B-A3B-Instruct-FP8` (dev default: `/home/ubuntu/models/...`) | Model weights path; set when running, or edit the launcher |
| `PYTHON_BIN` | `$VENV/bin/python` (dev default: `/home/ubuntu/vllm-env/bin/python`) | Python interpreter the launcher uses; relevant for case_server scripts |
| `PORT` | `9000` (vLLM), `8000` (case_server) | HTTP bind port |
| `TP` | `4` | vLLM tensor-parallel size |
| `GPU_MEM` | `0.95` | Fraction of VRAM vLLM can use |
| `MAX_LEN` | `4096` (script default), `8192` (v13.7 recommended) | Per-sequence context tokens |
| `MAX_NUM_SEQS` | `1` | Concurrent in-flight requests |
| `MAX_NUM_BATCHED_TOKENS` | `4096` | vLLM batch budget |
| `TOOL_CALL_PARSER` | `hermes` (v13.7) | vLLM tool-call extractor; see launcher comments |
| `PTC_MODEL_PROFILE` | profile JSON path | Which `config/model_profiles/*.json` to use |
| `PTC_ANSWER_MAX_TOKENS` | `1200` | Output budget for substrate-grounded answers (600 fallback) |
| `PTC_FREEFORM_SYSTEM_PROMPT` | embedded in launcher | Operator-framing system prompt |
| `PTC_ANSWER_REWRITER` | `1` | Enable rewriter (`0` to disable for diagnostic) |

### Model profiles

`config/model_profiles/*.json` maps role names to the served-model name. Example:

    {
      "name": "qwen3-next-vllm",
      "endpoint": "http://127.0.0.1:9000/v1",
      "models": {
        "answer_freeform": "qwen3-next-80b",
        "intent_classifier": "qwen3-next-80b",
        "knowability_fallback": "qwen3-next-80b"
      }
    }

Each `serve_case_workspace_*.sh` exports `PTC_MODEL_PROFILE` pointing at the right profile so the right model name flows through.

## Substrate layout

    1. Technique catalog (data/technique_catalog/*.yaml): 38 entries
       ├── technique_id, attack_ids, aliases
       ├── defining_conditions     (the misconfigs that ENABLE the technique)
       ├── prerequisites           (what the OPERATOR must have)
       ├── canonical_tooling       (preferred / acceptable / legacy)
       ├── exploitation_steps      (verbatim commands)
       ├── detection_signals       (event IDs, log sources)
       ├── hardened_mitigations    (what breaks this technique)
       └── references              (SpecterOps / MITRE / vendor docs)

    2. Playbooks (data/playbooks/*.yaml): 16 entries
       ├── intent_patterns + context_patterns  (matchers)
       ├── canonical_sequence: [phase 0..N]
       │   ├── rationale
       │   ├── commands
       │   ├── triage
       │   └── opsec  (honest: "logged, often unmonitored" ≠ "indistinguishable from admin")
       ├── expected_outputs
       ├── do_not_skip
       └── opsec_notes

    3. RAG packs (data/rag/packs//): 6 corpora
       ├── operator_hacktricks   (HackTricks book chunks)
       ├── cve_kev               (CISA Known Exploited Vulns)
       ├── attack_mitre          (MITRE ATT&CK STIX)
       ├── nvd_cve               (CVE descriptions)
       ├── github_releases       (tool release notes)
       └── research_blogs        (SpecterOps + others, RSS-ingested)

       Each pack: chunks.jsonl + manifest.json (freshness, trust_tier, source)

## Structural verifier rules

Every answer runs through these before shipping. Each is a hard-fail that triggers regeneration with a constraint message.

| Rule ID | Catches |
|---|---|
| `hallucinated_source` | `[source: pack:chunk]` cites a hit_id not in this turn's retrieval |
| `hallucinated_tool` | `[tool: name]` cites a tool that wasn't called this turn |
| `invalid_cve` / `invalid_attack_id` | CVE-YYYY-NNNN or T-XXXX in the body that doesn't resolve in substrate |
| `stance_violation` | Forbidden phrase for the knowability class (e.g. confident-rank language on a recon-required prompt) |
| `stale_uncited` | Cited chunk older than the class's freshness horizon without an inline `(N days ago)` warning |
| `regression:opsec_indistinguishable_from_admin` | False OPSEC claim about sudo/SUID being invisible to SOC |
| `regression:pwnkit_as_kernel_exploit` (negation-aware) | PwnKit categorized as kernel exploit (it's userspace pkexec) |
| `regression:kernel_phase_lists_pwnkit` | PwnKit listed in a kernel-exploit phase |
| `regression:suid_bash_drop_without_p` | `chmod u+s ...sh` without the mandatory `-p` invocation |
| `regression:uncited_high_percentage_claim` | "95% of vectors" / "Falcon catches 80%" without a citation or `[model-default]` tag |
| `regression:phase_numbering_gap` | "Phase 1 → 2 → 3 → 4 → 6" structural skip |
| `regression:phase_numbering_start` | Sequence doesn't start at Phase 1 |
| `regression:truncation_bullet_stub` | Answer ends with empty bullet markers (`*`, `-`, `1.`) |
| `regression:truncation_dangling_colon` | "Phase 7 includes:" with no items below |
| `regression:truncation_no_terminal_punct` | Last prose line ends mid-sentence (no `.!?:`) |
| `regression:truncation_codey_mid_token` | Code block ends mid-token (`uid=0(root` with no closing paren) |
| `regression:truncation_unbalanced_brackets` | Last line has open brackets/backticks without matching close |

**Asset-shape scope filter** (`src/engagement/scope.py`) applies the same idea to scope warnings: ~30 tool-repo hosts (github.com, gtfobins.github.io, learn.microsoft.com, …), Python/PowerShell/DB-system namespace prefixes (`os.*`, `mysql.*`, `information_schema.*`), config-file basenames (`.env`, `.aws/credentials`, `.kube/config`, `.my.cnf`), and cloud-metadata IPs (169.254.169.254 + GCP/Azure equivalents) are filtered at extraction time.
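The filtering idea in the paragraph above can be sketched as follows. This is a minimal illustration, not the code in `src/engagement/scope.py`: the sets hold only the examples named in this README, the whitespace tokenizer is a simplification, and the function names `is_non_target` and `candidate_assets` are hypothetical.

```python
# Illustrative subsets only; the real lists are larger and live in scope.py.
TOOL_REPO_HOSTS = {"github.com", "gtfobins.github.io", "learn.microsoft.com"}
CODE_NAMESPACE_PREFIXES = ("os.", "mysql.", "information_schema.")
CONFIG_FILENAMES = {".env", ".aws/credentials", ".kube/config", ".my.cnf"}
METADATA_LINKLOCAL_IPS = {"169.254.169.254"}


def is_non_target(token: str) -> bool:
    """True when a dotted token looks like tooling, code, or config, not an asset."""
    t = token.lower()
    return (
        t in TOOL_REPO_HOSTS
        or t in CONFIG_FILENAMES
        or t in METADATA_LINKLOCAL_IPS
        or t.startswith(CODE_NAMESPACE_PREFIXES)
    )


def candidate_assets(text: str) -> list[str]:
    """Collect dotted tokens that survive the non-target filter, in order seen."""
    kept: list[str] = []
    for raw in text.split():
        # Drop wrapping punctuation and a trailing sentence period.
        tok = raw.strip("`'\"()[]<>,;:!?").rstrip(".")
        if "." not in tok:
            continue
        if is_non_target(tok) or tok in kept:
            continue
        kept.append(tok)
    return kept


answer = ("Run os.system after cloning from github.com, then curl "
          "169.254.169.254 and scan dc01.client.local.")
print(candidate_assets(answer))
```

Only `dc01.client.local` survives here, so a scope warning would fire for that host alone rather than for every dotted token in the answer.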
## Quality eval framework

    eval/quality/
    ├── cases/                        # P01-P12 (Win/AD/CVE) + L01 (Linux)
    │   └── PXX.json                  # {prompt_id, user_prompt, target_context,
    │                                 #  reference_answer (hand-written), reviewer_notes}
    ├── rubric.md                     # 6-axis 0/1/2 rubric (12 max)
    │   ├── (a) technical correctness     # fabricated names / wrong categorization
    │   ├── (b) confident-wrong absence   # HARD-FAIL: fabricated facts
    │   ├── (c) prerequisite honesty      # named prereqs per technique
    │   ├── (d) recon-first               # didn't rank without enum
    │   ├── (e) operational realism       # senior's notes vs textbook
    │   └── (f) honest stopping           # said "I need X" when X is missing
    ├── judge.py                      # LLM-as-judge for axes a/b/e (oracle = reference_answer)
    ├── judge_baseline.py             # End-to-end: case_server → judge → markdown report
    ├── judge_calibration.py          # 4-test suite verifying judge isn't biased
    ├── detectors.py + scorer.py      # Deterministic checks for axes c/d/f
    └── runner.py + aggregate.py + report.py

Run the quality baseline:

    PYTHONPATH=. python -m eval.quality.judge_baseline --run-id my_baseline
    # → docs/quality_my_baseline.md (per-case scores + judge rationales)

Run the judge calibration suite (verify the judge isn't biased before trusting scores):

    PYTHONPATH=. python -m eval.quality.judge_calibration
    # → docs/quality_judge_calibration.md (target: ≥90% pass)

Run the 151-prompt × 30-topic coverage corpus:

    PYTHONPATH=. python -m eval.coverage.runner --run-id my_coverage
    PYTHONPATH=. python -m eval.coverage.aggregate --run-id my_coverage \
        --out docs/my_coverage_matrix.md

Current v13.7 numbers:

| Metric | Value |
|---|---|
| Judge calibration | 97% (34/35 tests) |
| 5-case oracle baseline | 27/30 = 8.7/10 on judged axes |
| Coverage substrate-hit % | 92.7 |
| Coverage structured-hit % | 45.0 |
| Hard failures / 151 prompts | 1 |
| Fake citations / 151 prompts | 0 |

## Repository layout

    src/                      # Python sources
    ├── agent/                # skills, knowability, catalog, playbook, verifiers, rewriter
    ├── engagement/           # scope filters, engagement YAML loader
    ├── rag/                  # pack format, ingestion, retrieval
    ├── serving/              # FastAPI case_server + case UI static
    └── evals/                # historical eval scripts (predecessor to eval/)
    eval/                     # current eval frameworks
    ├── coverage/             # 151-prompt × 30-topic corpus runner + aggregator
    └── quality/              # LLM-judge baseline (rubric + reference answers)
    data/                     # substrate + fixtures (large dirs gitignored)
    ├── technique_catalog/    # 38 catalog YAMLs (TRACKED)
    ├── playbooks/            # 16 playbook YAMLs (TRACKED)
    ├── engagements/          # 117 eval scope fixtures (TRACKED, all client.local fake hostnames)
    ├── rag/packs/            # ingested RAG pack chunks (GITIGNORED; regenerate with src/rag/ingest_*.py)
    ├── raw/                  # source HTML / STIX / JSON for RAG ingest (GITIGNORED, 2.8 GB)
    └── cases/                # per-session operator chat history (GITIGNORED)
    scripts/                  # vLLM + case_server launchers per model
    ├── serve_qwen3_next_80b_fp8.sh           # ← v13.7 default
    ├── serve_case_workspace_qwen3_next.sh    # ← v13.7 default
    ├── serve_qwen36_35b_a3b_awq.sh           # v13.6 experiment
    ├── serve_qwen2_72b_awq.sh                # v13.4 experiment (regressed)
    └── ...
    config/model_profiles/    # JSON: which served-model name maps to which role per profile
    tests/                    # 41 pytest files; run with `pytest tests/`
    docs/                     # architecture docs, eval reports, diffs
    frontend/                 # React 18 + Vite + Tailwind + zustand + reactflow
    └── src/components/chat/AuditPanel.tsx    # surfaces per-turn substrate/verifier signals

## Troubleshooting

### vLLM OOM at startup

vLLM sees less free VRAM than reported. Fix:

    GPU_MEM=0.92 ./scripts/serve_qwen3_next_80b_fp8.sh   # was 0.95

Or, if still OOM, reduce context:

    MAX_LEN=4096 MAX_NUM_BATCHED_TOKENS=4096 GPU_MEM=0.92 ./scripts/serve_qwen3_next_80b_fp8.sh

### vLLM startup fine, but `/chat` returns HTTP 400

The token budget is exceeded: input (system prompt + catalog + playbook + RAG hits) plus the output budget is larger than `MAX_LEN`. Either:

- Bump `MAX_LEN=8192` (needs ≥1 GB extra VRAM headroom per GPU)
- Or reduce `PTC_ANSWER_MAX_TOKENS=800` (was 1200)

### Answer is truncated

The truncation detector should catch this and ship with a `⚠ Answer truncated` banner. If you see truncation without a banner, the regen attempts were exhausted and the banner logic in `answer_rewriter.py` may not have triggered; check `audit.skill_calls[0].provenance_verifier.findings` for `regression:truncation_*`. Bump `PTC_ANSWER_MAX_TOKENS=2000` and retry.

### Scope filter flags `mysql.func` / `github.com` / `os.system` as candidate assets

This is the asset-shape gate misclassifying; these tokens should be filtered at extraction (`src/engagement/scope.py`). If you see new tokens of this class being flagged, add them to `_TOOL_REPO_HOSTS`, `_CODE_NAMESPACE_PREFIXES`, `_CONFIG_FILENAMES`, or `_METADATA_LINKLOCAL_IPS` and restart case_server.

### Verifier rejects every answer

Check `audit.skill_calls[0].provenance_verifier.findings`. The most common new-content failure is `regression:opsec_indistinguishable_from_admin` if you've imported a playbook with the old OPSEC wording. Fix the source YAML and restart.
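When triaging rejections, it can help to tally which rules fired. The sketch below walks the `audit.skill_calls[0].provenance_verifier.findings` path named above on an already-parsed response. The shape of each finding is an assumption (a plain rule-ID string, or a dict with a hypothetical `rule_id` key), and the sample payload is invented for illustration, so adjust both to what the server actually returns.

```python
from collections import Counter


def verifier_findings(response: dict) -> list:
    """Return findings from the first skill call's provenance verifier, or []."""
    calls = response.get("audit", {}).get("skill_calls") or []
    if not calls:
        return []
    verifier = calls[0].get("provenance_verifier") or {}
    return verifier.get("findings") or []


def count_rules(findings: list) -> Counter:
    """Tally findings by rule ID.

    Assumption: a finding is either a rule-ID string or a dict carrying a
    `rule_id` key. Anything else is ignored.
    """
    ids = []
    for item in findings:
        if isinstance(item, str):
            ids.append(item)
        elif isinstance(item, dict):
            ids.append(str(item.get("rule_id", "unknown")))
    return Counter(ids)


# Hypothetical response payload, trimmed to the audit path used above.
sample = {"audit": {"skill_calls": [{"provenance_verifier": {"findings": [
    {"rule_id": "regression:opsec_indistinguishable_from_admin"},
    {"rule_id": "regression:opsec_indistinguishable_from_admin"},
    "regression:truncation_bullet_stub",
]}}]}}

for rule, n in count_rules(verifier_findings(sample)).most_common():
    print(f"{n}x {rule}")
```

A single rule dominating the tally usually points at one piece of imported content rather than at the model.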
### Model fabricates "this command pretends to have executed" outputs

The system prompt instruction is missing or was overridden. Check `scripts/serve_case_workspace_qwen3_next.sh` for the `NO FAKE EXECUTION CONTEXT` block in `PTC_FREEFORM_PROMPT`. Restart case_server after editing.

### Frontend `npm install` fails

Use Node 18+ (the project uses ESM); check with `node --version`. Then run `cd frontend && rm -rf node_modules && npm install`.

## Honest limitations

What this is not, and where it currently misses:

- **Not an autonomous agent.** It answers questions and renders playbook steps; the operator runs the commands. No tool-execution sandbox.
- **Not a substitute for a senior pentester.** Verified quality is ~8.7/10 on a hand-graded 5-case set: useful as a junior teammate, not as a final reviewer. Edits expected on roughly one in four answers.
- **Substrate coverage is uneven.** 38 catalog entries and 16 playbooks cover Linux/Windows/AD privesc, ADCS, Kerberos, web injection, and external recon well. Cloud (AWS/Azure/GCP), container escapes (Docker socket / k8s), and modern CI/CD attack surface are thinner.
- **Judge is the same model family as the system being judged.** Calibrated to 97% but has a known false-positive on the PKINIT EKU OID. Planned: cross-judge with a different family.
- **No real-time pack updates.** RAG packs are re-ingested manually (`src/rag/ingest_*.py`). Freshness metadata is tracked per chunk; stale citations get an inline warning. KEV/CVE data via deterministic tools is fresher than RAG.
- **AuditPanel surfaces substrate signals** but doesn't yet show *what the verifier rejected and why*; the data exists in the response audit, just not rendered.

## Authorized use

Designed for authorized engagements only: lab work, CTFs, owned systems, sanctioned pentests. Input/output scope filters enforce engagement boundaries when a scope file is provided at `data/engagements/.json`.
The system does not refuse questions on offensive technique substance; the jailbreak detector reframes prompts that ask it to ignore prerequisite-checking or claim a technique "won't be detected". This is a copilot for the operator, not a wrapper that decides whether to act. **Authorization is your responsibility.**

## Repo state

- Active branch: `orchestration-v1`
- Latest substantive commit: `v13.7: structural verifiers, quality eval framework, model swap experiments`
- Release notes per version: `RELEASE_NOTES_v12.md`, `RELEASE_NOTES_v13.md`, `RELEASE_NOTES_v13_1.md`
- Runbook: `RUNBOOK.md` (operational procedures)
- Architecture deep-dive: `docs/v13_ARCHITECTURE.md`, `docs/v13_DIAGRAMS.md`, `docs/v13_CODE_MAP.md`
- Developer guide: `docs/v13_DEVELOPER_GUIDE.md`