# Agentic Pentesting Framework (APF)
A multi-agent LLM framework for autonomous penetration testing. APF drives a 7-phase pipeline — strategy, reconnaissance, scanning, branch splitting, exploitation, validation, reporting — through a model-agnostic Python orchestrator that talks to any LiteLLM-compatible LLM and any MCP-exposed security toolchain.
*Bachelor's thesis (ZHAW, 2026). Research question: does structured, phase-driven workflow guidance improve an LLM's end-to-end penetration-testing effectiveness and token efficiency over unguided single-agent execution, and does this benefit hold across models of substantially different capability — from weaker open-weight models (Gemma 4 26B/31B, Llama 3.3) to a frontier proprietary model (Claude Opus 4.7)?*
orchestrator.py
│
├── Phase 1 Strategy → storage//session_state.json
├── Phase 2 Reconnaissance → storage//ptt.json
├── Phase 3 Scanning → storage//ptt_extended.json
├── Phase 4 PTT Splitting → storage//ptt_branches.json
├── Phase 5 Exploitation → storage//branch_N/execution.json (parallel)
├── Phase 6 Validation → storage//findings.json (parallel)
└── Phase 7 Reporting → storage//report.json
Every run writes its JSON artefacts to a fresh timestamped folder under `storage/` and a deterministic Markdown summary to `results//report.md`. The full pipeline, the agent format, and the JSON contracts are documented in [docs/architecture.md](docs/architecture.md).
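To see how far a run got, the per-phase files from the tree above can be checked directly. The following is a minimal sketch and not part of APF: the helper name `phase_status` and the `PHASE_ARTIFACTS` mapping are hypothetical, and the run folder is whatever timestamped directory a run created under `storage/`.

```python
from pathlib import Path

# File names are taken from the pipeline tree above; the mapping itself is
# illustrative and not an APF data structure.
PHASE_ARTIFACTS = {
    "1 Strategy": "session_state.json",
    "2 Reconnaissance": "ptt.json",
    "3 Scanning": "ptt_extended.json",
    "4 PTT Splitting": "ptt_branches.json",
    "6 Validation": "findings.json",
    "7 Reporting": "report.json",
}


def phase_status(run_dir: Path) -> dict:
    """Return, per phase, whether its expected artefact exists in run_dir."""
    status = {
        phase: (run_dir / name).is_file()
        for phase, name in PHASE_ARTIFACTS.items()
    }
    # Phase 5 writes one execution.json per branch folder (branch_N/),
    # so it counts as present if at least one branch produced a file.
    status["5 Exploitation"] = any(run_dir.glob("branch_*/execution.json"))
    return status
```

Calling `phase_status(Path("storage") / some_run_folder)` returns a dictionary with one boolean per phase. An interrupted run typically shows `True` for the early phases and `False` from the point where it stopped.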
## Contents
- [Quick start](#quick-start)
- [Reproduce the thesis ablations](#reproduce-the-thesis-ablations)
- [Where to find what](#where-to-find-what)
- [Tech stack](#tech-stack)
- [See also](#see-also)
## Quick start
**1 — Clone with submodules** (HexStrike + validation-benchmarks)
git clone --recursive https://github.com/Manuelvillarvieites/Agentic-Pentesting-Framework.git
cd Agentic-Pentesting-Framework
# If you already cloned without --recursive: git submodule update --init --recursive
**2 — Patch upstream XBEN compose-file bugs** (one-time, idempotent)
python3 scripts/patch_benchmarks.py
**3 — Python dependencies**
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install fastmcp requests # host-side MCP adapter deps
**4 — API keys** (only the providers you use)
cp docker/.env.example docker/.env
# then edit docker/.env
**5 — Start the Docker stack** (LiteLLM + HexStrike-on-Kali)
docker compose -f docker/docker-compose.yml up -d
**6 — Verify**
python3 scripts/orchestrator.py 127.0.0.1 --dry-run
curl http://localhost:8889/health # → {"status": "ok"}
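The same health check can be scripted, for example before launching a long sweep. This is a hedged sketch using only the Python standard library. It assumes the endpoint and response body shown in the `curl` line above; the function names `http_get` and `hexstrike_healthy` are hypothetical and not part of APF.

```python
import json
import urllib.request
from typing import Callable


def http_get(url: str) -> bytes:
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read()


def hexstrike_healthy(
    url: str = "http://localhost:8889/health",
    fetch: Callable[[str], bytes] = http_get,
) -> bool:
    """Return True if the health endpoint answers with {"status": "ok"}."""
    try:
        payload = json.loads(fetch(url))
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body all count as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

The `fetch` parameter exists only so the check can be exercised without a running container; in normal use `hexstrike_healthy()` is called with no arguments.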
**7 — Start an XBEN benchmark target** (Easy10 LFI example, ~5 min on a laptop)
cd external/validation-benchmarks/benchmarks/XBEN-019-24
make build && make run # build (injects the FLAG) + start the container
# `docker compose ps` shows one row; the web app publishes container port 80 on a random host port:
# xben-019-24-lfi_static_resource-1 ... 0.0.0.0:49280->80/tcp, [::]:49280->80/tcp
# (49280 is the host port in this example)
# Auto-extract that host port into XBEN_PORT (copy as-is):
export XBEN_PORT=$(docker compose ps | grep -oE '0\.0\.0\.0:[0-9]+->80/tcp' | head -1 | cut -d: -f2 | cut -d- -f1)
echo "Benchmark is on http://localhost:${XBEN_PORT}"
cd -
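For scripted runs, the port extraction from step 7 can also be done in Python. The sketch below mirrors the `grep`/`cut` pipeline above and assumes the same `docker compose ps` output format; `extract_host_port` is a hypothetical helper name, not an APF function.

```python
import re
from typing import Optional

# Matches the IPv4 publish mapping for container port 80,
# e.g. "0.0.0.0:49280->80/tcp", and captures the host port.
_PORT_RE = re.compile(r"0\.0\.0\.0:(\d+)->80/tcp")


def extract_host_port(ps_output: str) -> Optional[int]:
    """Return the first host port mapped to container port 80, or None."""
    match = _PORT_RE.search(ps_output)
    return int(match.group(1)) if match else None
```

The input would typically come from `subprocess.run(["docker", "compose", "ps"], capture_output=True, text=True).stdout`, executed in the benchmark directory. A `None` result means no container is publishing port 80 yet.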
**8 — Run a pentest** against the benchmark
python3 scripts/orchestrator.py host.docker.internal:${XBEN_PORT} \
--allocation weighted --model claude-sonnet-4-6
A full setup walk-through lives in [docs/setup.md](docs/setup.md). Every command APF supports, grouped by use case, is in [docs/commands.md](docs/commands.md).
## Reproduce the thesis ablations
**Set the Harder20 variable** (paste once per shell)
export HARDER20="XBEN-002-24,XBEN-006-24,XBEN-009-24,XBEN-029-24,XBEN-030-24,XBEN-034-24,XBEN-035-24,XBEN-038-24,XBEN-039-24,XBEN-040-24,XBEN-054-24,XBEN-056-24,XBEN-057-24,XBEN-060-24,XBEN-069-24,XBEN-077-24,XBEN-078-24,XBEN-080-24,XBEN-084-24,XBEN-097-24"
**1 — Claude Opus 4.7 / Easy10**
python3 scripts/benchmark_sweep.py -e apf-vs-baseline --model claude-opus-4-7
**2 — Claude Opus 4.7 / Harder20**
python3 scripts/benchmark_sweep.py -e apf-vs-baseline --model claude-opus-4-7 --benchmarks $HARDER20
**3 — Llama 3.3 70B / Easy10** (ZHAW Ollama, VPN required)
python3 scripts/benchmark_sweep.py -e apf-vs-baseline --model llama3.3
**4 — Llama 3.3 70B / Harder20**
python3 scripts/benchmark_sweep.py -e apf-vs-baseline --model llama3.3 --benchmarks $HARDER20
**5 — Gemma 4 26B (local Ollama) / Easy10**
python3 scripts/benchmark_sweep.py -e apf-vs-baseline --model ollama/gemma4:26b
**6 — Gemma 4 26B (local Ollama) / Harder20**
python3 scripts/benchmark_sweep.py -e apf-vs-baseline --model ollama/gemma4:26b --benchmarks $HARDER20
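The six commands above form a 3 x 2 matrix of models and benchmark sets, so they can be generated rather than pasted. The sketch below only builds the command strings and uses only the flags shown above. It assumes, as the commands imply, that omitting `--benchmarks` selects Easy10; `sweep_commands` is a hypothetical helper, not part of the repository.

```python
import shlex

# Same list as the HARDER20 shell variable above.
HARDER20 = (
    "XBEN-002-24,XBEN-006-24,XBEN-009-24,XBEN-029-24,XBEN-030-24,"
    "XBEN-034-24,XBEN-035-24,XBEN-038-24,XBEN-039-24,XBEN-040-24,"
    "XBEN-054-24,XBEN-056-24,XBEN-057-24,XBEN-060-24,XBEN-069-24,"
    "XBEN-077-24,XBEN-078-24,XBEN-080-24,XBEN-084-24,XBEN-097-24"
)
MODELS = ["claude-opus-4-7", "llama3.3", "ollama/gemma4:26b"]


def sweep_commands(models, harder20):
    """Return one Easy10 and one Harder20 command line per model, in order."""
    commands = []
    for model in models:
        base = ["python3", "scripts/benchmark_sweep.py",
                "-e", "apf-vs-baseline", "--model", model]
        # Easy10: no --benchmarks flag, matching steps 1, 3 and 5 above.
        commands.append(shlex.join(base))
        # Harder20: pass the comma-separated benchmark IDs explicitly.
        commands.append(shlex.join(base + ["--benchmarks", harder20]))
    return commands


if __name__ == "__main__":
    for command in sweep_commands(MODELS, HARDER20):
        print(command)
```

Printing the commands instead of executing them keeps the sketch free of side effects; each line can be reviewed and then run by hand or piped to a shell.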
## Where to find what
| Document | Purpose |
|----------|---------|
| [docs/setup.md](docs/setup.md) | Installation: Python, Docker, MCP, API keys |
| [docs/commands.md](docs/commands.md) | Every command grouped by use case (verify, pentest, sweeps, inspect, stop) |
| [docs/reproducibility.md](docs/reproducibility.md) | Thesis ablation recipes — copy-paste sweep commands |
| [docs/architecture.md](docs/architecture.md) | 7-phase pipeline, agent format, skill registry, allocation strategies |
| [docs/llm-config.md](docs/llm-config.md) | Switching LLMs, LiteLLM, local Ollama with `gemma4:26b`, Llama 3.3 notes |
| [docs/benchmarks.md](docs/benchmarks.md) | XBOW benchmarks — Easy10 / Harder20 definitions, per-target setup |
| [docs/development.md](docs/development.md) | Module-by-module source reference, custom-loop rationale, dev TODO |
| [docs/diagrams/](docs/diagrams/) | Editable `.drawio` sources and `.png` exports of the thesis figures (architecture overview, runtime sequence) — the same PNGs are embedded in the report |
| [docs/research/](docs/research/) | Thesis artefacts — ablation reports, evaluation metrics, earlier figure drafts |
| [docs/meeting-notes/](docs/meeting-notes/) | Weekly supervisor meeting minutes (Feb–May 2026), referenced from the report appendix |
| [docs/BA_Timeplan.pdf](docs/BA_Timeplan.pdf) | Project Gantt time plan (also embedded in the report appendix) |
| [AGENTS.md](AGENTS.md) | Agent reference: `task_response` schema, attack quick-reference, FLAG pattern |
| [CLAUDE.md](CLAUDE.md) | Project conventions used by Claude Code when editing this repo |
## Tech stack
| Component | Choice |
|-----------|--------|
| Orchestrator | Python 3.11, `scripts/orchestrator.py` |
| LLM proxy | LiteLLM (Docker) at `http://localhost:4000` |
| Default LLM | Llama 3.3 70B via ZHAW Ollama (no API key required); `claude-sonnet-4-6` for stronger configurations |
| Reproducibility model | `ollama/gemma4:26b` via local Ollama |
| Security tooling | HexStrike MCP — Kali Linux toolkit (~150 tools), Docker container at `:8889` |
| Web interaction | Playwright MCP (stdio via `npx`) |
| Tool routing | `MCPClientManager` — dynamic discovery, zero hardcoded tools |
## See also
- [Documentation index](docs/README.md) — full guide to setup, commands, architecture, benchmarks
- [AGENTS.md](AGENTS.md) — agent technical reference (`task_response` schema, FLAG pattern)
- [CLAUDE.md](CLAUDE.md) — project conventions for Claude Code editing this repo