steveh250/Prompt-Injection-Testing

GitHub: steveh250/Prompt-Injection-Testing


# Prompt Injection Testing

## Purpose

This repository tests and compares different approaches to defending AI agents against **prompt injection attacks**: malicious instructions embedded in external content (documents, emails, API responses) that try to hijack an LLM's behaviour.

## Repository Structure

    Prompt-Injection-Testing/
    ├── README.md                                                  # This file
    ├── security_agent-Prompt_INJECTION_And_Benign_DATASET.jsonl   # Shared test dataset
    ├── Ollama/                                                    # LLM-based inline security agent
    │   ├── README.md
    │   ├── ARCHITECTURE.md
    │   ├── security_agent.py
    │   └── test_security_agent.py
    └── MAF-FIDES/                                                 # FIDES content-labelling approach
        ├── README.md
        ├── ARCHITECTURE.md
        ├── fides_security_agent.py
        ├── test_fides_agent.py
        └── requirements.txt

## Shared Dataset

`security_agent-Prompt_INJECTION_And_Benign_DATASET.jsonl`

A curated dataset of **500 labelled prompts** (250 malicious, 250 benign) used by both test harnesses to enable direct comparison.

| Field | Description |
|---|---|
| `id` | Unique identifier (e.g. `pi-001`) |
| `prompt` | The raw text to classify |
| `label` | `malicious` or `benign` |
| `attack_type` | `code_execution`, `obfuscation`, `jailbreaking`, `data_leakage`, `role_playing`, or `none` |
| `context` | Human-readable description of the attack or query |
| `response` | Expected agent response |

**Attack type distribution:**

| Attack Type | Count |
|---|---|
| code_execution | 146 |
| obfuscation | 61 |
| data_leakage | 18 |
| jailbreaking | 17 |
| role_playing | 8 |
| none (benign) | 250 |

## Approaches Compared

### 1. Ollama: Inline LLM Fire Break

**Folder:** `Ollama/`

An **inline fire break** that sits between the document-extraction step and the downstream execution agent in the RFP Responder pipeline. Every extracted requirement is scanned before it is passed forward. If the security agent detects a prompt injection attack, the pipeline is **aborted immediately**, so the payload never reaches a downstream LLM that could act on it.
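The gating behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `Ollama/security_agent.py`: the names `Verdict`, `PipelineAborted`, `fire_break`, and `keyword_scan` are hypothetical, and the keyword check is only a stand-in for the LLM call that the real agent makes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Verdict:
    """Result of scanning one piece of extracted content (hypothetical type)."""
    is_malicious: bool
    reason: str = ""


class PipelineAborted(Exception):
    """Raised when the fire break refuses to pass content downstream."""


def fire_break(requirements: Iterable[str],
               scan: Callable[[str], Verdict]) -> List[str]:
    """Scan every requirement; abort on the first malicious verdict.

    Nothing is returned unless all items are judged benign, so a flagged
    payload never reaches the downstream agent.
    """
    passed: List[str] = []
    for index, text in enumerate(requirements):
        verdict = scan(text)
        if verdict.is_malicious:
            raise PipelineAborted(f"requirement {index} flagged: {verdict.reason}")
        passed.append(text)
    return passed


def keyword_scan(text: str) -> Verdict:
    """Stand-in scanner (assumption): the real agent asks an LLM instead."""
    if "ignore previous instructions" in text.lower():
        return Verdict(True, "instruction-override phrase")
    return Verdict(False)
```

A standalone caller could catch `PipelineAborted` and translate it into a non-zero exit status, such as the exit code 2 this README reports for standalone mode.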

    RFP document → extract requirements → [Security Agent] ──malicious──► ABORT
                                                           └──benign────► downstream agent

- The LLM receives the raw content and applies a detailed threat-detection system prompt.
- Two-phase analysis: per-node scan + full-structure scan for split-payload attacks.
- On malicious detection: pipeline halts (exit code 2 in standalone mode; `is_malicious: true` in A2A mode).
- On benign verdict: content is passed through to the next pipeline stage.

See [`Ollama/README.md`](Ollama/README.md) and [`Ollama/ARCHITECTURE.md`](Ollama/ARCHITECTURE.md).

### 2. MAF-FIDES: Content Labelling + Quarantine Isolation

**Folder:** `MAF-FIDES/`

An implementation of Microsoft's **FIDES** (Foundational Integration Defense for Execution Security) approach from the [Microsoft Agent Framework](https://github.com/microsoft/agent-framework/tree/main/python/samples/02-agents/security).

Rather than asking an LLM to detect attacks in raw content, FIDES **prevents** injection structurally:

- All external input is labelled `UNTRUSTED`.
- A middleware layer **hides** untrusted content behind an opaque variable reference before it reaches the main LLM.
- The main LLM never sees raw untrusted text; it only sees `[UNTRUSTED_CONTENT_REF: var_xxxxxxxx]`.
- When classification is needed, the agent calls a `quarantined_llm` tool that processes the hidden content in complete isolation with no tool access.

See [`MAF-FIDES/README.md`](MAF-FIDES/README.md) and [`MAF-FIDES/ARCHITECTURE.md`](MAF-FIDES/ARCHITECTURE.md).
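To make the hiding step concrete, here is a rough sketch of how untrusted text might be swapped for an opaque reference and later handed to an isolated classifier. It is an illustration under assumptions and does not use the Microsoft Agent Framework API: `VariableStore`, `hide`, `reveal`, and the `classify` callable are hypothetical names, and only the `[UNTRUSTED_CONTENT_REF: var_xxxxxxxx]` format and the `quarantined_llm` name come from the description above.

```python
import secrets
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class VariableStore:
    """Holds untrusted content out of the main LLM's context (hypothetical)."""
    _values: Dict[str, str] = field(default_factory=dict)

    def hide(self, untrusted_text: str) -> str:
        """Store the text and return only an opaque reference to it."""
        var_id = f"var_{secrets.token_hex(4)}"  # 8 hex chars, e.g. var_1a2b3c4d
        self._values[var_id] = untrusted_text
        return f"[UNTRUSTED_CONTENT_REF: {var_id}]"

    def reveal(self, var_id: str) -> str:
        """Return the raw text; intended for the quarantine path only."""
        return self._values[var_id]


def quarantined_llm(store: VariableStore, var_id: str,
                    classify: Callable[[str], str]) -> str:
    """Classify hidden content in isolation.

    `classify` stands in for an LLM call that has no tools attached
    (assumption); only its verdict string flows back to the caller.
    """
    return classify(store.reveal(var_id))
```

In this sketch the main agent's prompt would contain only the string returned by `hide`, so any instructions embedded in the untrusted text are never placed in front of a model that has tool access.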
## Key Distinction Between Approaches

| Dimension | Ollama Approach | FIDES Approach |
|---|---|---|
| **Pipeline role** | Inline fire break: aborts the pipeline on detection | Inline gate: blocks downstream tool calls on detection |
| **On malicious detection** | Pipeline halted immediately (abort / exit code 2) | Downstream agent actions blocked by policy enforcement |
| **On benign verdict** | Content passes through to the next pipeline stage | Content passes through; main agent proceeds normally |
| **Defence mechanism** | Probabilistic detection: LLM classifies raw content | Structural prevention (hiding) + probabilistic quarantine |
| **Raw content seen by main LLM** | Sentinel LLM reads the raw payload; the downstream LLM sees it only after a benign verdict | Never: raw payload is hidden behind a variable reference, and only the isolated quarantine LLM reads it |
| **Injection vector** | Sentinel LLM may be tricked by a sufficiently clever payload | Structurally closed for main agent; quarantine LLM is isolated |
| **Classification method** | Direct LLM analysis with security system prompt | Isolated quarantine LLM with explicit data-framing |
| **False negative risk** | Higher: novel attacks may fool the sentinel LLM | Lower: quarantine framing and isolation reduce susceptibility |
| **False positive risk** | Moderate | Moderate |
| **Explainability** | Full scratchpad reasoning in output | Full scratchpad reasoning from quarantine LLM + middleware event log |

## Running the Comparisons

Both harnesses produce the same set of metrics (accuracy, precision, recall, F1, confusion matrix) from the same dataset, making results directly comparable. Run each set of commands from the repository root.

    # Ollama approach
    cd Ollama
    python test_security_agent.py --limit 20   # quick test
    python test_security_agent.py              # full 500-prompt run

    # FIDES approach
    cd MAF-FIDES
    pip install -r requirements.txt
    python test_fides_agent.py --limit 20      # quick test
    python test_fides_agent.py                 # full 500-prompt run

Both scripts accept `--limit N`, `--start N`, and `--output path/to/results.json`.
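For readers who want to check the reported numbers by hand, the sketch below shows the standard definitions of these metrics, with `malicious` treated as the positive class. It is an assumption about how the metrics are defined, not a copy of either test harness, and `binary_metrics` is a hypothetical helper name.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str = "malicious") -> Dict[str, float]:
    """Compute confusion-matrix counts and the metrics derived from them."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # attack correctly flagged
            else:
                fp += 1  # benign prompt wrongly flagged
        else:
            if truth == positive:
                fn += 1  # attack missed
            else:
                tn += 1  # benign prompt correctly passed

    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Here recall is the fraction of malicious prompts that were caught, so a false negative (a missed attack) lowers recall, while a false positive (a blocked benign prompt) lowers precision.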
## Prerequisites

- **Ollama** running locally at `http://localhost:11434`
- **Granite 4** model pulled: `ollama pull granite4:latest`
- Python 3.11+
- The `openai` Python package: `pip install openai`