GitHub: eburke21/prompt-armor
PromptArmor is an automated red-team evaluation platform for testing LLM prompt-injection defenses, addressing the lack of systematic attack taxonomies and reproducible evaluation in existing tools.
# 🛡️ PromptArmor
**A prompt injection defense testing sandbox — explore real-world attacks, benchmark your defenses, and see exactly where they fail.**
Prompt injection is the [#1 vulnerability](https://owasp.org/www-project-top-10-for-large-language-model-applications/) in the OWASP Top 10 for LLM Applications. Yet most teams deploying LLMs have no systematic way to test whether their system prompt or guardrails actually resist known attacks. PromptArmor fills that gap.
## ✨ Features
### 🔍 Attack Taxonomy Browser
Explore **194,000+ real-world prompt injection attempts** sourced from 4 Hugging Face datasets. Every prompt is classified into one of 10 technique categories (instruction override, roleplay exploit, encoding tricks, context manipulation, and more) with difficulty ratings from 1–5.
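The actual ingestion classifier's rules aren't shown here; as a rough illustration of keyword-based technique classification, a toy sketch (the cue phrases below are simplified assumptions, not the project's real rule set):

```python
# Toy technique classifier -- the real ingestion pipeline is richer;
# these cue phrases are illustrative assumptions only.
TECHNIQUE_CUES = {
    "instruction_override": ["ignore previous instructions", "disregard the above"],
    "roleplay_exploit": ["pretend you are", "act as"],
    "encoding_tricks": ["base64", "rot13"],
}

def classify(prompt: str) -> str:
    """Return the first technique whose cue phrase appears in the prompt."""
    lowered = prompt.lower()
    for technique, cues in TECHNIQUE_CUES.items():
        if any(cue in lowered for cue in cues):
            return technique
    return "unclassified"
```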
### ⚔️ Defense Sandbox
Configure a multi-layered defense and stress-test it against real attacks:
- **System prompt hardening** — write your own or pick a preset (weak → strong)
- **Input filters** — keyword blocklist + OpenAI Moderation API with tunable thresholds
- **Output filters** — secret leak detection with exact string and regex matching
- **Attack selection** — choose techniques, difficulty range, prompt count, and benign mix ratio
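A minimal sketch of the two filter layers, assuming a hypothetical blocklist and canary secret (the project's real input filter can also call the OpenAI Moderation API):

```python
import re

# Hypothetical values for illustration only.
BLOCKLIST = ["ignore previous instructions", "reveal your system prompt"]
SECRET = "CANARY-1234"
SECRET_PATTERN = re.compile(r"CANARY-\d{4}")

def input_filter(prompt: str) -> bool:
    """Block the prompt before it reaches the LLM if a blocklisted phrase appears."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

def output_filter(response: str) -> bool:
    """Block the response if it leaks the secret (exact string or regex match)."""
    return SECRET in response or SECRET_PATTERN.search(response) is not None
```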
### 📊 Live Results & Scorecard
Watch results stream in real time via SSE as each prompt runs through your defense pipeline. When the run completes, you get a scorecard with:
- 🟢 Overall attack block rate (animated ring chart)
- 🔴 False positive rate
- 📈 Block rate by technique (which attacks get through?)
- 📉 Block rate by difficulty (does your defense scale?)
- 🧱 Blocks by defense layer (input filter vs. LLM refusal vs. output filter)
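The two headline numbers reduce to simple ratios; a sketch, assuming each per-prompt result records whether it was an attack and whether any layer blocked it:

```python
def scorecard(results: list[dict]) -> dict:
    """Aggregate per-prompt results into block rate and false-positive rate."""
    attacks = [r for r in results if r["is_attack"]]
    benign = [r for r in results if not r["is_attack"]]
    return {
        # Share of attack prompts stopped by any defense layer.
        "block_rate": sum(r["blocked"] for r in attacks) / max(len(attacks), 1),
        # Share of benign prompts wrongly blocked.
        "false_positive_rate": sum(r["blocked"] for r in benign) / max(len(benign), 1),
    }
```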
### 🔗 Shareable Results
Every eval run gets a unique URL (`/sandbox/:runId`). Bookmark it, share it with your team, or come back later — the scorecard persists.
## 🏗️ Architecture
```
┌───────────────────────────────────────────────────────┐
│ FRONTEND │
│ React · TypeScript · Chakra UI v3 │
│ │
│ Dashboard · Taxonomy · Sandbox · Results │
│ Browser Config Scorecard │
└───────────┬───────────────────────┬───────────────────┘
│ SSE stream │
▼ ▼
┌───────────────────────────────────────────────────────┐
│ BACKEND │
│ FastAPI · Python 3.12+ │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Dataset │ │ Defense │ │ Evaluation │ │
│ │ Service │ │ Pipeline │ │ Scoring │ │
│ │ │ │ │ │ │ │
│ │ Query & │ │ Input filter │ │ Classify │ │
│ │ filter │ │ → Claude LLM │ │ injection │ │
│ │ attacks │ │ → Output │ │ success & │ │
│ │ │ │ filter │ │ aggregate │ │
│ └────┬─────┘ └──────┬───────┘ └────────┬────────┘ │
│ │ │ │ │
│ ┌────▼────┐ ┌──────▼──────┐ ┌────────▼───────┐ │
│ │ SQLite │ │ Claude API │ │ OpenAI │ │
│ │ (local) │ │ (target) │ │ Moderation API │ │
│ └─────────┘ └─────────────┘ └────────────────┘ │
└───────────────────────────────────────────────────────┘
```
### 🔄 Defense Pipeline (per prompt)
```
User Prompt
│
▼
┌──────────────┐ blocked ┌──────────┐
│ Input Filter ├──────────────►│ BLOCKED │
│ (keyword / │ │ (skip │
│ moderation) │ │ LLM) │
└──────┬───────┘ └──────────┘
│ passed
▼
┌──────────────┐
│ Claude LLM │ ◄─── system prompt + user prompt
│ (target) │
└──────┬───────┘
│ response
▼
┌──────────────┐ blocked ┌──────────┐
│ Output Filter├──────────────►│ BLOCKED │
│ (secret leak │ │ │
│ detector) │ └──────────┘
└──────┬───────┘
│ passed
▼
┌──────────────┐
│ Score Result │ ──► injection succeeded? refused? false positive?
└──────────────┘
```
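The flow above can be sketched as a single function; `call_llm` stands in for the Claude API call, and the filter callables are assumptions, not the project's actual signatures:

```python
from typing import Callable

def run_prompt(
    prompt: str,
    input_filter: Callable[[str], bool],
    call_llm: Callable[[str], str],
    output_filter: Callable[[str], bool],
) -> dict:
    """Run one prompt through the three-stage defense pipeline."""
    if input_filter(prompt):
        return {"outcome": "blocked", "layer": "input_filter"}  # LLM call skipped
    response = call_llm(prompt)
    if output_filter(response):
        return {"outcome": "blocked", "layer": "output_filter"}
    return {"outcome": "passed", "response": response}
```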
## 🚀 Quick Start
### Prerequisites
- **Python 3.12+** (with [uv](https://docs.astral.sh/uv/) package manager)
- **Node.js 18+** (with npm)
- **Anthropic API key** (for Claude — the LLM target)
- **OpenAI API key** *(optional, for the moderation filter)*
### 1️⃣ Clone & configure
```bash
git clone https://github.com/eburke21/prompt-armor.git
cd prompt-armor
cp .env.example .env
# Edit .env and add your API keys:
# ANTHROPIC_API_KEY=sk-ant-...
# OPENAI_API_KEY=sk-...   (optional)
```
### 2️⃣ Ingest the datasets
```bash
cd backend
uv sync --all-extras
uv run python -m promptarmor.ingestion --skip-llm
# ⏱️ ~2-3 min — downloads 4 HF datasets, normalizes & classifies 194K prompts
```
### 3️⃣ Start the backend
```bash
uv run uvicorn promptarmor.main:app --port 8000 --reload
# ✅ API running at http://localhost:8000
# 📖 Swagger docs at http://localhost:8000/docs
```
### 4️⃣ Start the frontend
```bash
cd ../frontend
npm install
npm run dev
# ✅ App running at http://localhost:5173
```
### 5️⃣ Try it!
1. Open **http://localhost:5173**
2. Browse the attack taxonomy — 10 technique categories, 194K+ prompts
3. Go to **Sandbox** → pick a system prompt preset → enable a keyword blocklist → hit **Run Test**
4. Watch results stream in live → see your scorecard 📊
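The live results in step 4 arrive over Server-Sent Events; a minimal stdlib parser for a raw (already decoded) stream, where the event names and payload shapes below are assumptions rather than the API's documented schema:

```python
def parse_sse(raw: str) -> list[tuple[str, str]]:
    """Split a raw SSE stream into (event, data) pairs."""
    events: list[tuple[str, str]] = []
    event, data = "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and data:  # a blank line terminates one event
            events.append((event, "\n".join(data)))
            event, data = "message", []
    return events
```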
## 📁 Project Structure
```
prompt-armor/
├── backend/
│   ├── promptarmor/
│   │   ├── main.py              # FastAPI app entrypoint
│   │   ├── config.py            # Pydantic settings (env vars)
│   │   ├── database.py          # SQLite schema + async connection
│   │   ├── models/              # Pydantic v2 request/response models
│   │   ├── routers/             # API route handlers
│   │   │   ├── taxonomy.py      # GET /api/v1/taxonomy
│   │   │   ├── attacks.py       # GET /api/v1/attacks
│   │   │   ├── system_prompts.py # GET /api/v1/system-prompts
│   │   │   └── eval.py          # POST + SSE /api/v1/eval/run
│   │   ├── services/            # Business logic
│   │   │   ├── filters.py       # Input filter pipeline
│   │   │   ├── output_filters.py # Output filter pipeline
│   │   │   ├── llm_target.py    # Claude API executor
│   │   │   ├── scoring.py       # Injection classifier + scorecard
│   │   │   ├── attack_selector.py # Stratified attack sampling
│   │   │   └── eval_runner.py   # Pipeline orchestrator (SSE generator)
│   │   ├── middleware/          # Rate limiting
│   │   └── ingestion/           # HF dataset download + classification
│   └── tests/                   # 91 tests (pytest + pytest-asyncio)
├── frontend/
│   └── src/
│       ├── api/                 # Typed fetch client + SSE helper
│       ├── components/          # Layout, LiveResultsStream, ScorecardView
│       ├── pages/               # Dashboard, TaxonomyBrowser, Sandbox, etc.
│       └── theme/               # Chakra UI v3 system config + constants
├── data/                        # SQLite DB (generated by ingestion)
├── docker-compose.yml           # Container setup (WIP)
└── .env.example                 # Required environment variables
```
## 🧪 Testing
```bash
cd backend

# Run all 91 tests
uv run pytest -v

# Run specific test suites
uv run pytest tests/test_filters.py -v     # 🔒 Input/output filter tests
uv run pytest tests/test_scoring.py -v     # 📊 Scoring + scorecard tests
uv run pytest tests/test_rate_limit.py -v  # 🚦 Rate limiter tests
uv run pytest tests/test_classifier.py -v  # 🏷️ Technique classifier tests

# Linting & type checking
uv run ruff check .        # 🧹 Lint (zero issues)
uv run mypy promptarmor/   # 🔍 Strict type check (zero issues)
```
```bash
cd frontend

# Type check & lint
npx tsc --noEmit   # ✅ Zero TypeScript errors
npx eslint src/    # ✅ Zero lint errors

# Production build
npm run build      # 📦 Builds to dist/
```
## 🛠️ Tech Stack
| Layer | Technology | Why |
|-------|-----------|-----|
| 🐍 Backend | **FastAPI** + Python 3.12 | Async-first, Pydantic v2 validation, auto-generated OpenAPI docs |
| 💾 Database | **SQLite** + aiosqlite | Zero-config, single-file DB perfect for local-first tool |
| ⚛️ Frontend | **React 19** + TypeScript + Vite | Type safety, fast HMR, modern tooling |
| 🎨 UI | **Chakra UI v3** | Component composition, dark mode, accessible by default |
| 📊 Charts | **Recharts** | Declarative charts with React components |
| 🔄 Data fetching | **TanStack Query** | Caching, background refetch, conditional polling |
| 📡 Real-time | **SSE** (Server-Sent Events) | Simpler than WebSockets for unidirectional streaming |
| 🤖 LLM | **Claude** (Anthropic API) | Target model for defense testing |
| 🛑 Moderation | **OpenAI Moderation API** | Free content classification as an input filter layer |
| 📦 Package mgmt | **uv** (Python) + npm | Fast, modern dependency resolution |
| ✅ Quality | **ruff** + **mypy** (strict) + **ESLint** + **Prettier** | Zero-tolerance linting, full type coverage |
## 📊 Datasets
PromptArmor ingests and normalizes prompts from 4 Hugging Face datasets:
| Dataset | Prompts | Type | License |
|---------|---------|------|---------|
| 🏰 [Lakera/mosscap](https://huggingface.co/datasets/Lakera/mosscap_prompt_injection) | ~173K | DEF CON 31 CTF attacks | MIT |
| 📝 [SPML Chatbot](https://huggingface.co/datasets/reshabhs/SPML_Chatbot_Prompt_Injection) | ~16K | System prompt + injection pairs | MIT |
| 🧪 [neuralchemy](https://huggingface.co/datasets/neuralchemy/Prompt-injection-dataset) | ~4.6K | Balanced injection/benign | Apache 2.0 |
| 🏷️ [deepset](https://huggingface.co/datasets/deepset/prompt-injections) | ~662 | Clean labeled (EN + DE) | Apache 2.0 |
**Total: 194,202 prompts** across 10 technique categories + unclassified, with difficulty estimates 1–5.
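The sandbox's attack selection draws a stratified sample so every chosen technique is represented; a sketch under the assumption that each ingested prompt row carries a `technique` field (the project's real `attack_selector` may also stratify by difficulty):

```python
import random

def stratified_sample(prompts: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Pick up to n prompts, spread roughly evenly across technique categories."""
    rng = random.Random(seed)
    by_technique: dict[str, list[dict]] = {}
    for p in prompts:
        by_technique.setdefault(p["technique"], []).append(p)
    per_bucket = max(1, n // len(by_technique))
    picked: list[dict] = []
    for group in by_technique.values():
        picked.extend(rng.sample(group, min(per_bucket, len(group))))
    return picked[:n]
```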
## 🗺️ Roadmap
- [x] 📦 Phase 1 — Project scaffolding & database foundation
- [x] 📥 Phase 2 — Dataset ingestion pipeline (4 HF datasets, 194K prompts)
- [x] ⚔️ Phase 3 — Backend defense pipeline (filters → LLM → scoring → SSE)
- [x] 🖥️ Phase 4 — Frontend (taxonomy browser, sandbox, live results, scorecard)
- [ ] ⚖️ Phase 5 — Comparison mode (side-by-side defense evaluation)
- [ ] 📄 Phase 6 — Red team report generator (Markdown export)
- [ ] 🚢 Phase 7 — Polish, deploy, demo video
## 📜 License
MIT
🛡️ Built with FastAPI, React, and Claude