eburke21/prompt-armor

GitHub: eburke21/prompt-armor

PromptArmor 是一个用于测试 LLM 提示注入防御的自动化红队评估平台，解决现有工具缺乏系统化攻击分类与可追溯评测的问题。

Stars: 0 | Forks: 0

# 🛡️ PromptArmor **一个 prompt injection 防御测试沙盒 —— 探索真实世界的攻击，对您的防御进行基准测试，并看清它们究竟在哪里失效。** Prompt injection 是 OWASP LLM 应用 Top 10 安全风险中的[第一大漏洞](https://owasp.org/www-project-top-10-for-large-language-model-applications/)。然而，大多数部署 LLM 的团队都没有一种系统的方法来测试他们的 system prompt 或 guardrails 是否真正能够抵御已知的攻击。PromptArmor 填补了这一空白。 ## ✨ 功能 ### 🔍 攻击分类浏览器探索源自 4 个 Hugging Face 数据集的 **194,000+ 真实世界 prompt injection 尝试**。每个 prompt 都被归类到 10 种技术类别之一（指令重写、角色扮演漏洞利用、编码欺骗、上下文操纵等），并带有 1–5 的难度评级。 ### ⚔️ 防御沙盒配置多层防御，并使用真实攻击对其进行压力测试： - **System prompt 强化** — 编写您自己的预设，或选择一个预设（弱 → 强） - **输入过滤器** — 关键词黑名单 + 带有可调阈值的 OpenAI Moderation API - **输出过滤器** — 使用精确字符串和正则匹配进行机密泄露检测 - **攻击选择** — 选择技术、难度范围、prompt 数量以及无害样本混合比例 ### 📊 实时结果与记分卡通过 SSE 实时观看随着每个 prompt 运行通过您的防御 pipeline 而流式传输的结果。完成后，您将获得一个包含以下内容的记分卡： - 🟢 整体攻击拦截率（动态环形图） - 🔴 误报率 - 📈 按技术分类的拦截率（哪些攻击突破防线了？） - 📉 按难度分类的拦截率（您的防御具备扩展性吗？） - 🧱 按防御层级的拦截分布（输入过滤器 vs. LLM 拒绝 vs. 输出过滤器） ### 🔗 可分享的结果每次评估运行都会获得一个唯一的 URL (`/sandbox/:runId`)。将其添加到书签、与您的团队分享，或稍后再回来查看 — 记分卡会持久保存。 ## 🏗️ 架构 ``` ┌───────────────────────────────────────────────────────┐ │ FRONTEND │ │ React · TypeScript · Chakra UI v3 │ │ │ │ Dashboard · Taxonomy · Sandbox · Results │ │ Browser Config Scorecard │ └───────────┬───────────────────────┬───────────────────┘ │ SSE stream │ ▼ ▼ ┌───────────────────────────────────────────────────────┐ │ BACKEND │ │ FastAPI · Python 3.12+ │ │ │ │ ┌──────────┐ ┌──────────────┐ ┌─────────────────┐ │ │ │ Dataset │ │ Defense │ │ Evaluation │ │ │ │ Service │ │ Pipeline │ │ Scoring │ │ │ │ │ │ │ │ │ │ │ │ Query & │ │ Input filter │ │ Classify │ │ │ │ filter │ │ → Claude LLM │ │ injection │ │ │ │ attacks │ │ → Output │ │ success & │ │ │ │ │ │ filter │ │ aggregate │ │ │ └────┬─────┘ └──────┬───────┘ └────────┬────────┘ │ │ │ │ │ │ │ ┌────▼────┐ ┌──────▼──────┐ ┌────────▼───────┐ │ │ │ SQLite │ │ Claude API │ │ OpenAI │ │ │ │ (local) │ │ (target) │ │ Moderation API │ │ │ └─────────┘ └─────────────┘ └────────────────┘ │ └───────────────────────────────────────────────────────┘ ``` ### 🔄 防御 Pipeline（每个 prompt） ``` User Prompt │ ▼ ┌──────────────┐ blocked ┌──────────┐ │ Input Filter ├──────────────►│ BLOCKED │ │ (keyword / │ │ (skip │ │ moderation) │ │ LLM) │ └──────┬───────┘ └──────────┘ │ passed ▼ ┌──────────────┐ │ Claude LLM │ ◄─── system prompt + user prompt │ (target) │ └──────┬───────┘ │ response ▼ ┌──────────────┐ blocked ┌──────────┐ │ Output Filter├──────────────►│ BLOCKED │ │ (secret leak │ │ │ │ detector) │ └──────────┘ └──────┬───────┘ │ passed ▼ ┌──────────────┐ │ Score Result │ ──► injection succeeded? refused? false positive? └──────────────┘ ``` ## 🚀 快速开始 ### 前置条件 - **Python 3.12+**（带有 [uv](https://docs.astral.sh/uv/) 包管理器） - **Node.js 18+**（带有 npm） - **Anthropic API key**（用于 Claude — LLM 目标） - **OpenAI API key** *（可选，用于审核过滤器）* ### 1️⃣ 克隆与配置 ``` git clone https://github.com/your-username/prompt-armor.git cd prompt-armor cp .env.example .env # 编辑 .env 并添加你的 API keys： # ANTHROPIC_API_KEY=sk-ant-... # OPENAI_API_KEY=sk-... （可选） ``` ### 2️⃣ 导入数据集 ``` cd backend uv sync --all-extras uv run python -m promptarmor.ingestion --skip-llm # ⏱️ 约 2-3 分钟 — 下载 4 个 HF datasets，对 19.4 万条 prompts 进行标准化与分类 ``` ### 3️⃣ 启动后端 ``` uv run uvicorn promptarmor.main:app --port 8000 --reload # ✅ API 运行于 http://localhost:8000 # 📖 Swagger 文档位于 http://localhost:8000/docs ``` ### 4️⃣ 启动前端 ``` cd ../frontend npm install npm run dev # ✅ App 运行于 http://localhost:5173 ``` ### 5️⃣ 开始体验！ 1. 打开 **http://localhost:5173** 2. 浏览攻击分类 —— 10 种技术类别，194K+ 个 prompt 3. 前往 **Sandbox** → 选择一个 system prompt 预设 → 启用关键词黑名单 → 点击 **Run Test** 4. 观看实时流式传输的结果 → 查看您的记分卡 📊 ## 📁 项目结构 ``` prompt-armor/ ├── backend/ │ ├── promptarmor/ │ │ ├── main.py # FastAPI app entrypoint │ │ ├── config.py # Pydantic settings (env vars) │ │ ├── database.py # SQLite schema + async connection │ │ ├── models/ # Pydantic v2 request/response models │ │ ├── routers/ # API route handlers │ │ │ ├── taxonomy.py # GET /api/v1/taxonomy │ │ │ ├── attacks.py # GET /api/v1/attacks │ │ │ ├── system_prompts.py# GET /api/v1/system-prompts │ │ │ └── eval.py # POST + SSE /api/v1/eval/run │ │ ├── services/ # Business logic │ │ │ ├── filters.py # Input filter pipeline │ │ │ ├── output_filters.py# Output filter pipeline │ │ │ ├── llm_target.py # Claude API executor │ │ │ ├── scoring.py # Injection classifier + scorecard │ │ │ ├── attack_selector.py# Stratified attack sampling │ │ │ └── eval_runner.py # Pipeline orchestrator (SSE generator) │ │ ├── middleware/ # Rate limiting │ │ └── ingestion/ # HF dataset download + classification │ └── tests/ # 91 tests (pytest + pytest-asyncio) ├── frontend/ │ └── src/ │ ├── api/ # Typed fetch client + SSE helper │ ├── components/ # Layout, LiveResultsStream, ScorecardView │ ├── pages/ # Dashboard, TaxonomyBrowser, Sandbox, etc. │ └── theme/ # Chakra UI v3 system config + constants ├── data/ # SQLite DB (generated by ingestion) ├── docker-compose.yml # Container setup (WIP) └── .env.example # Required environment variables ``` ## 🧪 测试 ``` cd backend # 运行全部 91 个测试 uv run pytest -v # 运行特定的 test suites uv run pytest tests/test_filters.py -v # 🔒 Input/output filter tests uv run pytest tests/test_scoring.py -v # 📊 Scoring + scorecard tests uv run pytest tests/test_rate_limit.py -v # 🚦 Rate limiter tests uv run pytest tests/test_classifier.py -v # 🏷️ Technique classifier tests # Linting 与 type checking uv run ruff check . # 🧹 Lint (zero issues) uv run mypy promptarmor/ # 🔍 Strict type check (zero issues) ``` ``` cd frontend # Type check 与 lint npx tsc --noEmit # ✅ Zero TypeScript errors npx eslint src/ # ✅ Zero lint errors # Production build npm run build # 📦 Builds to dist/ ``` ## 🛠️ 技术栈 | 层级 | 技术 | 原因 | |-------|-----------|-----| | 🐍 后端 | **FastAPI** + Python 3.12 | Async 优先，Pydantic v2 验证，自动生成 OpenAPI 文档 | | 💾 数据库 | **SQLite** + aiosqlite | 零配置，单文件 DB，非常适合本地优先工具 | | ⚛️ 前端 | **React 19** + TypeScript + Vite | 类型安全，快速 HMR，现代化工具 | | 🎨 UI | **Chakra UI v3** | 组件组合，暗黑模式，默认无障碍访问 | | 📊 图表 | **Recharts** | 使用 React 组件的声明式图表 | | 🔄 数据获取 | **TanStack Query** | 缓存，后台重新获取，条件轮询 | | 📡 实时通信 | **SSE** (Server-Sent Events) | 对于单向流式传输比 WebSockets 更简单 | | 🤖 LLM | **Claude** (Anthropic API) | 用于防御测试的目标模型 | | 🛑 审核 | **OpenAI Moderation API** | 免费的内容分类，作为输入过滤层 | | 📦 包管理 | **uv** (Python) + npm | 快速、现代化的依赖解析 | | ✅ 质量 | **ruff** + **mypy** (严格模式) + **ESLint** + **Prettier** | 零容忍的 Linting，完整的类型覆盖 | ## 📊 数据集 PromptArmor 导入并标准化了来自 4 个 Hugging Face 数据集的 prompt： | 数据集 | Prompts | 类型 | 许可证 | |---------|---------|------|---------| | 🏰 [Lakera/mosscap](https://huggingface.co/datasets/Lakera/mosscap_prompt_injection) | ~173K | DEF CON 31 CTF 攻击 | MIT | | 📝 [SPML Chatbot](https://huggingface.co/datasets/reshabhs/SPML_Chatbot_Prompt_Injection) | ~16K | System prompt + injection 对 | MIT | | 🧪 [neuralchemy](https://huggingface.co/datasets/neuralchemy/Prompt-injection-dataset) | ~4.6K | 平衡的 injection/无害样本 | Apache 2.0 | | 🏷️ [deepset](https://huggingface.co/datasets/deepset/prompt-injections) | ~662 | 干净的标注数据 (EN + DE) | Apache 2.0 | **总计：194,202 个 prompt**，涵盖 10 种技术类别 + 未分类，难度估计为 1–5。 ## 🗺️ 路线图 - [x] 📦 阶段 1 — 项目脚手架与数据库基础 - [x] 📥 阶段 2 — 数据集导入 pipeline（4 个 HF 数据集，194K 个 prompt） - [x] ⚔️ 阶段 3 — 后端防御 pipeline（过滤器 → LLM → 评分 → SSE） - [x] 🖥️ 阶段 4 — 前端（分类浏览器、沙盒、实时结果、记分卡） - [ ] ⚖️ 阶段 5 — 对比模式（并排防御评估） - [ ] 📄 阶段 6 — 红队报告生成器（Markdown 导出） - [ ] 🚢 阶段 7 — 优化、部署、演示视频 ## 📜 许可证 MIT

🛡️ 使用 FastAPI、React 和 Claude 构建

标签：DLL 劫持, LLM安全, 人工智能, 大语言模型, 测试工具, 用户模式Hook绕过, 红队评估