parthtiwari1/prompt-injection-capstone

GitHub: parthtiwari1/prompt-injection-capstone

一个系统性的 LLM 提示词注入攻防研究框架,通过创新的 Chain-of-Thought Warden Agent 在模型推理过程中实时检测攻击,并对比评估多种攻击手段和传统防御措施的效果。

Stars: 0 | Forks: 0

# Prompt Injection 毕业设计项目 ## 使用 Chain-of-Thought Warden Agent 提高 LLM 聊天机器人的 Prompt Injection 防御能力 **大学:** UTS — 36105 iLab: Capstone Project **学期:** 第 4 学期 **客户:** Dr. Angelica Chowdhury ## 👥 团队 - Parth Tiwari - Ryan Cruikshank - Larry Iglesias - Rasyid Sahindra - Masaru Nagaishi - Masaki Kawakami ## 📌 项目概述 大型语言模型 容易受到 prompt injection 攻击,恶意输入会覆盖模型的原始指令。本项目: 1. **评估** 已知的 prompt injection 攻击和传统防御措施在多个 LLM 上的表现 2. **提出并评估** 一种新颖的 Chain-of-Thought (CoT) Warden agent,该 agent 实时监控中间推理步骤,以便比传统的基于输出的防御更早地检测到注入劫持 ## 📁 仓库结构 ``` prompt-injection-capstone/ ├── attacks/ │ ├── direct_injection.py # Instruction override, prompt leakage, context confusion │ ├── indirect_injection.py # Document/attachment-based injection │ ├── roleplay_injection.py # DAN, persona hijack, fictional framing │ ├── goal_hijacking.py # Appending malicious tasks to legitimate requests │ ├── obfuscation.py # Base64, ROT13, leetspeak, multilingual attacks │ ├── run_qwen.py # Run all attacks on Qwen 2.5 │ └── run_gemini.py # Run all attacks on Gemini Flash ├── mitigations/ │ ├── prompt_hardening.py # Explicit counter-instruction in system prompt │ ├── input_sanitisation.py # Pattern-based filtering before LLM call │ ├── output_filtering.py # Response scanning after LLM generation │ └── spotlighting.py # XML delimiter tagging (Hines et al., 2024) ├── warden/ │ ├── warden_agent.py # Rule-based Warden (pattern matching) │ ├── warden_llm.py # LLM-based Warden (second LLM as judge) │ └── warden_classifier.py # Classifier-based Warden (TF-IDF keyword classifier) ├── evaluation/ │ ├── evaluate.py # Single attack vs defence comparison │ └── evaluate_all.py # Master evaluation across all attacks and defences ├── results/ │ └── *.json # All experimental results stored here └── README.md ``` ## 🚀 入门指南 ### 前置条件 - Python 3.x - 本地已安装 [Ollama](https://ollama.com) - 已拉取 Llama 3.2 和 Qwen 2.5 模型 - Google Gemini API key (可在 https://aistudio.google.com 免费获取) ### 安装 ``` # Clone repository git clone https://github.com/parthtiwari1/prompt-injection-capstone.git cd prompt-injection-capstone # Create virtual environment python -m venv venv # Activate virtual environment # On Windows: venv\Scripts\activate # On Mac/Linux: source venv/bin/activate # Install dependencies pip install ollama google-generativeai ``` ### 拉取 LLM 模型 ``` ollama pull llama3.2 ollama pull qwen2.5 ``` ### 设置 Gemini API key ``` # Windows $env:GEMINI_API_KEY="your-key-here" # Mac/Linux export GEMINI_API_KEY="your-key-here" ``` ## ▶️ 运行实验 ### 运行攻击 ``` # Llama 3.2 python attacks/direct_injection.py python attacks/indirect_injection.py python attacks/roleplay_injection.py python attacks/goal_hijacking.py python attacks/obfuscation.py # Qwen 2.5 python attacks/run_qwen.py # Gemini Flash python attacks/run_gemini.py ``` ### 运行防御 ``` python mitigations/prompt_hardening.py python mitigations/input_sanitisation.py python mitigations/output_filtering.py python mitigations/spotlighting.py ``` ### 运行 Warden agent ``` python warden/warden_agent.py # Rule-based python warden/warden_llm.py # LLM-based python warden/warden_classifier.py # Classifier-based ``` ### 运行主评估 ``` python evaluation/evaluate_all.py ``` ## 📊 结果 ### 攻击基线 (无防御) | Attack Type | Llama 3.2 | Qwen 2.5 | Gemini Flash | |---|---|---|---| | Direct Injection | 0% | 0% | 0% | | Indirect/Document | 20% | 33.3% | 0% | | Role-play | 33.3% | 0% | 0% | | Goal Hijacking | 20% | 33.3% | 0% | | Obfuscation | 33.3% | 33.3% | 0% | | **Overall** | **20%** | **23.3%** | **0%** | ### 传统防御对比 (Llama 3.2) | Defence | ASR | Improvement | |---|---|---| | No Defence (baseline) | 20.0% | — | | Output Filtering | 20.0% | ▼ 0.0pp | | Prompt Hardening | 16.7% | ▼ 3.3pp | | Spotlighting | 10.0% | ▼ 10.0pp | | **Input Sanitisation** | **6.7%** | **▼ 13.3pp ✅ Best** | ### Warden Agent 结果 (Llama 3.2) | Warden Variant | Detection Rate | MTTD | Speed | |---|---|---|---| | Rule-based | 13.3% | 1.0 steps | Fast | | LLM-based | 20.0% | 1.5 steps | Slow | | **Classifier-based** | **56.7%** | **1.2 steps** | **Fast ✅ Best** | ## 🔑 关键发现 1. **Gemini Flash** 阻止了所有 30 次攻击,ASR 为 0% — 内置安全性最强 2. **Llama 3.2** 最容易受到 role-play 和 obfuscation 攻击 (ASR 均为 33.3%) 3. **Qwen 2.5** 最容易受到网页内容注入和虚构框架攻击 (ASR 为 100%) 4. **Input Sanitisation** 是最好的传统防御措施 (在 Llama 3.2 上 ASR 为 6.7%) 5. **基于 Classifier 的 Warden** 达到了 56.7% 的检测率,MTTD 为 1.2 步 6. Warden agent 在 **推理层面** 捕获攻击,防止产生有害输出 ## 📚 主要参考文献 - Liu, Yupei et al. (2023). Formalizing and Benchmarking Prompt Injection Attacks and Defenses. https://arxiv.org/abs/2310.12815 - Liu, Yi et al. (2023). Prompt Injection Attack Against LLM-Integrated Applications. https://arxiv.org/abs/2306.05499 - Greshake, K. et al. (2023). Not What You've Signed Up For. https://arxiv.org/abs/2302.12173 - Hines, K. et al. (2024). Defending Against Indirect Prompt Injection with Spotlighting. https://arxiv.org/abs/2403.14720 - Korbak, T. et al. (2025). Chain of Thought Monitorability. https://arxiv.org/abs/2507.11473 - Robey, A. et al. (2023). SmoothLLM. https://arxiv.org/abs/2310.03684 ## 📝 评估指标 - **ASR (Attack Success Rate):** 成功绕过防御的攻击比例 - **MTTD (Mean Time to Detect):** Warden 检测到劫持前的平均 CoT 步数 - **MTTR (Mean Time to Respond):** 从检测到响应抑制的时间 - **FPR (False Positive Rate):** 被错误阻止的合法请求的比例 ## 🔮 后续步骤 - [ ] 使用 LLM-as-judge 评估以获得更准确的 ASR 测量 - [ ] 在 Qwen 2.5 上进行跨模型 Warden 评估 - [ ] False positive rate 测量 - [ ] 最终报告汇编 - [ ] 实时演示准备
标签:AI安全, AI风险缓解, Chain-of-Thought, Chat Copilot, CoT, DLL 劫持, Gemini, Naabu, Python, Qwen, UTS, Warden Agent, 上下文混淆, 内容安全, 大语言模型, 思维链, 提示词加固, 无后门, 智能代理, 毕业设计, 目标劫持, 网络安全, 输入过滤, 输出过滤, 逆向工具, 隐私保护