parthtiwari1/prompt-injection-capstone
GitHub: parthtiwari1/prompt-injection-capstone
一个系统性的 LLM 提示词注入攻防研究框架,通过创新的 Chain-of-Thought Warden Agent 在模型推理过程中实时检测攻击,并对比评估多种攻击手段和传统防御措施的效果。
Stars: 0 | Forks: 0
# Prompt Injection 毕业设计项目
## 使用 Chain-of-Thought Warden Agent 提高 LLM 聊天机器人的 Prompt Injection 防御能力
**大学:** UTS — 36105 iLab: Capstone Project
**学期:** 第 4 学期
**客户:** Dr. Angelica Chowdhury
## 👥 团队
- Parth Tiwari
- Ryan Cruikshank
- Larry Iglesias
- Rasyid Sahindra
- Masaru Nagaishi
- Masaki Kawakami
## 📌 项目概述
大型语言模型 容易受到 prompt injection 攻击,恶意输入会覆盖模型的原始指令。本项目:
1. **评估** 已知的 prompt injection 攻击和传统防御措施在多个 LLM 上的表现
2. **提出并评估** 一种新颖的 Chain-of-Thought (CoT) Warden agent,该 agent 实时监控中间推理步骤,以便比传统的基于输出的防御更早地检测到注入劫持
## 📁 仓库结构
```
prompt-injection-capstone/
├── attacks/
│ ├── direct_injection.py # Instruction override, prompt leakage, context confusion
│ ├── indirect_injection.py # Document/attachment-based injection
│ ├── roleplay_injection.py # DAN, persona hijack, fictional framing
│ ├── goal_hijacking.py # Appending malicious tasks to legitimate requests
│ ├── obfuscation.py # Base64, ROT13, leetspeak, multilingual attacks
│ ├── run_qwen.py # Run all attacks on Qwen 2.5
│ └── run_gemini.py # Run all attacks on Gemini Flash
├── mitigations/
│ ├── prompt_hardening.py # Explicit counter-instruction in system prompt
│ ├── input_sanitisation.py # Pattern-based filtering before LLM call
│ ├── output_filtering.py # Response scanning after LLM generation
│ └── spotlighting.py # XML delimiter tagging (Hines et al., 2024)
├── warden/
│ ├── warden_agent.py # Rule-based Warden (pattern matching)
│ ├── warden_llm.py # LLM-based Warden (second LLM as judge)
│ └── warden_classifier.py # Classifier-based Warden (TF-IDF keyword classifier)
├── evaluation/
│ ├── evaluate.py # Single attack vs defence comparison
│ └── evaluate_all.py # Master evaluation across all attacks and defences
├── results/
│ └── *.json # All experimental results stored here
└── README.md
```
## 🚀 入门指南
### 前置条件
- Python 3.x
- 本地已安装 [Ollama](https://ollama.com)
- 已拉取 Llama 3.2 和 Qwen 2.5 模型
- Google Gemini API key (可在 https://aistudio.google.com 免费获取)
### 安装
```
# Clone repository
git clone https://github.com/parthtiwari1/prompt-injection-capstone.git
cd prompt-injection-capstone
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On Mac/Linux:
source venv/bin/activate
# Install dependencies
pip install ollama google-generativeai
```
### 拉取 LLM 模型
```
ollama pull llama3.2
ollama pull qwen2.5
```
### 设置 Gemini API key
```
# Windows
$env:GEMINI_API_KEY="your-key-here"
# Mac/Linux
export GEMINI_API_KEY="your-key-here"
```
## ▶️ 运行实验
### 运行攻击
```
# Llama 3.2
python attacks/direct_injection.py
python attacks/indirect_injection.py
python attacks/roleplay_injection.py
python attacks/goal_hijacking.py
python attacks/obfuscation.py
# Qwen 2.5
python attacks/run_qwen.py
# Gemini Flash
python attacks/run_gemini.py
```
### 运行防御
```
python mitigations/prompt_hardening.py
python mitigations/input_sanitisation.py
python mitigations/output_filtering.py
python mitigations/spotlighting.py
```
### 运行 Warden agent
```
python warden/warden_agent.py # Rule-based
python warden/warden_llm.py # LLM-based
python warden/warden_classifier.py # Classifier-based
```
### 运行主评估
```
python evaluation/evaluate_all.py
```
## 📊 结果
### 攻击基线 (无防御)
| Attack Type | Llama 3.2 | Qwen 2.5 | Gemini Flash |
|---|---|---|---|
| Direct Injection | 0% | 0% | 0% |
| Indirect/Document | 20% | 33.3% | 0% |
| Role-play | 33.3% | 0% | 0% |
| Goal Hijacking | 20% | 33.3% | 0% |
| Obfuscation | 33.3% | 33.3% | 0% |
| **Overall** | **20%** | **23.3%** | **0%** |
### 传统防御对比 (Llama 3.2)
| Defence | ASR | Improvement |
|---|---|---|
| No Defence (baseline) | 20.0% | — |
| Output Filtering | 20.0% | ▼ 0.0pp |
| Prompt Hardening | 16.7% | ▼ 3.3pp |
| Spotlighting | 10.0% | ▼ 10.0pp |
| **Input Sanitisation** | **6.7%** | **▼ 13.3pp ✅ Best** |
### Warden Agent 结果 (Llama 3.2)
| Warden Variant | Detection Rate | MTTD | Speed |
|---|---|---|---|
| Rule-based | 13.3% | 1.0 steps | Fast |
| LLM-based | 20.0% | 1.5 steps | Slow |
| **Classifier-based** | **56.7%** | **1.2 steps** | **Fast ✅ Best** |
## 🔑 关键发现
1. **Gemini Flash** 阻止了所有 30 次攻击,ASR 为 0% — 内置安全性最强
2. **Llama 3.2** 最容易受到 role-play 和 obfuscation 攻击 (ASR 均为 33.3%)
3. **Qwen 2.5** 最容易受到网页内容注入和虚构框架攻击 (ASR 为 100%)
4. **Input Sanitisation** 是最好的传统防御措施 (在 Llama 3.2 上 ASR 为 6.7%)
5. **基于 Classifier 的 Warden** 达到了 56.7% 的检测率,MTTD 为 1.2 步
6. Warden agent 在 **推理层面** 捕获攻击,防止产生有害输出
## 📚 主要参考文献
- Liu, Yupei et al. (2023). Formalizing and Benchmarking Prompt Injection Attacks and Defenses. https://arxiv.org/abs/2310.12815
- Liu, Yi et al. (2023). Prompt Injection Attack Against LLM-Integrated Applications. https://arxiv.org/abs/2306.05499
- Greshake, K. et al. (2023). Not What You've Signed Up For. https://arxiv.org/abs/2302.12173
- Hines, K. et al. (2024). Defending Against Indirect Prompt Injection with Spotlighting. https://arxiv.org/abs/2403.14720
- Korbak, T. et al. (2025). Chain of Thought Monitorability. https://arxiv.org/abs/2507.11473
- Robey, A. et al. (2023). SmoothLLM. https://arxiv.org/abs/2310.03684
## 📝 评估指标
- **ASR (Attack Success Rate):** 成功绕过防御的攻击比例
- **MTTD (Mean Time to Detect):** Warden 检测到劫持前的平均 CoT 步数
- **MTTR (Mean Time to Respond):** 从检测到响应抑制的时间
- **FPR (False Positive Rate):** 被错误阻止的合法请求的比例
## 🔮 后续步骤
- [ ] 使用 LLM-as-judge 评估以获得更准确的 ASR 测量
- [ ] 在 Qwen 2.5 上进行跨模型 Warden 评估
- [ ] False positive rate 测量
- [ ] 最终报告汇编
- [ ] 实时演示准备
标签:AI安全, AI风险缓解, Chain-of-Thought, Chat Copilot, CoT, DLL 劫持, Gemini, Naabu, Python, Qwen, UTS, Warden Agent, 上下文混淆, 内容安全, 大语言模型, 思维链, 提示词加固, 无后门, 智能代理, 毕业设计, 目标劫持, 网络安全, 输入过滤, 输出过滤, 逆向工具, 隐私保护