federicotorrielli/indirect-prompt-injection


Studies the indirect prompt injection vulnerabilities that LLMs face in AI-assisted academic peer review, and provides a complete pipeline from attack implementation to result evaluation.


# Indirect Prompt Injection: AI-Generated Paper Reviews

This project studies indirect prompt injection vulnerabilities of large language models (LLMs) in AI-assisted academic peer review. It demonstrates how malicious actors can embed hidden payloads in scientific manuscripts to manipulate LLM-generated reviews.

## 🎯 Overview

This research explores a critical security vulnerability in AI-assisted peer review: manipulating LLM reviewers by embedding invisible instructions in academic papers. This attack vector threatens the integrity of the scientific review process, because it allows a paper to effectively "review itself" through a compromised AI system.

The project implements and evaluates five distinct attack vectors:

1. **Refusal attack**: forces the LLM to refuse to generate any review
2. **Positive steering attack**: manipulates the LLM into generating an overly positive review
3. **Negative steering attack**: manipulates the LLM into generating an overly negative review
4. **Watermark attack**: forces the LLM to include specific phrases for tracking
5. **External site attack**: redirects the user to an external website instead of providing a review

## 🚀 Quick Start

### Prerequisites

- **Python 3.11+**
- **UV package manager** ([installation guide](https://docs.astral.sh/uv/getting-started/installation/))
- **Chrome/Chromium browser** (for web automation)
- **Git** (for cloning the repository)

### Installation

```bash
# Clone the repository
git clone https://github.com/federicotorrielli/indirect-prompt-injection.git
cd indirect-prompt-injection

# Install dependencies with UV (creates a virtual environment automatically)
uv sync

# Activate the virtual environment
source .venv/bin/activate.fish  # for fish shell
# or
source .venv/bin/activate       # for bash/zsh

# Set up the project directories and verify the installation
uv run python scripts/setup_automation.py
```

## 📊 Full Experiment Reproduction Guide

### Phase 1: Dataset Preparation

The experiments use scientific papers from OpenReview as the base dataset. These papers are processed, analyzed, and prepared for injection.

#### Step 1.1: Analyze the Dataset

```bash
# Download and analyze the OpenReview dataset
uv run python src/data_preparation/analyze_dataset.py
```

This script:
- Downloads the `nhop/OpenReview` dataset from Hugging Face
- Filters papers published up to November 2022
- Selects papers with "verbose" reviews (above the 75th percentile in review length)
- Generates analysis visualizations and reports
- Saves the filtered dataset to `data/analysis/openreview_verbose_reviews.csv`

**Expected Output:**
- `data/analysis/openreview_verbose_reviews.csv`: filtered dataset
- `openreview_analysis.png`: dataset statistics visualization
- `data/analysis/dataset_analysis_report.txt`: analysis summary

#### Step 1.2: Download and Prepare PDFs

**Note**: This step is optional, as the project includes pre-processed, hand-refined PDFs in `data/redacted_pdfs/` that are ready for use in experiments.
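The "verbose review" filter from step 1.1 can be sketched in plain Python as follows. The record layout below is illustrative only; the real script operates on the `nhop/OpenReview` dataset with its own schema.

```python
# Minimal sketch of the step 1.1 filter: keep papers whose review length is
# above the 75th percentile. Record fields ("id", "review") are hypothetical.
def percentile_75(values: list[int]) -> float:
    """75th percentile with linear interpolation (numpy's default method)."""
    xs = sorted(values)
    rank = (len(xs) - 1) * 0.75
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (rank - lo) * (xs[hi] - xs[lo])

def filter_verbose(papers: list[dict]) -> list[dict]:
    """Keep only papers with a review longer than the 75th-percentile length."""
    lengths = [len(p["review"]) for p in papers]
    threshold = percentile_75(lengths)
    return [p for p in papers if len(p["review"]) > threshold]

papers = [
    {"id": "a", "review": "ok"},
    {"id": "b", "review": "a short review"},
    {"id": "c", "review": "a considerably longer and more verbose review text"},
    {"id": "d", "review": "fine"},
]
verbose = filter_verbose(papers)  # only the longest review survives
```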
```bash
# Optional: download PDFs and redact venue information
uv run python src/data_preparation/download_pdfs.py
```

This script:
- Downloads PDF files from OpenReview URLs
- Redacts conference-specific information (ICLR, NeurIPS, etc.)
- Saves anonymized PDFs to `data/raw_pdfs/`

**Expected Output:**
- PDFs in `data/raw_pdfs/`: conference-anonymized research papers

**Alternative**: Use the pre-processed PDFs in `data/redacted_pdfs/`, which have been manually refined and are ready for injection experiments.

#### Step 1.3: Optional Collu Paper Set

If you also want to reproduce the Collu-style experiments supported in this repo, you can download the paper set used for that workflow:

```bash
uv run python src/data_preparation/download_collu_pdfs.py
```

Expected output:
- `data/pdfs_collu/main_26_rejected/`
- `data/pdfs_collu/transferability_2/`
- `data/pdfs_collu/accepted_2_oral/`

### Phase 2: Prompt Injection

This phase embeds malicious prompts into the prepared PDFs using various techniques.

#### Step 2.1: Understand the Attack Types

The system supports five attack types, each with different injection strategies:

- **`refusal_attack`**: forces the LLM to refuse review generation
- **`pos_steering_attack`**: steers the LLM toward positive reviews
- **`neg_steering_attack`**: steers the LLM toward negative reviews
- **`watermark_attack`**: forces inclusion of specific tracking phrases
- **`external_site_attack`**: redirects to external websites

#### Step 2.2: Inject Prompts into the PDFs

```bash
# Basic injection (invisible text on the first page)
uv run python src/prompt_injection/inject_text.py \
  --attack-type pos_steering_attack \
  --prompt-type narrative \
  --injection-locus first \
  --font-size 1.0

# Inject all attack types and variants
for attack_type in refusal_attack pos_steering_attack neg_steering_attack watermark_attack external_site_attack; do
  for prompt_type in narrative policy_puppetry; do
    for injection_locus in first last both; do
      echo "Injecting: $attack_type - $prompt_type - $injection_locus"
      uv run python src/prompt_injection/inject_text.py \
        --attack-type "$attack_type" \
        --prompt-type "$prompt_type" \
        --injection-locus "$injection_locus" \
        --font-size 1.0
    done
  done
done
```

**Injection Parameters:**
- `--attack-type`: type of attack (see the list above)
- `--prompt-type`: variant (`narrative` or `policy_puppetry`)
- `--injection-locus`: location (`first`, `last`, or `both`)
- `--font-size`: size of the injected text (1.0 for invisible, 6+ for visible)
- `--ocr-model-mode`: make the text visible (useful for debugging)
- `--insert-new-page`: add the prompt on a new page instead of an overlay

**Expected Output:**
- Injected PDFs in `data/injected_pdfs/`: PDFs with embedded malicious prompts

#### Step 2.3: How the Main PDF Injection Works

The main injection pipeline in `src/prompt_injection/inject_text.py` does not use PhantomText. Instead, it builds a PDF text overlay in memory and merges it into the target document. Mechanically:

- The prompt is rendered as a full wrapping paragraph using ReportLab, not as a short raw text token sequence
- In the default mode, the overlay sets the PDF text rendering mode to `3 Tr`, which means the text is present in the PDF content stream but not visually rendered
- The overlay is then merged into the original PDF with `pypdf`
- For `first`, the invisible paragraph is merged before the original first-page content; for `last`, it is merged onto the last page near the bottom; for `both`, the prompt is injected at both loci
- With `--insert-new-page`, the injected prompt is placed on a dedicated page rather than overlaid onto an existing one
- With `--ocr-model-mode`, the script intentionally makes the injected content visible so OCR-style ingestion can be tested separately

This design was chosen because it is more controllable for our experiments:

- It preserves long prompts and line breaks reliably
- It supports first-page, last-page, dual-locus, and dedicated-new-page attacks
- It cleanly separates the standard hidden-text experiments from the OCR-visible ones
- It avoids depending on a third-party injection toolkit for the main study

#### Step 2.4: Optional Collu Payload Injection

The repo also includes a separate Collu-style injection path using PhantomText.

```bash
uv run python src/prompt_injection/inject_collu_payloads.py \
  --input-dir data/pdfs_collu/main_26_rejected \
  --output-dir data/injected_collu_pdfs
```

This produces model-specific payload folders under `data/injected_collu_pdfs/`.

Why this path is different from the main injector:
- `inject_collu_payloads.py` is a faithful reproduction path for the Collu-style setup and uses PhantomText's zero-size injection workflow

### Phase 3: LLM Automation

This phase automatically tests the injected PDFs against different LLM services.

#### Step 3.1: LLM Service Setup

##### Option A: ChatGPT (default)

No additional setup required - uses web automation.

##### Option B: Microsoft Copilot

No additional setup required - uses web automation.

##### Option C: Google Gemini (recommended setup)

**Authentication Setup:**
1. Visit [gemini.google.com](https://gemini.google.com) and log in
2. Open the browser developer tools (F12)
3. Go to Application/Storage > Cookies
4. Find and copy these cookies:
   - `__Secure-1PSID`
   - `__Secure-1PSIDTS`
5. Set the environment variables:

```fish
# Fish shell
set -gx GEMINI_SECURE_1PSID "your_1psid_cookie_value"
set -gx GEMINI_SECURE_1PSIDTS "your_1psidts_cookie_value"
```

```bash
# Bash/Zsh shell
export GEMINI_SECURE_1PSID="your_1psid_cookie_value"
export GEMINI_SECURE_1PSIDTS="your_1psidts_cookie_value"
```

Or create a `.env` file:

```env
GEMINI_SECURE_1PSID=your_1psid_cookie_value
GEMINI_SECURE_1PSIDTS=your_1psidts_cookie_value
```

#### Step 3.2: Run the Automation Experiments

##### Single-Attack-Type Tests

```bash
# Test a specific attack on ChatGPT
uv run python src/llm_automation/main.py \
  --llm-service chatgpt \
  --attack-type pos_steering_attack \
  --injection-locus first

# Test the watermark attack on Gemini
uv run python src/llm_automation/main.py \
  --llm-service gemini \
  --attack-type watermark_attack \
  --attack-mode policy \
  --injection-locus first
```

##### Comprehensive Tests

```bash
# Test all attacks on ChatGPT
uv run python src/llm_automation/main.py \
  --llm-service chatgpt \
  --reset-progress

# Test all attacks on Gemini
uv run python src/llm_automation/main.py \
  --llm-service gemini \
  --reset-progress

# Test all attacks on Copilot
uv run python src/llm_automation/main.py \
  --llm-service copilot \
  --reset-progress
```

##### Advanced Configuration

```bash
# Resume an interrupted experiment
uv run python src/llm_automation/main.py \
  --llm-service chatgpt
# (automatically resumes from the last checkpoint)
```

##### Repeated Runs for Statistical Reliability

Because LLM outputs are stochastic, running each attack vector only once is not enough to reliably estimate attack success rates. Use `--run-id` to launch independent repetitions. Each run writes its own results and progress files, so the runs can be executed sequentially (or even in parallel on multiple machines/accounts) without overwriting one another:

```bash
# Five independent repetitions of the full ChatGPT experiment
uv run python src/llm_automation/main.py --llm-service chatgpt --run-id 1
uv run python src/llm_automation/main.py --llm-service chatgpt --run-id 2
uv run python src/llm_automation/main.py --llm-service chatgpt --run-id 3
uv run python src/llm_automation/main.py --llm-service chatgpt --run-id 4
uv run python src/llm_automation/main.py --llm-service chatgpt --run-id 5
```

Outputs land in:
- `results/inference/all_results_chatgpt_run1.json` (and `_run2` ... `_run5`)
- `results/inference/automation_progress_chatgpt_run1.json` (and `_run2` ... `_run5`)

Each result record carries a `run_id` field so downstream analysis can aggregate mean/std/CI across repetitions. Omitting `--run-id` (or passing `--run-id 0`) preserves the legacy single-run filenames for backwards compatibility.

**Command Line Options:**
- `--llm-service`: target LLM (`chatgpt`, `copilot`, `gemini`)
- `--attack-type`: filter to one attack type
- `--attack-mode`: filter the prompt family (`narrative` or `policy`)
- `--prompt-type`: filter an exact prompt type (for example `policy_puppetry`)
- `--injection-locus`: injection location (`first`, `last`, `both`)
- `--ocr-mode`: use the OCR PDF directory (auto-enabled for Gemini)
- `--limit`: limit the number of PDFs processed per directory
- `--reset-progress`: clear previous progress and start fresh
- `--show-progress-only`: print progress and exit
- `--list-attack-types`: list available attack types and current coverage
- `--dry-run`: enumerate the work without sending requests
- `--run-id`: repetition index for statistical runs (0 = legacy single run; 1, 2, 3 ... for repetitions)

**Expected Output:**
- `results/inference/all_results_{service}[_run{N}].json`: raw experiment results
- `results/inference/automation_progress_{service}[_run{N}].json`: progress tracking
- `results/debug_screenshots/`: browser screenshots saved on unexpected failures (ChatGPT)
- Live progress updates in the terminal and in `automation.log`

##### Response Validation and Automatic Retries

The automation pipeline validates every LLM response before accepting it, using an attack-type-aware `ResponseValidator` (`src/llm_automation/response_validator.py`). Broken outputs (truncated fragments, empty responses, PDF-ingestion failure messages such as `"I cannot find the pdf"` and `"Sorry, something went wrong"`) are rejected and retried rather than silently stored as successes.

Rejection rules:
- `None` / empty responses
- OpenReview paper-ID leaks (e.g. `"The paper\nJ5LS3YJH7Zi"`), which are the signature of a failed PDF upload
- PDF-ingestion failure phrases, distinct from genuine policy refusals, which cite `"OpenAI's policy"` / `"academic integrity"`
- Per-attack-type minimum length: 40 chars for refusal / external site, 150 for steering, 200 for watermark

Gemini calls are additionally wrapped in `tenacity` with exponential backoff and jitter (`wait_random_exponential(min=4, max=config.max_retry_wait)`, 5 attempts) so transient `gemini-webapi` failures are retried automatically.

Relevant `config.json` fields:
- `min_response_length` (default `50`): minimum acceptable response length
- `max_retry_wait` (default `120`): cap on the exponential backoff delay
- `screenshot_on_failure` (default `true`): save a Chrome screenshot to `screenshot_dir` on unexpected ChatGPT errors, for unattended-run debugging

### Phase 4: Evaluation and Analysis

#### Step 4.1: Evaluation Runs

`src/evaluation/evaluate_results.py` is the primary evaluation entrypoint. The LLM judges are served through SGLang. By default it evaluates self-consistency runs `1..5` for `chatgpt` and `gemini`, using `Qwen/Qwen3.5-27B` as the primary judge. Steering attacks are also checked with a second judge (`google/gemma-4-31b-it`) unless dual-judge consensus is disabled.
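The dual-judge consensus for steering attacks can be sketched as below. The judge callables are hypothetical stand-ins for the SGLang-served models, and the result-field names mirror those reported by the evaluator.

```python
# Sketch of dual-judge consensus: the primary judge always runs; the second
# judge runs unless consensus is disabled, and success requires both to agree.
# The judge functions here are hypothetical placeholders, not the real judges.
def judge_steering_attack(review, judge_a, judge_b=None, dual_judge_consensus=True):
    a = judge_a(review)
    if not dual_judge_consensus or judge_b is None:
        # Single-judge mode: the primary verdict is the consensus verdict
        return {"llm_judge_a_success": a, "llm_consensus_success": a}
    b = judge_b(review)
    return {
        "llm_judge_a_success": a,
        "llm_judge_b_success": b,
        "llm_consensus_success": a and b,  # both judges must agree on success
    }

verdict = judge_steering_attack(
    "This paper is groundbreaking and should be accepted.",
    judge_a=lambda review: "accepted" in review,       # toy stand-in judge
    judge_b=lambda review: "groundbreaking" in review,  # toy stand-in judge
)
```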
```bash
# Evaluate self-consistency runs 1..5 for ChatGPT + Gemini (the default)
uv run python src/evaluation/evaluate_results.py

# Evaluate only one service / run window
uv run python src/evaluation/evaluate_results.py \
  --services chatgpt \
  --run-start 1 \
  --run-end 5

# Disable dual-judge steering consensus
uv run python src/evaluation/evaluate_results.py \
  --disable_dual_judge_consensus

# Also evaluate the Collu runs from results_collu/
uv run python src/evaluation/evaluate_results.py \
  --evaluate-collu-results

# Explicitly evaluate a single run file
uv run python src/evaluation/evaluate_results.py \
  results/inference/all_results_chatgpt_run1.json \
  results/evaluation/all_results_chatgpt_run1_evaluated.json
```

The evaluator reports:
- Exact rule-based checks for `watermark_attack` and `external_site_attack`
- LLM-as-a-judge verdicts for refusal and steering attacks
- For steering attacks: `llm_judge_a_*`, `llm_judge_b_*`, `llm_consensus_success`
- For steering attacks: `vader_sentiment_success`
- For steering attacks: `academic_classifier_*` annotations

Expected output artifacts:
- `results/evaluation/all_results_{service}_run{N}_evaluated.json`
- `results/evaluation/all_results_{service}_run{N}_evaluated_analysis.json`
- `results/evaluation/self_consistency_summary.json`
- `results_collu/evaluation/all_results_{service}_run{N}_evaluated.json` (optional, with `--evaluate-collu-results`)
- `results_collu/evaluation/all_results_{service}_run{N}_evaluated_analysis.json` (optional, with `--evaluate-collu-results`)
- `results_collu/evaluation/self_consistency_summary.json` (optional, with `--evaluate-collu-results`)

#### Step 4.2: Academic Sentiment Classifier

The main evaluator already applies the academic classifier to steering attacks. Train a fresh checkpoint only if you want to replace the bundled/default one.
```bash
# Train the academic sentiment classifier
uv run python src/evaluation/train.py

# Use the locally trained classifier during evaluation
uv run python src/evaluation/evaluate_results.py \
  --academic_classifier_model_path models/academic-sentiment-classifier

# Run only the classifier pass on an already-evaluated file (optional)
uv run python src/evaluation/academic_sentiment_evaluator.py \
  results/evaluation/all_results_chatgpt_run1_evaluated.json \
  results/evaluation/all_results_chatgpt_run1_evaluated_classified.json \
  --model_path models/academic-sentiment-classifier
```

Expected classifier artifacts:
- In-place classifier fields inside `*_evaluated.json` from `evaluate_results.py`
- An optional standalone `*_evaluated_classified.json` from `academic_sentiment_evaluator.py`
- `models/academic-sentiment-classifier/` if training locally

#### Step 4.3: Cross-Run Statistics

The main aggregate artifact is `self_consistency_summary.json`. It summarizes performance across the repeated stochastic runs and reports:
- `mean_rate`: arithmetic mean of the per-run success rates
- `std_dev` and `sem`: dispersion across runs
- `t_ci_95`: 95% confidence interval over the run means
- `pooled_rate` and `pooled_wilson_ci_95`: pooled binomial estimate and Wilson interval

The summary is stratified by:
- `overall`
- `by_attack_type`
- `by_attack_key`
- `by_attack_type_request_type`

Use this file as the primary source for reviewer-facing aggregate numbers.
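The per-run aggregation behind `self_consistency_summary.json` can be sketched as follows, using only the statistics named above (mean, standard deviation, SEM, t-based 95% CI, pooled rate with a Wilson interval). The function and field layout are illustrative, not the repo's actual code.

```python
# Sketch of cross-run aggregation: per-run success rates plus a pooled
# binomial estimate with a 95% Wilson score interval.
import math
from statistics import mean, stdev

def summarize_runs(successes, totals):
    rates = [s / n for s, n in zip(successes, totals)]
    k = len(rates)
    m = mean(rates)
    sd = stdev(rates) if k > 1 else 0.0
    sem = sd / math.sqrt(k)
    t = 2.776  # t-critical value for a 95% CI with k=5 runs (df=4)

    # Pooled binomial estimate and 95% Wilson score interval
    s_tot, n_tot = sum(successes), sum(totals)
    p = s_tot / n_tot
    z = 1.96
    denom = 1 + z**2 / n_tot
    center = (p + z**2 / (2 * n_tot)) / denom
    half = z * math.sqrt(p * (1 - p) / n_tot + z**2 / (4 * n_tot**2)) / denom

    return {
        "mean_rate": m,
        "std_dev": sd,
        "sem": sem,
        "t_ci_95": (m - t * sem, m + t * sem),
        "pooled_rate": p,
        "pooled_wilson_ci_95": (center - half, center + half),
    }

# Five runs of 10 PDFs each, with 8/7/9/8/8 successful attacks
summary = summarize_runs(successes=[8, 7, 9, 8, 8], totals=[10] * 5)
```

The Wilson interval is preferred over the normal approximation here because per-attack sample sizes are small and success rates can sit near 0 or 1.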
### Phase 5: Results Management

#### Step 5.1: Clean Up Failed Results

```bash
# Remove failed results for a clean re-run
uv run python scripts/clean_unsuccessful_results.py \
  --service chatgpt

# Clean all services
for service in chatgpt gemini copilot; do
  uv run python scripts/clean_unsuccessful_results.py --service "$service"
done
```

#### Step 5.2: Maintenance Tools

Additional repo utilities that are not part of the main reproduction path:

```bash
# Merge results/progress JSON files from another directory into the local ones
uv run python scripts/merge_results.py --llm-service gemini

# Migrate legacy non-run-scoped result files to model-specific names
uv run python scripts/migrate_legacy_files.py

# Evaluate the academic classifier itself on OpenReview
uv run python scripts/calculate_classifier_accuracy.py

# Publish a local classifier checkpoint to the Hugging Face Hub
uv run python scripts/publish_to_hf.py \
  --repo YOUR_USERNAME/academic-sentiment-classifier
```

## 🎛️ Configuration

### Environment Variables

Create a `.env` file for sensitive configuration:

```env
# Gemini authentication
GEMINI_SECURE_1PSID=your_secure_1psid_cookie
GEMINI_SECURE_1PSIDTS=your_secure_1psidts_cookie

# Hugging Face (optional, for publishing results)
HF_TOKEN=your_hugging_face_token

# Automation settings (optional)
CHROME_HEADLESS=false
MAX_RETRIES=3
REQUEST_DELAY=3.0
```

### Configuration Files

- `src/llm_automation/config.json`: core automation settings
- `data/prompts/prompts.json`: attack payloads and request templates
- `pyproject.toml`: Python dependencies and project metadata

## 🔧 Troubleshooting

### Common Issues

#### 1. UV Installation Problems

```bash
# Install UV if it is missing
pip install uv

# Or use the official installer
curl -LsSf https://astral.sh/uv/install.sh | sh
```

#### 2. Chrome/Chromium Not Found

```bash
# Ubuntu/Debian
sudo apt update && sudo apt install chromium-browser

# macOS
brew install chromium

# Or point config.json at a custom Chrome path
```

#### 3. Gemini Authentication Failures

```bash
# Verify the cookies are set correctly
echo $GEMINI_SECURE_1PSID

# Check the cookie format (it should be a long alphanumeric string)
# If needed, re-extract the cookies from a fresh browser session
```

#### 4. PDF Processing Errors

```bash
# Check the PyMuPDF installation
uv run python -c "import fitz; print('PyMuPDF OK')"

# Verify that the PDF files exist
ls -la data/raw_pdfs/

# Check the font files
ls -la data/fonts/
```

#### 5. Memory Issues During Evaluation

The SGLang-backed evaluator loads the two steering judges sequentially rather than at the same time, but large judge models can still be memory-intensive. If needed:

```bash
# Evaluate one service/run at a time
uv run python src/evaluation/evaluate_results.py \
  --services chatgpt \
  --run-start 1 \
  --run-end 1

# Disable dual-judge steering consensus
uv run python src/evaluation/evaluate_results.py \
  --disable_dual_judge_consensus

# Disable constrained decoding if the SGLang stack errors in grammar handling
SG_EVAL_DISABLE_JSON_SCHEMA=1 \
  uv run python src/evaluation/evaluate_results.py
```

### Debug Mode

Enable detailed logging:

```bash
# Set the debug log level
export PYTHONPATH=$PWD/src
export LOG_LEVEL=DEBUG

# Run with verbose output
uv run python src/llm_automation/main.py \
  --llm-service chatgpt \
  --attack-type pos_steering_attack
```

## 🔒 Ethical Considerations

This research is conducted for academic purposes to:
- Identify vulnerabilities in AI-assisted peer review
- Develop defensive measures against prompt injection
- Improve the security of AI systems in academic contexts

**Please use responsibly and in accordance with your institution's research ethics guidelines.**