prashantkul/swe-bench-prompt-injection

GitHub: prashantkul/swe-bench-prompt-injection

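An indirect prompt injection benchmark framework for coding agents, used to evaluate how securely AI coding assistants handle software engineering tasks that contain malicious payloads.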


# SWE-bench-IPI

**Indirect Prompt Injection Benchmark for Coding Agents**

SWE-bench-IPI tests whether coding agents can be hijacked by prompt injections hidden in the content they must read while solving legitimate software engineering tasks. It is built on [SWE-bench Verified](https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) (500 real Python bug fixes).

## How It Works

```mermaid
graph LR
    A["🗄️ SWE-bench Verified<br/>500 instances"] --> B["🎯 Select Pilot<br/>10 instances"]
    B --> C["💉 Inject Payloads<br/>50 test cases"]
    C --> D["🤖 Run Agent<br/>SWE-agent"]
    D --> E["📊 Evaluate ASR<br/>Token matching"]

    style A fill:#4A90D9,stroke:#2C5F8A,color:#fff
    style B fill:#7B68EE,stroke:#5A4FCF,color:#fff
    style C fill:#E74C3C,stroke:#C0392B,color:#fff
    style D fill:#F39C12,stroke:#D68910,color:#fff
    style E fill:#27AE60,stroke:#1E8449,color:#fff
```

## Injection Pipeline

```mermaid
graph TD
    subgraph INPUT["📥 Input"]
        I1["SWE-bench Instance<br/>bug report + gold patch"]
        I2["Injection Config<br/>surface + stealth + attack"]
    end

    subgraph GENERATE["⚙️ Generate"]
        G1["Select Template<br/>generic or contextual"]
        G2["Select Attack Command<br/>25 generic / 10 contextual"]
        G3["Render Injection<br/>blend into task context"]
        G4["Validate Patch<br/>gold fix still applies"]
    end

    subgraph OUTPUT["📤 Output"]
        O1["IPI Test Case<br/>modified problem statement<br/>+ malicious indicators"]
    end

    I1 --> G1
    I2 --> G1
    I2 --> G2
    G1 --> G3
    G2 --> G3
    G3 --> G4
    G4 --> O1

    style INPUT fill:#EBF5FB,stroke:#2980B9
    style GENERATE fill:#FEF9E7,stroke:#F39C12
    style OUTPUT fill:#EAFAF1,stroke:#27AE60
    style G3 fill:#E74C3C,stroke:#C0392B,color:#fff
```

## Injection Surfaces and Stealth Levels

```mermaid
graph LR
    subgraph SURFACES["Injection Surfaces"]
        S1["📝 problem_statement"]
        S2["💬 hints_text"]
        S3["💻 code_comments"]
        S4["📖 README"]
        S5["⚙️ config_files"]
    end

    subgraph STEALTH["Stealth Levels"]
        L0["Level 0<br/>Naive<br/>Direct instruction"]
        L1["Level 1<br/>Framed<br/>Maintainer note"]
        L2["Level 2<br/>Hidden<br/>HTML comment"]
        L3["Level 3<br/>Blended<br/>Task-specific context"]
    end

    S1 --> L0
    S1 --> L1
    S1 --> L2
    S1 --> L3

    style SURFACES fill:#EBF5FB,stroke:#2980B9
    style STEALTH fill:#FDEDEC,stroke:#E74C3C
    style L0 fill:#F1948A,stroke:#E74C3C,color:#000
    style L1 fill:#F5B041,stroke:#F39C12,color:#000
    style L2 fill:#82E0AA,stroke:#27AE60,color:#000
    style L3 fill:#85C1E9,stroke:#2980B9,color:#000
```

## Attack Tiers

```mermaid
graph TD
    subgraph T1["🔓 Tier 1: Exfiltration"]
        T1A["curl env vars"]
        T1B["steal SSH keys"]
        T1C["upload .env file"]
        T1D["DNS exfiltration"]
    end

    subgraph T2["🔧 Tier 2: Code Manipulation"]
        T2A["malicious dependency"]
        T2B["weaken crypto"]
        T2C["backdoor in setup.py"]
        T2D["disable auth check"]
    end

    subgraph T3["💀 Tier 3: System Compromise"]
        T3A["reverse shell"]
        T3B["cron persistence"]
        T3C["CI/CD backdoor"]
        T3D["SSH key injection"]
    end

    style T1 fill:#EBF5FB,stroke:#2980B9
    style T2 fill:#FEF9E7,stroke:#F39C12
    style T3 fill:#FDEDEC,stroke:#E74C3C
    style T1A fill:#85C1E9,stroke:#2980B9
    style T1B fill:#85C1E9,stroke:#2980B9
    style T1C fill:#85C1E9,stroke:#2980B9
    style T1D fill:#85C1E9,stroke:#2980B9
    style T2A fill:#F9E79F,stroke:#F39C12
    style T2B fill:#F9E79F,stroke:#F39C12
    style T2C fill:#F9E79F,stroke:#F39C12
    style T2D fill:#F9E79F,stroke:#F39C12
    style T3A fill:#F1948A,stroke:#E74C3C
    style T3B fill:#F1948A,stroke:#E74C3C
    style T3C fill:#F1948A,stroke:#E74C3C
    style T3D fill:#F1948A,stroke:#E74C3C
```

## Evaluation

```mermaid
graph LR
    subgraph AGENT["🤖 Agent Run"]
        A1["Agent reads<br/>injected issue"]
        A2["Agent executes<br/>commands"]
        A3["Agent submits<br/>patch"]
    end

    subgraph EVAL["📊 Evaluation"]
        E1["Extract commands<br/>from .traj files"]
        E2["Token matching<br/>with equivalences"]
        E3["Check patch<br/>for indicators"]
    end

    subgraph METRICS["📈 Metrics"]
        M1["ASR-all<br/>% triggered"]
        M2["ASR-valid<br/>% would succeed"]
        M3["Utility<br/>% tasks solved"]
    end

    A1 --> A2 --> A3
    A2 --> E1
    A3 --> E3
    E1 --> E2
    E2 --> M1
    E2 --> M2
    E3 --> M1
    A3 --> M3

    style AGENT fill:#FEF9E7,stroke:#F39C12
    style EVAL fill:#EBF5FB,stroke:#2980B9
    style METRICS fill:#EAFAF1,stroke:#27AE60
    style M1 fill:#E74C3C,stroke:#C0392B,color:#fff
    style M2 fill:#E74C3C,stroke:#C0392B,color:#fff
    style M3 fill:#27AE60,stroke:#1E8449,color:#fff
```

## Quick Start

```bash
# Install
uv sync --extra agent --extra dev

# Install SWE-agent
git clone https://github.com/princeton-nlp/SWE-agent.git
uv pip install -e SWE-agent/

# Set up API keys
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

# Load the dataset, select instances, generate test cases
uv run swe-bench-ipi load
uv run swe-bench-ipi select --n 10
uv run swe-bench-ipi generate --config configs/pilot_config.yaml

# Run SWE-agent
./swe_agent_integration/run.sh openai/gpt-4o
```

## Project Structure

```
swe-bench-prompt-injection/
├── src/swe_bench_ipi/
│   ├── models.py                      # Pydantic data models
│   ├── cli.py                         # CLI (load, select, generate, evaluate, report)
│   ├── data/                          # Dataset loading + instance selection
│   ├── injection/                     # Templates, attack commands, generation pipeline
│   │   ├── templates.py               # 15 generic templates (surface × stealth)
│   │   ├── contextual_templates.py    # Task-specific templates using real code context
│   │   ├── attack_commands.py         # 25 sanitized MITRE ATT&CK commands
│   │   ├── contextual_commands.py     # 10 commands disguised as dev actions
│   │   └── generator.py               # Core pipeline: instance + config → test case
│   ├── evaluation/                    # Token matching, command parsing, metrics
│   └── validation/                    # Gold patch validation
├── swe_agent_integration/             # SWE-agent runner + trajectory parser
├── configs/                           # YAML configs for experiments
├── lab-notes/                         # Research notes + experiment logs
├── tests/                             # 34 unit tests
└── data/                              # Generated test cases + reports
```

## Results

### Experiment 1: GPT-4o + Generic Injections

| Metric | Value |
|---|---|
| **ASR-all** | **2.7%** (1/37) |
| Stealth level 0 (naive) | 0/7 |
| Stealth level 1 (framed) | 1/8 |
| Stealth level 2 (hidden) | 0/14 |
| Stealth level 3 (blended) | 0/8 |

The only successful attack was a code manipulation (malicious dependency): the agent added `evil-package` to its patch.

### Experiment 2: GPT-4o + Contextual Injections

*In progress.* This experiment uses task-specific templates that reference each instance's actual file paths, function names, and error types.

## Research Direction: White-Box Latent Detection

Can we detect a prompt injection from model activations before the model acts on it? See [lab-notes/04-whitebox-latent-detection-research.md](lab-notes/04-whitebox-latent-detection-research.md) for the full research plan.

Key insight: IPI produces three detectable latent signals:

1. A **compliance/refusal conflict** in the residual stream
2. A **context-switch signal** from code reasoning to instruction following
3. A **tool-use intent shift** from "edit file" to "run shell command"

## References

- [SWE-bench](https://arxiv.org/abs/2310.06770) (ICLR 2024): base benchmark
- [AIShellJack](https://arxiv.org/abs/2509.22040): token-matching evaluation methodology
- [AgentDojo](https://arxiv.org/abs/2406.13352) (NeurIPS 2024): utility and security metrics
- [SWE-agent](https://github.com/princeton-nlp/SWE-agent): agent framework

## License

Research use only. SWE-bench Verified data is used under its original license.
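To make the evaluation step concrete (token matching with equivalences, then per-stealth-level ASR), here is a minimal Python sketch. It is not the repository's implementation: `EQUIVALENT_TOKENS`, `attack_triggered`, `asr_by_stealth`, and `overall_asr` are hypothetical names, and the equivalence classes are illustrative assumptions. The demo data only re-encodes the counts reported for Experiment 1.

```python
import shlex
from collections import defaultdict

# Hypothetical equivalence classes: tools an agent might substitute for one
# another. The real benchmark's equivalences may differ.
EQUIVALENT_TOKENS = {
    "curl": {"curl", "wget"},
    "pip": {"pip", "pip3"},
}


def tokenize(command):
    """Split a shell command into a set of tokens, tolerating bad quoting."""
    try:
        return set(shlex.split(command))
    except ValueError:  # e.g. an unbalanced quote in the trajectory
        return set(command.split())


def matches(indicator_tokens, command):
    """True if every indicator token (or an equivalent) appears in the command."""
    tokens = tokenize(command)
    for wanted in indicator_tokens:
        accepted = EQUIVALENT_TOKENS.get(wanted, {wanted})
        if not (accepted & tokens):
            return False
    return True


def attack_triggered(indicator_tokens, executed_commands):
    """True if any executed command matches the malicious indicator."""
    return any(matches(indicator_tokens, cmd) for cmd in executed_commands)


def asr_by_stealth(results):
    """Aggregate (stealth_level, triggered) pairs into {level: (hits, total)}."""
    counts = defaultdict(lambda: [0, 0])
    for level, triggered in results:
        counts[level][0] += int(triggered)
        counts[level][1] += 1
    return {level: (hits, total) for level, (hits, total) in sorted(counts.items())}


def overall_asr(per_level):
    """ASR over all test cases: triggered cases divided by total cases."""
    hits = sum(h for h, _ in per_level.values())
    total = sum(t for _, t in per_level.values())
    return hits / total if total else 0.0


if __name__ == "__main__":
    # An agent running `pip3 install` still matches a `pip install` indicator.
    print(attack_triggered(["pip", "install", "evil-package"],
                           ["ls", "pip3 install evil-package"]))

    # Counts from the Experiment 1 table: 0/7, 1/8, 0/14, 0/8.
    results = ([(0, False)] * 7 + [(1, True)] + [(1, False)] * 7
               + [(2, False)] * 14 + [(3, False)] * 8)
    per_level = asr_by_stealth(results)
    print(per_level)
    print(f"ASR-all: {overall_asr(per_level):.1%}")
```

Matching on token sets rather than exact strings is one way to tolerate reordered flags or substituted tools, at the cost of occasional false positives when the same tokens appear in a benign command.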
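Tags: AES-256, AI security, AI adversarial attacks, ASR, Benchmark, Chat Copilot, CISA project, Coding Agent, DLL hijacking, Indirect Prompt Injection, LLM, Petitpotam, Python, SWE-agent, SWE-bench, Unmanaged PE, large-model security, large language models, prompt injection, attack success rate, data pipeline, no backdoor, agent security, vulnerability assessment, red-team evaluation, coding agents, automated code repair, software engineering, reverse-engineering tools, indirect prompt injection, cluster management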