AI-secure/AgentPoison

GitHub: AI-secure/AgentPoison

NeurIPS 2024 论文官方实现，通过针对 RAG 记忆或知识库进行后门触发器优化，实现对 LLM Agent 的红队安全测试与漏洞评估。

Stars: 219 | Forks: 32

## [AgentPoison: 通过记忆或知识库后门投毒对 LLM Agent 进行红队测试](https://billchan226.github.io/AgentPoison) 🔥🔥 最新消息请查看 **[项目主页](https://billchan226.github.io/AgentPoison.html)** ！ [![项目主页](https://img.shields.io/badge/Project-Page-Green)](https://billchan226.github.io/AgentPoison.html) [![Arxiv](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/pdf/2407.12784) [![License: MIT](https://img.shields.io/badge/License-MIT-g.svg)](https://opensource.org/licenses/MIT) [![GitHub Stars](https://img.shields.io/github/stars/BillChan226/AgentPoison?style=social)](https://github.com/BillChan226/AgentPoison/stargazers) 本仓库提供了以下论文的官方 PyTorch 实现：

## :hammer_and_wrench: 安装要安装，请运行以下命令以安装所需的包： ``` git clone https://github.com/BillChan226/AgentPoison.git cd AgentPoison conda env create -f environment.yml conda activate agentpoison ``` ### RAG Embedder 检查点您可以从下方链接下载 embedder 检查点，然后在 `algo/config.yaml` 文件中指定 embedder 检查点的路径。 | Embedder | HF 检查点 | | -------------------- | ------------------- | | [BERT](https://arxiv.org/pdf/1810.04805) | [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) | | [DPR](https://arxiv.org/pdf/2004.04906) | [facebook/dpr-question_encoder-single-nq-base](https://huggingface.co/facebook/dpr-question_encoder-single-nq-base) | | [ANCE](https://arxiv.org/pdf/2007.00808) | [castorini/ance-dpr-question-multi](https://huggingface.co/castorini/ance-dpr-question-multi) | | [BGE](https://arxiv.org/pdf/2310.07554) | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | | [REALM](https://arxiv.org/pdf/2002.08909) | [google/realm-cc-news-pretrained-embedder](https://huggingface.co/google/realm-cc-news-pretrained-embedder) | | [ORQA](https://arxiv.org/pdf/1906.00300) | [google/realm-orqa-nq-openqa](https://huggingface.co/google/realm-orqa-nq-openqa) | 您也可以使用自定义 embedders（例如自己微调的），只要您在 [config](algo/config.py) 中指定它们的标识符和模型路径。 ## :smiling_imp: 触发器优化设置好 embedders 的配置后，您可以使用以下命令对所有三个 agent 运行触发器优化： ``` python algo/trigger_optimization.py --agent ad --algo ap --model dpr-ctx_encoder-single-nq-base --save_dir ./results --ppl_filter --target_gradient_guidance --asr_threshold 0.5 --num_adv_passage_tokens 10 --golden_trigger -w -p ``` 具体来说，参数描述如下： | 参数 | 示例 | 描述 | | -------------------- | ------------------- | ------------- | | `--agent` | `ad` | 指定要进行红队测试的 agent 类型，[`ad`, `qa`, `ehr`] | | `--algo` | `ap` | 要使用的触发器优化算法，[`ap`, `cpa`] | | `--model` | `dpr-ctx_encoder-single-nq-base` | 要优化的目标 RAG embedder，见上方完整列表 | | `--save_dir` | `./result` | 保存优化后的触发器和过程图表的路径 | | `--num_iter` | `1000` | 每次梯度优化运行的迭代次数 | | `--num_grad_iter` | `30` | 梯度累积步数 | | `--per_gpu_eval_batch_size` | `64` | 触发器优化的 batch size | | `--num_cand` | `100` | 每次优化采样的离散 token 数量 | | `--num_adv_passage_tokens` | `10` | 触发器序列中的 token 数量 | | `--golden_trigger` | `False` | 是否以黄金触发器开始（将覆盖 `--num_adv_passage_tokens`） | | `--target_gradient_guidance` | `True` | 是否使用目标模型损失引导 token 更新 | | `--use_gpt` | `False` | 是否通过 MC 采样近似目标模型损失 | | `--asr_threshold` | `0.5` | 目标模型损失的 ASR 阈值 | | `--ppl_filter` | `True` | 是否为 token 采样启用连贯性损失过滤器 | | `--plot` | `False` | 是否绘制 embedding 的过程优化图 | | `--report_to_wandb` | `True` | 是否将结果报告给 wandb | ## :robot: Agent 实验我们修改了 [Agent-Driver](https://github.com/USC-GVL/Agent-Driver)、[ReAct-StrategyQA](https://github.com/Jiuzhouh/Uncertainty-Aware-Language-Agent)、[EHRAgent](https://github.com/wshi83/EhrAgent) 的原始代码以支持更多的 RAG embedders，并添加了数据投毒的接口。我们在[这里](https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K?usp=sharing)为所有三个 agent 提供了统一的数据集访问。具体来说，我们列出了所有三个 agent 的推理命令。 ### :car: Agent-Driver 首先从[这里](https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K?usp=sharing)或原始[数据集主机](https://drive.google.com/drive/folders/1BjCYr0xLTkLDN9DrloGYlerZQC1EiPie)下载相应的数据集。将相应的数据集放在 `agentdriver/data` 中。然后将优化后的触发器 token 放在[这里](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L184)，您也可以在[这里](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L187)确定更多的攻击参数。具体来说，将 `attack_or_not` 设置为 `False` 以获取攻击下的良性效用。然后运行以下脚本进行推理： ``` sh scripts/agent_driver/run_inference.sh ``` 关于 ASR-r、ASR-a 和 ACC 的运动规划结果将直接在程序结束时打印出来。规划轨迹将保存到 `./result`。运行以下命令获取 ASR-t： ``` sh scripts/agent_driver/run_evaluation.sh ``` 我们为红队测试 agent-driver 提供了更多选项，覆盖了**自主 agent 的每个独立组件**，包括 [perception APIs](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L257)、[memory module](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L295)、[ego-states](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L327)、[mission goal](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L341)。您需要按照[这里](https://github.com/USC-GVL/Agent-Driver)的说明，首先使用 [OpenAI's API](https://platform.openai.com/docs/guides/fine-tuning) 基于 GPT-3.5 微调一个运动规划器。或者，我们在[这里](https://huggingface.co/Zhaorun/LLaMA-2-Agent-Driver-Motion-Planner)基于 [LLaMA-3](https://huggingface.co/meta-llama/Meta-Llama-3-8B) 微调了一个运动规划器，这样 agent 推理可以完全离线进行。将[这里](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/planning_agent.py#L58)的 `use_local_planner` 设置为 `True` 以启用此功能。 ### :memo: ReAct-StrategyQA 首先从[这里](https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K?usp=sharing)或 StrategyQA [数据集](https://allenai.org/data/strategyqa)下载相应的数据集。将相应的数据集放在 `ReAct/database` 中。然后将优化后的触发器 token 放在[这里](https://github.com/BillChan226/AgentPoison/blob/4de6c5ac5d3ea01f748aff85b9e8b844a3138eb3/ReAct/run_strategyqa_gpt3.5.py#L112)。运行以下命令使用 GPT 主干进行推理： ``` python ReAct/run_strategyqa_gpt3.5.py --model dpr --task_type adv ``` 同样，使用 LLaMA-3-70b 主干进行推理（您需要先在 [Replicate](https://replicate.com/) 获取 API 密钥以访问 [LLaMA-3](https://replicate.com/meta/meta-llama-3-70b-instruct)）并将其放在[这里](https://github.com/BillChan226/AgentPoison/blob/4de6c5ac5d3ea01f748aff85b9e8b844a3138eb3/ReAct/run_strategyqa_llama3_api.py#L17)。 ``` python ReAct/run_strategyqa_llama3_api.py --model dpr --task_type adv ``` 具体来说，将 `--task_type` 设置为 `adv` 以注入带有触发器的查询，设置为 `benign` 以获取攻击下的良性效用。您也可以通过 `scripts/react_strategyqa` 运行相应的命令。结果将保存到 `--save_dir` 指示的路径。 #### 评估要评估 StrategyQA 的红队测试性能，只需运行以下命令： ``` python ReAct/eval.py -p [RESPONSE_PATH] ``` 其中 `RESPONSE_PATH` 是 response json 文件的路径。 ### :man_health_worker: EHRAgent 首先从[这里](https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K?usp=sharing)下载相应的数据集并将其放在 `EhrAgent/database` 下。然后将优化后的触发器 token 放在[这里](https://github.com/BillChan226/AgentPoison/blob/b8f9d6bb20de5a9fdad0047b85b2645aa9667785/EhrAgent/ehragent/main.py#L90)。运行以下命令使用 GPT/LLaMA3 进行推理： ``` python EhrAgent/ehragent/main.py --backbone gpt --model dpr --algo ap --attack ``` 您可以将 `--backbone` 指定为 `llama3` 以使用 LLaMA3 进行推理，并将 `--attack` 设置为 `False` 以获取攻击下的良性效用。您也可以通过 `scripts/ehragent` 运行相应的命令。结果将保存到 `--save_dir` 指示的路径。 #### 评估要评估 EHRAgent 的红队测试性能，只需运行以下命令： ``` python EhrAgent/ehragent/eval.py -p [RESPONSE_PATH] ``` 其中 `RESPONSE_PATH` 是 response json 文件的路径。请注意，对于每个 agent，您需要运行两次实验，一次使用触发器获取 ASR-r、ASR-a 和 ASR-t，另一次不使用触发器获取 ACC（良性效用）。 ## :book: 致谢如果您使用了 AgentPoison 的数据或代码，请按如下方式引用论文： ``` @inproceedings{chenagentpoison, title={AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases}, author={Chen, Zhaorun and Xiang, Zhen and Xiao, Chaowei and Song, Dawn and Li, Bo}, booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems} } ``` ## :book: 联系方式如果您有任何建议或在复现结果时需要任何帮助，请联系我们。您可以提交 issue 或 pull request，或发送电子邮件至 zhaorun@uchicago.edu。 ## :key: 许可证本仓库遵循 [MIT License](LICENSE)。

标签：AgentPoison, AI安全, Chat Copilot, Embedding攻击, LLM Agent安全, NeurIPS 2024, Petitpotam, PyTorch, RAG安全, Red Canary, 凭据扫描, 后门攻击, 大模型攻防, 密钥管理, 文档安全, 知识库污染, 神经信息处理系统大会, 记忆库后门, 逆向工具