facebookresearch/Meta_SecAlign

GitHub: facebookresearch/Meta_SecAlign

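An open-source secure foundation model from Meta that gives large language models a built-in defense against prompt injection attacks through the SecAlign++ training recipe.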

Stars: 70 | Forks: 18

# Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

[Sizhe Chen](https://sizhe-chen.github.io)\*, [Arman Zharmagambetov](https://arman-z.github.io), [David Wagner](https://people.eecs.berkeley.edu/~daw), [Chuan Guo](https://sites.google.com/view/chuanguo)\* (\* equal technical contribution)

🔥 The Meta-SecAlign models are now licensed for commercial use under the [Llama community license](https://www.llama.com/llama3_3/license), although this codebase is licensed for non-commercial use only.

[![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2507.02735) [![](https://img.shields.io/badge/Meta%20SecAlign-8B-FFD21E)](https://huggingface.co/facebook/Meta-SecAlign-8B) [![](https://img.shields.io/badge/Meta%20SecAlign-70B-FFD21E)](https://huggingface.co/facebook/Meta-SecAlign-70B) [![](https://img.shields.io/badge/Poster-1b6535)](https://drive.google.com/file/d/1JbbgKPQVQ-Pa5LVYWyR4Eo5ckNyrZiPw/view?usp=sharing) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1Xy_njupWCAN56NMsQV22hD7uShg5oBP8/view?usp=sharing)

Meta-SecAlign-70B is the first fully open-source, commercial-grade LLM with a built-in prompt injection defense. Its agentic (tool/web) security is comparable to that of gpt-5 and gemini-3-pro. Based on the most comprehensive evaluation to date, our SoTA training recipe causes no noticeable degradation on any of the utility scores.

# Environment Setup

+ Hardware requirements: Meta-SecAlign-8B requires 4×80 GB A100s for training and one 16 GB GPU for evaluation. Meta-SecAlign-70B requires 8×141 GB H200s for training and four 80 GB A100s for evaluation (we recommend eight for efficiency).
+ Install [uv](https://docs.astral.sh/uv/getting-started/installation/) (a Python package manager).
+ Install the Meta-SecAlign package dependencies.
+ Install the Meta-SecAlign data dependencies (including those for SEP utility evaluation, if you have a GPU available).
+ Configure your OpenAI keys (used for utility evaluation) in `data/openai_configs.yaml`. That file contains an example of accessing the OpenAI API through AzureOpenAI. A more detailed example is available [here](https://raw.githubusercontent.com/tatsu-lab/alpaca_eval/refs/heads/main/client_configs/openai_configs_example.yaml).
+ [Optional] To evaluate Gemini models, configure your Gemini keys in `data/gemini_configs.yaml`.

# Demo

+ `demo.py` contains minimal code for using our two Meta-SecAlign models. Feel free to try new samples and prompt injections, or to test the models on your own codebase.

# Evaluation

+ `run_tests.py` contains the commands for reproducing the evaluation results reported in our paper. It sequentially invokes `tests.py`, `test_lm_eval.py`, `test_agentdojo.py`, and `test_injecagent.py`. Results are logged to `[model_path]/summary.tsv`.
+ `model_path` is the path to the model under test. We support:
  + Local models ([vLLM](https://docs.vllm.ai/) inference)
    + `meta-llama/Llama-3.1-8B-Instruct_SecAlign` ([Meta-SecAlign-8B](https://huggingface.co/facebook/Meta-SecAlign-8B), downloaded by `setup.py`): the first fully open model with a state-of-the-art prompt injection defense
    + `meta-llama/Llama-3.3-70B-Instruct_SecAlign` ([Meta-SecAlign-70B](https://huggingface.co/facebook/Meta-SecAlign-70B), downloaded by `setup.py`): the first fully open model with a state-of-the-art prompt injection defense
    + `meta-llama/Llama-3.1-8B-Instruct`
    + `meta-llama/Llama-3.3-70B-Instruct`
    + Other open-weight Hugging Face models may also be supported natively.
  + OpenAI GPT models
    + `gpt-4o-mini`: the first [commercial model](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) with the [instruction hierarchy](https://arxiv.org/pdf/2404.13208) prompt injection defense.
    + `gpt-4o`: the follow-up flagship model, also with a [prompt injection defense](https://openai.com/safety/evaluations-hub/).
    + `gpt-5`: the latest and most secure commercial model in our evaluation. Change the reasoning level with `--gpt5_reasoning_effort` (default: `high`).
  + Google Gemini models
    + `gemini-2.0-flash`: a Google commercial model with a [claimed prompt injection defense](https://arxiv.org/pdf/2505.14534)
    + `gemini-2.5-flash`: a Google commercial model with a [claimed prompt injection defense](https://arxiv.org/pdf/2505.14534)
    + `gemini-2.0-pro`: a Google flagship model (no claimed prompt injection defense)
    + `gemini-2.5-pro`: a Google flagship model (no claimed prompt injection defense)
    + `gemini-3-pro-preview`: the state-of-the-art Google model, with a strong prompt injection defense
+ [Optional] `lora_alpha` is a test-time hyperparameter for Meta-SecAlign models. It defaults to 8, which uses the Meta-SecAlign model exactly as trained. A `lora_alpha` value between 0 and 8 interpolates between the undefended model and our defended model, allowing a flexible utility-security trade-off. Extrapolating `lora_alpha` beyond 8 is possible but untested.
+ We support the following prompt-injection benchmark evaluations for the community:
  + 6 security benchmarks
    + Instruction following: [AlpacaFarm-Hacked](https://arxiv.org/pdf/2402.06363), [SEP](https://arxiv.org/pdf/2403.06833), [TaskTracker](https://arxiv.org/pdf/2406.00799), [CyberSecEval2](https://ai.meta.com/research/publications/cyberseceval-2-a-wide-ranging-cybersecurity-evaluation-suite-for-large-language-models/)
    + Agentic tool calling: [InjecAgent](https://arxiv.org/pdf/2403.02691), [AgentDojo](https://arxiv.org/pdf/2406.13352)
  + 8 utility benchmarks
    + General knowledge (from [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness)): [MMLU](https://arxiv.org/pdf/2009.03300), [MMLU-Pro](https://arxiv.org/pdf/2406.01574), [BBH](https://arxiv.org/pdf/2210.09261), [IFEval](https://arxiv.org/pdf/2311.07911), [GPQA Diamond](https://arxiv.org/pdf/2311.12022)
    + Instruction following: [AlpacaEval2](https://arxiv.org/pdf/2404.04475), [SEP](https://arxiv.org/pdf/2403.06833) (for SEP, we use the AlpacaEval2 prompt to compare against reference responses from `meta-llama/Meta-Llama-3-8B-Instruct`)
    + Agentic tool calling: [AgentDojo](https://arxiv.org/pdf/2406.13352)

# Defensive Fine-Tuning (SecAlign++)

+ `secalign_plus_plus.py` provides the commands for defensively fine-tuning `meta-llama/Llama-3.1-8B-Instruct` (the default) or `meta-llama/Llama-3.3-70B-Instruct` (uncomment the relevant lines to fine-tune it) into a robust LoRA model using our training recipe, SecAlign++.

# Code Acknowledgements

This project is a significant improvement over [SecAlign](https://github.com/facebookresearch/SecAlign). The majority of the Meta-SecAlign code is licensed under CC-BY-NC. Portions of the project are available under separate license terms: [AgentDojo](https://github.com/ethz-spylab/agentdojo), [TaskTracker](https://github.com/microsoft/TaskTracker), and [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) are licensed under MIT. Code from other repositories includes AgentDojo (agentdojo), TaskTracker (`setup.py`), and lm_eval_harness (`lm_eval_config`). This software and/or data was deposited in the BAIR Open Research Commons repository in 2025.
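
The `lora_alpha` interpolation mentioned under Evaluation can be illustrated with a toy sketch. It assumes the standard LoRA formulation, in which the adapter update is scaled by `lora_alpha / rank` before being added to the base weight. The matrices, the rank, and the helper names `matmul` and `effective_weight` are hypothetical and are not taken from this repository; how the repository itself applies `lora_alpha` may differ.

```python
# Toy illustration of LoRA scaling (assumed standard formulation).
# Not code from the Meta-SecAlign repository; all values are made up.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def effective_weight(w0, lora_b, lora_a, lora_alpha, rank):
    """Return W0 + (lora_alpha / rank) * (B @ A)."""
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


W0 = [[1.0, 0.0], [0.0, 1.0]]  # base ("undefended") weight
B = [[1.0], [2.0]]             # adapter factor, shape (2, rank)
A = [[0.5, -0.5]]              # adapter factor, shape (rank, 2)
RANK = 1                       # hypothetical rank
TRAINED_ALPHA = 8              # the default value stated in the text

undefended = effective_weight(W0, B, A, 0, RANK)            # adapter disabled
halfway = effective_weight(W0, B, A, 4, RANK)               # midpoint
defended = effective_weight(W0, B, A, TRAINED_ALPHA, RANK)  # as trained

print(undefended, halfway, defended)
```

Because the update is linear in `lora_alpha`, setting it to 0 recovers the base weight exactly, and 4 lands halfway between the base weight and the weight at the trained value of 8. This matches the utility-security trade-off described above, under the stated assumption.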

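Tags: adversarial attacks, AI defense mechanisms, Apex, DLL hijacking, reverse DNS lookup, GPT adversarial attacks, Hugging Face, Meta AI, Meta-SecAlign, Petitpotam, PyTorch, SOTA, artificial intelligence, credential scanning, commercial-grade open-source models, foundation models, large language models, tool-calling security, prompt injection defense, machine learning, model safety alignment, deep learning, user-mode hook bypass, web agent security, cybersecurity, reverse engineering tools, privacy protection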