yueliu1999/Awesome-Jailbreak-on-LLMs

GitHub: yueliu1999/Awesome-Jailbreak-on-LLMs

A systematically organized collection of academic resources on LLM jailbreak attacks and defenses, covering papers, code, datasets, and evaluation & analysis.

Stars: 1251 | Forks: 102

# Awesome-Jailbreak-on-LLMs

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, and exciting jailbreak methods on LLMs. It contains papers, code, datasets, evaluations, and analyses. Any additional contributions regarding jailbreak, PRs, and issues are welcome, and we are glad to add you to the contributor list [here](#contributors). For any questions, please contact yliu@u.nus.edu. If you find this repository useful for your research or work, we would greatly appreciate it if you starred this repository and cited our papers [here](#reference). :sparkles:

## Reference

If you find this repository helpful for your research, we would greatly appreciate your citation of our papers. :sparkles:

```
@article{zhuzhenhao_GuardReasoner_Omni,
  title={GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video},
  author={Zhu, Zhenhao and Liu, Yue and Guo, Yanpei and Qu, Wenjie and Chen, Cancan and He, Yufei and Li, Yibo and Chen, Yulin and Wu, Tianyi and Xu, Huiying and others},
  journal={arXiv preprint arXiv:2602.03328},
  year={2026}
}

@article{liuyue_GuardReasoner_VL,
  title={GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning},
  author={Liu, Yue and Zhai, Shengfang and Du, Mingzhe and Chen, Yulin and Cao, Tri and Gao, Hongcheng and Wang, Cheng and Li, Xinfeng and Wang, Kun and Fang, Junfeng and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2505.11049},
  year={2025}
}

@article{liuyue_GuardReasoner,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Liu, Yue and Gao, Hongcheng and Zhai, Shengfang and Jun, Xia and Wu, Tianyi and Xue, Zhiwei and Chen, Yulin and Kawaguchi, Kenji and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2501.18492},
  year={2025}
}

@article{liuyue_FlipAttack,
  title={FlipAttack: Jailbreak LLMs via Flipping},
  author={Liu, Yue and He, Xiaoxin and Xiong, Miao and Fu, Jinlan and Deng, Shumin and Hooi, Bryan},
  journal={arXiv preprint arXiv:2410.02832},
  year={2024}
}

@article{wang2025safety,
  title={Safety in Large Reasoning Models: A Survey},
  author={Wang, Cheng and Liu, Yue and Li, Baolong and Zhang, Duzhen and Li, Zhongzhi and Fang, Junfeng},
  journal={arXiv preprint arXiv:2504.17704},
  year={2025}
}
```

## Bookmarks

- [Jailbreak Attack](#jailbreak-attack)
  - [Attack on LRMs](#attack-on-lrms)
  - [Black-box Attack](#black-box-attack)
  - [White-box Attack](#white-box-attack)
  - [Multi-turn Attack](#multi-turn-attack)
  - [Attack on RAG-based LLM](#attack-on-rag-based-llm)
  - [Multi-modal Attack](#multi-modal-attack)
- [Jailbreak Defense](#jailbreak-defense)
  - [Learning-based Defense](#learning-based-defense)
  - [Strategy-based Defense](#strategy-based-defense)
  - [Guard Model](#guard-model)
  - [Moderation API](#moderation-api)
- [Evaluation & Analysis](#evaluation--analysis)
- [Application](#application)

## Papers

### Jailbreak Attack

#### Attack on LRMs

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2025.08 | **Jinx: Unlimited LLMs for Probing Alignment Failures** | arXiv | [Link](https://arxiv.org/abs/2508.08243) | [Model](https://huggingface.co/Jinx-org) |
| 2025.06 | **ExtendAttack: Attacking Servers of LRMs via Extending Reasoning** | AAAI'26 | [Link](https://arxiv.org/abs/2506.13737) | [Link](https://github.com/zzh-thu-22/ExtendAttack) |
| 2025.03 | **Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models** | arXiv | [Link](https://arxiv.org/abs/2503.01781) | - |
| 2025.02 | **OverThink: Slowdown Attacks on Reasoning LLMs** | arXiv | [Link](https://arxiv.org/abs/2502.02542) | [Link](https://github.com/akumar2709/OVERTHINK_public) |
| 2025.02 | **BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack** | arXiv | [Link](https://arxiv.org/abs/2502.12202) | [Link](https://github.com/zihao-ai/BoT) |
| 2025.02 | **H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking** | arXiv | [Link](https://arxiv.org/abs/2502.12893) | [Link](https://github.com/dukeceicenter/jailbreak-reasoning-openai-o1o3-deepseek-r1) |
| 2025.02 | **A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos** | arXiv | [Link](https://arxiv.org/abs/2502.15806) | - |

#### Black-box Attack

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2025.10 | **BreakFun: Jailbreaking LLMs via Schema Exploitation** | arXiv | [Link](https://arxiv.org/abs/2510.17904) | - |
| 2025.07 | **Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models** | arXiv | [Link](https://arxiv.org/abs/2507.05248) | [Link](https://github.com/Dtc7w3PQ/Response-Attack) |
| 2025.05 | **Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection** | ICML'25 | [Link](https://arxiv.org/pdf/2411.01077) | [Link](https://github.com/zhipeng-wei/EmojiAttack) |
| 2025.05 | **FlipAttack: Jailbreak LLMs via Flipping (FlipAttack)** | ICML'25 | [Link](https://arxiv.org/pdf/2410.02832) | [Link](https://github.com/yueliu1999/FlipAttack) |
| 2025.03 | **Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy (JOOD)** | CVPR'25 | [Link](http://arxiv.org/pdf/2503.20823) | [Link](https://github.com/naver-ai/JOOD) |
| 2025.02 | **StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models** | arXiv | [Link](https://arxiv.org/abs/2502.11853) | [Link](https://github.com/StructTransform/Benchmark) |
| 2025.01 | **Understanding and Enhancing the Transferability of Jailbreaking Attacks** | ICLR'25 | [Link](https://arxiv.org/pdf/2502.03052) | [Link](https://github.com/tmllab/2025_ICLR_PiF) |
| 2024.11 | **The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models** | arXiv | [Link](https://arxiv.org/pdf/2411.11407) | [Link](https://github.com/YancyKahn/DarkCite) |
| 2024.11 | **Playing Language Game with LLMs Leads to Jailbreaking** | arXiv | [Link](https://arxiv.org/pdf/2411.12762v1) | [Link](https://anonymous.4open.science/r/encode_jailbreaking_anonymous-B4C4/README.md) |
| 2024.11 | **GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs (GASP)** | arXiv | [Link](https://arxiv.org/pdf/2411.14133v1) | [Link](https://github.com/llm-gasp/gasp) |
| 2024.11 | **LLM STINGER: Jailbreaking LLMs using RL fine-tuned LLMs** | arXiv | [Link](https://arxiv.org/pdf/2411.08862v1) | - |
| 2024.11 | **SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt** | arXiv | [Link](https://arxiv.org/pdf/2411.06426v1) | [Link](https://anonymous.4open.science/r/JailBreakAttack-4F3B/) |
| 2024.11 | **Diversity Helps Jailbreak Large Language Models** | arXiv | [Link](https://arxiv.org/pdf/2411.04223v1) | - |
| 2024.11 | **Plentiful Jailbreaks with String Compositions** | arXiv | [Link](https://arxiv.org/pdf/2411.01084v1) | - |
| 2024.11 | **Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models** | arXiv | [Link](https://arxiv.org/pdf/2410.23558v1) | [Link](https://github.com/YQYANG2233/Large-Language-Model-Break-AI) |
| 2024.11 | **Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring** | arXiv | [Link](https://arxiv.org/pdf/2410.21083) | - |
| 2024.10 | **Endless Jailbreaks with Bijection** | arXiv | [Link](https://arxiv.org/pdf/2410.01294v1) | - |
| 2024.10 | **Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models** | arXiv | [Link](https://arxiv.org/pdf/2410.04190v1) | - |
| 2024.10 | **You Know What I'm Saying: Jailbreak Attack via Implicit Reference** | arXiv | [Link](https://arxiv.org/pdf/2410.03857v2) | [Link](https://github.com/Lucas-TY/llm_Implicit_reference) |
| 2024.10 | **Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation** | arXiv | [Link](https://arxiv.org/pdf/2410.11317v1) | [Link](https://github.com/qizhangli/Adversarial-Prompt-Translator) |
| 2024.10 | **AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (AutoDAN-Turbo)** | arXiv | [Link](https://arxiv.org/pdf/2410.05295) | [Link](https://github.com/SaFoLab-WISC/AutoDAN-Turbo) |
| 2024.10 | **PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach (PathSeeker)** | arXiv | [Link](https://www.arxiv.org/pdf/2409.14177) | - |
| 2024.10 | **Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity** | arXiv | [Link](https://arxiv.org/pdf/2409.18708) | [Link](https://github.com/Serbernari/ToxASCII) |
| 2024.09 | **AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs** | arXiv | [Link](https://arxiv.org/pdf/2409.07503) | [Link](https://github.com/Yummy416/AdaPPA) |
| 2024.09 | **Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs** | arXiv | [Link](https://arxiv.org/pdf/2409.14866) | - |
| 2024.09 | **Jailbreaking Large Language Models with Symbolic Mathematics** | arXiv | [Link](https://arxiv.org/pdf/2409.11445v1) | - |
| 2024.08 | **Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues** | ACL Findings'24 | [Link](https://aclanthology.org/2024.findings-acl.304) | [Link](https://github.com/czycurefun/IJBR) |
| 2024.08 | **Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models** | arXiv | [Link](https://arxiv.org/pdf/2408.14866) | - |
| 2024.08 | **Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles** | arXiv | [Link](https://arxiv.org/pdf/2408.11182) | - |
| 2024.08 | **h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment (h4rm3l)** | arXiv | [Link](https://arxiv.org/pdf/2408.04811) | [Link](https://mdoumbouya.github.io/h4rm3l/) |
| 2024.08 | **EnJa: Ensemble Jailbreak on Large Language Models (EnJa)** | arXiv | [Link](https://arxiv.org/pdf/2408.03603) | - |
| 2024.07 | **Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack** | arXiv | [Link](https://arxiv.org/pdf/2406.11682) | [Link](https://github.com/THU-KEG/Knowledge-to-Jailbreak/) |
| 2024.07 | **LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models** | arXiv | [Link](https://arxiv.org/pdf/2407.16205) | - |
| 2024.07 | **Single Character Perturbations Break LLM Alignment** | arXiv | [Link](https://arxiv.org/pdf/2407.03232) | [Link](https://github.com/hannah-aught/space_attack) |
| 2024.07 | **A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses** | arXiv | [Link](https://arxiv.org/abs/2407.02551) | - |
| 2024.07 | **Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection (Virtual Context)** | arXiv | [Link](https://arxiv.org/pdf/2406.19845) | - |
| 2024.07 | **SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack (SoP)** | arXiv | [Link](https://arxiv.org/pdf/2407.01902) | [Link](https://github.com/Yang-Yan-Yang-Yan/SoP) |
| 2024.06 | **Jailbreaking as a Reward Misspecification Problem** | ICLR'25 | [Link](https://arxiv.org/abs/2406.14393) | [Link](https://github.com/zhxieml/remiss-jailbreak) |
| 2024.06 | **Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (I-FSJ)** | NeurIPS'24 | [Link](https://arxiv.org/abs/2406.01288) | [Link](https://github.com/sail-sg/I-FSJ) |
| 2024.06 | **When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search (RLbreaker)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2406.08705) | - |
| 2024.06 | **Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast (Agent Smith)** | ICML'24 | [Link](https://arxiv.org/pdf/2402.08567) | [Link](https://github.com/sail-sg/Agent-Smith) |
| 2024.06 | **Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation** | ICML'24 | [Link](https://arxiv.org/pdf/2406.20053) | - |
| 2024.06 | **ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (ArtPrompt)** | ACL'24 | [Link](https://arxiv.org/pdf/2402.11753) | [Link](https://github.com/uw-nsl/ArtPrompt) |
| 2024.06 | **From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings (ASETF)** | arXiv | [Link](https://arxiv.org/pdf/2402.16006) | - |
| 2024.06 | **CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion (CodeAttack)** | ACL'24 | [Link](https://arxiv.org/pdf/2403.07865) | - |
| 2024.06 | **Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (DRA)** | USENIX Security'24 | [Link](https://arxiv.org/pdf/2402.18104) | [Link](https://github.com/LLM-DRA/DRA/) |
| 2024.06 | **AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (AutoJailbreak)** | arXiv | [Link](https://arxiv.org/pdf/2406.08705) | - |
| 2024.06 | **Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks** | arXiv | [Link](https://arxiv.org/pdf/2404.02151) | [Link](https://github.com/tml-epfl/llm-adaptive-attacks) |
| 2024.06 | **GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (GPTFUZZER)** | arXiv | [Link](https://arxiv.org/abs/2309.10253) | [Link](https://github.com/sherdencooper/GPTFuzz) |
| 2024.06 | **A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM)** | NAACL'24 | [Link](https://arxiv.org/abs/2311.08268) | [Link](https://github.com/NJUNLP/ReNeLLM) |
| 2024.06 | **QROA: A Black-Box Query-Response Optimization Attack on LLMs (QROA)** | arXiv | [Link](https://arxiv.org/abs/2406.02044) | [Link](https://github.com/qroa/qroa) |
| 2024.06 | **Poisoned LangChain: Jailbreak LLMs by LangChain (PLC)** | arXiv | [Link](https://arxiv.org/pdf/2406.18122) | [Link](https://github.com/CAM-FSS/jailbreak-langchain) |
| 2024.05 | **Multilingual Jailbreak Challenges in Large Language Models** | ICLR'24 | [Link](https://arxiv.org/pdf/2310.06474) | [Link](https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs) |
| 2024.05 | **DeepInception: Hypnotize Large Language Model to Be Jailbreaker (DeepInception)** | EMNLP'24 | [Link](https://arxiv.org/pdf/2311.03191) | [Link](https://github.com/tmlr-group/DeepInception) |
| 2024.05 | **GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (IRIS)** | ACL'24 | [Link](https://arxiv.org/abs/2405.13077) | - |
| 2024.05 | **GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of LLMs (GUARD)** | arXiv | [Link](https://arxiv.org/pdf/2402.03299) | - |
| 2024.05 | **"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (DAN)** | CCS'24 | [Link](https://arxiv.org/pdf/2308.03825) | [Link](https://github.com/verazuo/jailbreak_llms) |
| 2024.05 | **GPT-4 is too smart to be safe: Stealthy chat with llms via cipher (SelfCipher)** | ICLR'24 | [Link](https://arxiv.org/pdf/2308.06463) | [Link](https://github.com/RobustNLP/CipherChat) |
| 2024.05 | **Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (JAM)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2405.20413) | - |
| 2024.05 | **Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICA)** | arXiv | [Link](https://arxiv.org/pdf/2310.06387) | - |
| 2024.04 | **Many-shot jailbreaking (MSJ)** | NeurIPS'24 Anthropic | [Link](https://www-cdn.anthropic.com/af5633c94ed2beb282f6a53c595eb437e8e7b630/Many_Shot_Jailbreaking__2024_04_02_0936.pdf) | - |
| 2024.04 | **PANDORA: Detailed LLM jailbreaking via collaborated phishing agents with decomposed reasoning (PANDORA)** | ICLR Workshop'24 | [Link](https://openreview.net/pdf?id=9o06ugFxIj) | - |
| 2024.04 | **FuzzLLM: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models (FuzzLLM)** | ICASSP'24 | [Link](https://arxiv.org/pdf/2309.05274) | [Link](https://github.com/RainJamesY/FuzzLLM) |
| 2024.04 | **Sandwich attack: Multi-language mixture adaptive attack on llms (Sandwich attack)** | TrustNLP'24 | [Link](https://arxiv.org/pdf/2404.07242) | - |
| 2024.03 | **Tastle: Distract large language models for automatic jailbreak attack (TASTLE)** | arXiv | [Link](https://arxiv.org/pdf/2403.08424) | - |
| 2024.03 | **DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (DrAttack)** | EMNLP'24 | [Link](https://arxiv.org/pdf/2402.16914) | [Link](https://github.com/xirui-li/DrAttack) |
| 2024.02 | **PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (PRP)** | arXiv | [Link](https://arxiv.org/pdf/2402.15911) | - |
| 2024.02 | **CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (CodeChameleon)** | arXiv | [Link](https://arxiv.org/pdf/2402.16717) | [Link](https://github.com/huizhang-L/CodeChameleon) |
| 2024.02 | **PAL: Proxy-Guided Black-Box Attack on Large Language Models (PAL)** | arXiv | [Link](https://arxiv.org/abs/2402.09674) | [Link](https://github.com/chawins/pal) |
| 2024.02 | **Jailbreaking Proprietary Large Language Models using Word Substitution Cipher** | arXiv | [Link](https://arxiv.org/pdf/2402.10601) | - |
| 2024.02 | **Query-Based Adversarial Prompt Generation** | arXiv | [Link](https://arxiv.org/pdf/2402.12329) | - |
| 2024.02 | **Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks (Contextual Interaction Attack)** | arXiv | [Link](https://arxiv.org/pdf/2402.09177) | - |
| 2024.02 | **Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs (SMJ)** | arXiv | [Link](https://arxiv.org/pdf/2402.14872) | - |
| 2024.02 | **Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking** | NAACL'24 | [Link](https://arxiv.org/pdf/2311.09827) | [Link](https://github.com/luka-group/CognitiveOverload) |
| 2024.01 | **Low-Resource Languages Jailbreak GPT-4** | NeurIPS Workshop'24 | [Link](https://arxiv.org/pdf/2310.02446) | - |
| 2024.01 | **How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP)** | arXiv | [Link](https://arxiv.org/pdf/2401.06373) | [Link](https://github.com/CHATS-lab/persuasive_jailbreaker) |
| 2023.12 | **Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (TAP)** | NeurIPS'24 | [Link](https://arxiv.org/abs/2312.02119) | [Link](https://github.com/RICommunity/TAP) |
| 2023.12 | **Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs** | arXiv | [Link](https://arxiv.org/pdf/2312.04782) | - |
| 2023.12 | **Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition** | EMNLP'23 | [Link](https://aclanthology.org/2023.emnlp-main.302/) | - |
| 2023.11 | **Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (Persona)** | NeurIPS Workshop'23 | [Link](https://arxiv.org/pdf/2311.03348) | - |
| 2023.10 | **Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2310.08419) | [Link](https://github.com/patrickrchao/JailbreakingLLMs) |
| 2023.10 | **Adversarial Demonstration Attacks on Large Language Models (advICL)** | EMNLP'24 | [Link](https://arxiv.org/pdf/2305.14950) | - |
| 2023.10 | **MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots (MASTERKEY)** | NDSS'24 | [Link](https://arxiv.org/pdf/2307.08715) | [Link](https://github.com/LLMSecurity/MasterKey) |
| 2023.10 | **Attack Prompt Generation for Red Teaming and Defending Large Language Models (SAP)** | EMNLP'23 | [Link](https://arxiv.org/pdf/2310.12505) | [Link](https://github.com/Aatrox103/SAP) |
| 2023.10 | **An LLM can Fool Itself: A Prompt-Based Adversarial Attack (PromptAttack)** | ICLR'24 | [Link](https://arxiv.org/pdf/2310.13345) | [Link](https://github.com/GodXuxilie/PromptAttack) |
| 2023.09 | **Multi-step Jailbreaking Privacy Attacks on ChatGPT (MJP)** | EMNLP Findings'23 | [Link](https://arxiv.org/pdf/2304.05197) | [Link](https://github.com/HKUST-KnowComp/LLM-Multistep-Jailbreak) |
| 2023.09 | **Open Sesame! Universal Black Box Jailbreaking of Large Language Models (GA)** | Applied Sciences'24 | [Link](https://arxiv.org/abs/2309.01446) | - |
| 2023.05 | **Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection** | CCS'23 | [Link](https://arxiv.org/pdf/2302.12173) | [Link](https://github.com/greshake/llm-security) |
| 2022.11 | **Ignore Previous Prompt: Attack Techniques For Language Models (PromptInject)** | NeurIPS Workshop'22 | [Link](https://arxiv.org/pdf/2211.09527) | [Link](https://github.com/agencyenterprise/PromptInject) |

#### White-box Attack

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2025.08 | **Don't Say No: Jailbreaking LLM by Suppressing Refusal (DSN)** | ACL'25 | [Link](https://arxiv.org/abs/2404.16369) | [Link](https://github.com/DSN-2024/DSN) |
| 2025.03 | **Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints** | arXiv | [Link]() | [Link](https://github.com/thu-coai/TransferAttack) |
| 2025.02 | **Improved techniques for optimization-based jailbreaking on large language models (I-GCG)** | ICLR'25 | [Link](https://arxiv.org/pdf/2405.21018) | [Link](https://github.com/jiaxiaojunQAQ/I-GCG) |
| 2024.12 | **Efficient Adversarial Training in LLMs with Continuous Attacks** | NeurIPS'24 | [Link](https://arxiv.org/abs/2405.15589) | [Link](https://github.com/sophie-xhonneux/Continuous-AdvTrain) |
| 2024.11 | **AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts** | arXiv | [Link](https://arxiv.org/pdf/2410.22143v1) | - |
| 2024.11 | **DROJ: A Prompt-Driven Attack against Large Language Models** | arXiv | [Link](https://arxiv.org/pdf/2411.09125) | [Link](https://github.com/Leon-Leyang/LLM-Safeguard) |
| 2024.11 | **SQL Injection Jailbreak: a structural disaster of large language models** | arXiv | [Link](https://arxiv.org/pdf/2411.01565) | [Link](https://github.com/weiyezhimeng/SQL-Injection-Jailbreak) |
| 2024.10 | **Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks** | arXiv | [Link](https://arxiv.org/pdf/2410.04234) | - |
| 2024.10 | **AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation** | arXiv | [Link](https://arxiv.org/pdf/2410.09040v1) | [Link](https://github.com/UCSC-VLAA/AttnGCG-attack) |
| 2024.10 | **Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting** | arXiv | [Link](https://arxiv.org/pdf/2410.10150v1) | - |
| 2024.10 | **Boosting Jailbreak Transferability for Large Language Models (SI-GCG)** | arXiv | [Link](https://arxiv.org/pdf/2410.15645v1) | - |
| 2024.10 | **Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities (ADV-LLM)** | arXiv | [Link](https://arxiv.org/pdf/2410.18469v1) | [Link](https://github.com/SunChungEn/ADV-LLM) |
| 2024.08 | **Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation (JVD)** | arXiv | [Link](https://arxiv.org/pdf/2408.10668) | - |
| 2024.08 | **Jailbreak Open-Sourced Large Language Models via Enforced Decoding (EnDec)** | ACL'24 | [Link](https://aclanthology.org/2024.acl-long.299.pdf) | - |
| 2024.07 | **Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data** | COLM'24 | [Link](https://arxiv.org/pdf/2404.05530) | - |
| 2024.07 | **Refusal in Language Models Is Mediated by a Single Direction** | arXiv | [Link](https://arxiv.org/pdf/2406.11717) | [Link](https://github.com/andyrdt/refusal_direction) |
| 2024.07 | **Revisiting Character-level Adversarial Attacks for Language Models** | ICML'24 | [Link](https://arxiv.org/abs/2405.04346) | [Link](https://github.com/LIONS-EPFL/Charmer) |
| 2024.07 | **Badllama 3: removing safety finetuning from Llama 3 in minutes (Badllama 3)** | arXiv | [Link](https://arxiv.org/pdf/2407.01376) | - |
| 2024.07 | **SOS! Soft Prompt Attack Against Open-Source Large Language Models** | arXiv | [Link](https://arxiv.org/abs/2407.03160) | - |
| 2024.06 | **COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability (COLD-Attack)** | ICML'24 | [Link](https://arxiv.org/pdf/2402.08679) | [Link](https://github.com/Yu-Fangxu/COLD-Attack) |
| 2024.05 | **Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs** | arXiv | [Link](https://arxiv.org/abs/2405.14189) | - |
| 2024.05 | **Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2405.09113) | - |
| 2024.05 | **AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (AutoDAN)** | ICLR'24 | [Link](https://arxiv.org/pdf/2310.04451) | [Link](https://github.com/SheltonLiu-N/AutoDAN) |
| 2024.05 | **AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs (AmpleGCG)** | arXiv | [Link](https://arxiv.org/pdf/2404.07921) | [Link](https://github.com/OSU-NLP-Group/AmpleGCG) |
| 2024.05 | **Boosting jailbreak attack with momentum (MAC)** | ICLR Workshop'24 | [Link](https://arxiv.org/pdf/2405.01229) | [Link](https://github.com/weizeming/momentum-attack-llm) |
| 2024.04 | **AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs (AdvPrompter)** | arXiv | [Link](https://arxiv.org/pdf/2404.16873) | [Link](https://github.com/facebookresearch/advprompter) |
| 2024.03 | **Universal Jailbreak Backdoors from Poisoned Human Feedback** | ICLR'24 | [Link](https://openreview.net/pdf?id=GxCGsxiAaK) | - |
| 2024.02 | **Attacking large language models with projected gradient descent (PGD)** | arXiv | [Link](https://arxiv.org/pdf/2402.09154) | - |
| 2024.02 | **Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering (JRE)** | arXiv | [Link](https://arxiv.org/pdf/2401.06824) | - |
| 2024.02 | **Curiosity-driven red-teaming for large language models (CRT)** | arXiv | [Link](https://arxiv.org/pdf/2402.19464) | [Link](https://github.com/Improbable-AI/curiosity_redteam) |
| 2023.12 | **AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (AutoDAN)** | arXiv | [Link](https://arxiv.org/abs/2310.15140) | [Link](https://github.com/rotaryhammer/code-autodan) |
| 2023.10 | **Catastrophic jailbreak of open-source llms via exploiting generation** | ICLR'24 | [Link](https://arxiv.org/pdf/2310.06987) | [Link](https://github.com/Princeton-SysML/Jailbreak_LLM) |
| 2023.06 | **Automatically Auditing Large Language Models via Discrete Optimization (ARCA)** | ICML'23 | [Link](https://proceedings.mlr.press/v202/jones23a/jones23a.pdf) | [Link](https://github.com/ejones313/auditing-llms) |
| 2023.07 | **Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG)** | arXiv | [Link](https://arxiv.org/pdf/2307.15043) | [Link](https://github.com/llm-attacks/llm-attacks) |

#### Multi-turn Attack

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2025.04 | **Multi-Turn Jailbreaking Large Language Models via Attention Shifting** | AAAI'25 | [Link](https://ojs.aaai.org/index.php/AAAI/article/view/34553) | - |
| 2025.04 | **X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents** | arXiv | [Link](https://arxiv.org/pdf/2504.13203) | [Link](https://github.com/salman-lui/x-teaming) |
| 2025.04 | **Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning** | arXiv | [Link](https://arxiv.org/pdf/2504.01278) | - |
| 2025.03 | **Foot-In-The-Door: A Multi-turn Jailbreak for LLMs** | arXiv | [Link](https://arxiv.org/pdf/2502.19820) | [Link](https://github.com/Jinxiaolong1129/Foot-in-the-door-Jailbreak) |
| 2025.03 | **Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search** | arXiv | [Link](https://arxiv.org/pdf/2503.10619) | - |
| 2024.11 | **MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue** | arXiv | [Link](https://arxiv.org/pdf/2411.03814) | - |
| 2024.10 | **Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models (JSP)** | arXiv | [Link](https://arxiv.org/pdf/2410.11459v1) | [Link](https://github.com/YangHao97/JigSawPuzzles) |
| 2024.10 | **Multi-round jailbreak attack on large language models** | arXiv | [Link](https://arxiv.org/pdf/2410.11533v1) | - |
| 2024.10 | **Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues** | arXiv | [Link](https://arxiv.org/abs/2410.10700) | [Link](https://github.com/renqibing/ActorAttack) |
| 2024.10 | **Automated Red Teaming with GOAT: the Generative Offensive Agent Tester** | arXiv | [Link](https://arxiv.org/pdf/2410.01606) | - |
| 2024.09 | **LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet** | arXiv | [Link](https://arxiv.org/pdf/2408.15221) | [Link](https://huggingface.co/datasets/ScaleAI/mhj) |
| 2024.09 | **RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking** | arXiv | [Link](https://arxiv.org/pdf/2409.17458) | [Link](https://github.com/kriti-hippo/red_queen) |
| 2024.08 | **FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)** | arXiv | [Link](https://arxiv.org/abs/2408.16163) | - |
| 2024.08 | **Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks** | arXiv | [Link](https://arxiv.org/pdf/2409.00137) | [Link](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets) |
| 2024.05 | **CoA: Context-Aware based Chain of Attack for Multi-Turn Dialogue LLM (CoA)** | arXiv | [Link](https://arxiv.org/pdf/2405.05610) | [Link](https://github.com/YancyKahn/CoA) |
| 2024.04 | **Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (Crescendo)** | Microsoft Azure | [Link](https://arxiv.org/pdf/2404.01833) | - |

#### Attack on RAG-based LLM

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2024.09 | **Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking** | arXiv | [Link](https://arxiv.org/pdf/2409.08045) | [Link](https://github.com/StavC/UnleashingWorms-ExtractingData) |
| 2024.02 | **Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning (Pandora)** | arXiv | [Link](https://arxiv.org/pdf/2402.08416) | - |

#### Multi-modal Attack

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2024.11 | **Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey** | arXiv | [Link](https://arxiv.org/pdf/2411.09259) | [Link](https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak) |
| 2024.10 | **Chain-of-Jailbreak Attack for Generation Models via Editing Step by Step** | arXiv | [Link](https://arxiv.org/pdf/2410.03869) | - |
| 2024.10 | **ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation** | NeurIPS'24 | [Link](https://nips.cc/virtual/2024/poster/94287) | - |
| 2024.08 | **Jailbreaking Text-to-Image Models with LLM-Based Agents (Atlas)** | arXiv | [Link](https://arxiv.org/pdf/2408.00523) | - |
| 2024.07 | **Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything** | arXiv | [Link](https://arxiv.org/pdf/2407.02534) | - |
| 2024.06 | **Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt** | arXiv | [Link](https://arxiv.org/pdf/2406.04031) | [Link](https://github.com/NY1024/BAP-Jailbreak-Vision-Language-Models-via-Bi-Modal-Adversarial-Prompt) |
| 2024.05 | **Voice Jailbreak Attacks Against GPT-4o** | arXiv | [Link](https://arxiv.org/pdf/2405.19103) | [Link](https://github.com/TrustAIRLab/VoiceJailbreakAttack) |
| 2024.05 | **Automatic Jailbreaking of the Text-to-Image Generative AI Systems** | ICML'24 Workshop | [Link](https://arxiv.org/abs/2405.16567) | [Link](https://github.com/Kim-Minseon/APGP) |
| 2024.04 | **Image hijacks: Adversarial images can control generative models at runtime** | arXiv | [Link](https://arxiv.org/pdf/2309.00236) | [Link](https://github.com/euanong/image-hijacks) |
| 2024.03 | **An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models (CroPA)** | ICLR'24 | [Link](https://arxiv.org/pdf/2403.09766) | [Link](https://github.com/Haochen-Luo/CroPA) |
| 2024.03 | **Jailbreak in pieces: Compositional adversarial attacks on multi-modal language model** | ICLR'24 | [Link](https://openreview.net/pdf?id=plmBsXHxgR) | - |
| 2024.03 | **Rethinking model ensemble in transfer-based adversarial attacks** | ICLR'24 | [Link](https://arxiv.org/pdf/2303.09105) | [Link](https://github.com/huanranchen/AdversarialAttacks) |
| 2024.02 | **VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models** | NeurIPS'23 | [Link](https://arxiv.org/abs/2310.04655) | [Link](https://github.com/ericyinyzy/VLAttack) |
| 2024.02 | **Jailbreaking Attack against Multimodal Large Language Model** | arXiv | [Link](https://arxiv.org/pdf/2402.02309) | - |
| 2024.01 | **Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts** | arXiv | [Link](https://arxiv.org/pdf/2311.09127) | - |
| 2024.03 | **Visual Adversarial Examples Jailbreak Aligned Large Language Models** | AAAI'24 | [Link](https://ojs.aaai.org/index.php/AAAI/article/view/30150/32038) | - |
| 2023.12 | **OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization (OT-Attack)** | arXiv | [Link](https://arxiv.org/pdf/2312.04403) | - |
| 2023.12 | **FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts (FigStep)** | arXiv | [Link](https://arxiv.org/pdf/2311.05608) | [Link](https://github.com/ThuCCSLab/FigStep) |
| 2023.11 | **SneakyPrompt: Jailbreaking Text-to-image Generative Models** | S&P'24 | [Link](https://arxiv.org/pdf/2305.12082) | [Link](https://github.com/Yuchen413/text2image_safety) |
| 2023.11 | **On Evaluating Adversarial Robustness of Large Vision-Language Models** | NeurIPS'23 | [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/a97b58c4f7551053b0512f92244b0810-Paper-Conference.pdf) | [Link](https://github.com/yunqing-me/AttackVLM) |
| 2023.10 | **How Robust is Google's Bard to Adversarial Image Attacks?** | arXiv | [Link](https://arxiv.org/pdf/2309.11751) | [Link](https://github.com/thu-ml/Attack-Bard) |
| 2023.08 | **AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning (AdvCLIP)** | ACM MM'23 | [Link](https://arxiv.org/pdf/2308.07026) | [Link](https://github.com/CGCL-codes/AdvCLIP) |
| 2023.07 | **Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (SGA)** | ICCV'23 | [Link](https://openaccess.thecvf.com/content/ICCV2023/papers/Lu_Set-level_Guidance_Attack_Boosting_Adversarial_Transferability_of_Vision-Language_Pre-training_Models_ICCV_2023_paper.pdf) | [Link](https://github.com/Zoky-2020/SGA) |
| 2023.07 | **On the Adversarial Robustness of Multi-Modal Foundation Models** | ICCV Workshop'23 | [Link](https://openaccess.thecvf.com/content/ICCV2023W/AROW/papers/Schlarmann_On_the_Adversarial_Robustness_of_Multi-Modal_Foundation_Models_ICCVW_2023_paper.pdf) | - |
| 2022.10 | **Towards Adversarial Attack on Vision-Language Pre-training Models** | arXiv | [Link](https://arxiv.org/pdf/2206.09391) | [Link](https://github.com/adversarial-for-goodness/Co-Attack) |

### Jailbreak Defense

#### Learning-based Defense

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2025.12 | **Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring** | arXiv'25 | [Link](https://arxiv.org/abs/2512.12069) | [Link](https://github.com/sarendis56/Jailbreak_Detection_RCS) |
| 2025.04 | **JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model** | COLM'25 | [Link](https://arxiv.org/pdf/2504.03770) | [Link](https://github.com/ShenzheZhu/JailDAM) |
| 2024.12 | **Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models** | arXiv'24 | [Link](https://arxiv.org/pdf/2412.17034) | - |
| 2024.10 | **Safety-Aware Fine-Tuning of Large Language Models** | arXiv'24 | [Link](https://arxiv.org/pdf/2410.10014) | - |
| 2024.10 | **MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks** | AAAI'24 | [Link](https://arxiv.org/pdf/2409.17699) | - |
| 2024.08 | **BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger (BaThe)** | arXiv | [Link](https://arxiv.org/pdf/2408.09093) | - |
| 2024.07 | **DART: Deep Adversarial Automated Red Teaming for LLM Safety** | arXiv | [Link](https://arxiv.org/abs/2407.03876) | - |
| 2024.07 | **Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge (Eraser)** | arXiv | [Link](https://arxiv.org/pdf/2404.05880) | [Link](https://github.com/ZeroNLP/Eraser) |
| 2024.07 | **Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks** | arXiv | [Link](https://arxiv.org/abs/2407.02855) | [Link](https://github.com/thu-coai/SafeUnlearning) |
| 2024.06 | **Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs** | arXiv | [Link](https://arxiv.org/pdf/2406.06622) | - |
| 2024.06 | **Jatmo: Prompt Injection Defense by Task-Specific Finetuning (Jatmo)** | arXiv | [Link](https://arxiv.org/pdf/2312.17673) | [Link](https://github.com/wagner-group/prompt-injection-defense) |
| 2024.06 | **Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization** | ACL'24 | [Link](https://arxiv.org/pdf/2311.09096) | [Link](https://github.com/thu-coai/JailbreakDefense_GoalPriority) |
| 2024.06 | **Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment** | NeurIPS'24 | [Link](https://jayfeather1024.github.io/Finetuning-Jailbreak-Defense/) | [Link](https://github.com/Jayfeather1024/Backdoor-Enhanced-Alignment) |
| 2024.06 | **On Prompt-Driven Safeguarding for Large Language Models (DRO)** | ICML'24 | [Link](https://arxiv.org/pdf/2401.18018) | [Link](https://github.com/chujiezheng/LLM-Safeguard) |
| 2024.06 | **Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (RPO)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2401.17263) | - |
| 2024.06 | **Fight Back Against Jailbreaking via Prompt Adversarial Tuning (PAT)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2402.06255) | [Link](https://github.com/rain152/PAT) |
| 2024.05 | **Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching (SAFEPATCHING)** | arXiv | [Link](https://arxiv.org/pdf/2405.13820) | - |
| 2024.05 | **Detoxifying Large Language Models via Knowledge Editing (DINM)** | ACL'24 | [Link](https://arxiv.org/pdf/2403.14472) | [Link](https://github.com/zjunlp/EasyEdit/blob/main/examples/SafeEdit.md) |
| 2024.05 | **Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing** | arXiv | [Link](https://arxiv.org/abs/2405.18166) | [Link](https://github.com/ledllm/ledllm) |
| 2023.11 | **MART: Improving LLM Safety with Multi-round Automatic Red-Teaming (MART)** | ACL'24 | [Link](https://arxiv.org/pdf/2311.07689) | - |
| 2023.11 | **Baseline defenses for adversarial attacks against aligned language models** | arXiv | [Link](https://arxiv.org/pdf/2309.00614) | - |
| 2023.10 | **Safe rlhf: Safe reinforcement learning from human feedback** | arXiv | [Link](https://arxiv.org/pdf/2310.12773) | [Link](https://github.com/PKU-Alignment/safe-rlhf) |
| 2023.08 | **Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-INSTRUCT)** | arXiv | [Link](https://arxiv.org/pdf/2308.09662) | [Link](https://github.com/declare-lab/red-instruct) |
| 2022.04 | **Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback** | Anthropic | [Link](https://arxiv.org/pdf/2204.05862) | - |

#### Strategy-based Defense

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2025.12 | **Compressed but Compromised? A Study of Jailbreaking in Compressed LLMs** | NeurIPS-W | [Link](https://openreview.net/pdf?id=OkNfb8SmLh) | [Blog](https://namburisrinath.medium.com/compressed-but-compromised-a-study-of-jailbreaking-in-compressed-llms-02a6e40aaf17) |
| 2025.09 | **LLM Jailbreak Detection for (Almost) Free!** | arXiv | [Link](http://arxiv.org/abs/2509.14558) | [Link](https://github.com/GuoruiC/FJD) |
| 2025.05 | **Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking** | arXiv | [Link](https://arxiv.org/pdf/2502.12970) | [Link](https://github.com/chuhac/Reasoning-to-Defend) |
| 2024.11 | **Rapid Response: Mitigating LLM Jailbreaks with a Few Examples** | arXiv | [Link](https://arxiv.org/pdf/2411.07494v1) | [Link](https://github.com/rapidresponsebench/rapidresponsebench) |
| 2024.10 | **RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process (RePD)** | arXiv | [Link](https://arxiv.org/pdf/2410.08660v1) | - |
| 2024.10 | **Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models (G4D)** | arXiv | [Link](https://arxiv.org/pdf/2410.17922v1) | [Link](https://github.com/IDEA-XL/G4D) |
| 2024.10 | **Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models** | arXiv | [Link](https://arxiv.org/html/2410.02298v1) | - |
| 2024.09 | **HSF: Defending against Jailbreak Attacks with Hidden State Filtering** | arXiv | [Link](https://arxiv.org/html/2409.03788v1) | [Link](https://anonymous.4open.science/r/Hidden-State-Filtering-8652/) |
| 2024.08 | **EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models (EEG-Defender)** | arXiv | [Link](https://arxiv.org/pdf/2408.11308) | - |
| 2024.08 | **Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks (PG)** | arXiv | [Link](https://arxiv.org/pdf/2408.08924) | [Link](https://github.com/weiyezhimeng/Prefix-Guidance) |
| 2024.08 | **Self-Evaluation as a Defense Against Adversarial Attacks on LLMs (Self-Evaluation)** | arXiv | [Link](https://arxiv.org/pdf/2407.03234) | [Link](https://github.com/Linlt-leon/self-eval) |
| 2024.06 | **Defending LLMs against Jailbreaking Attacks via Backtranslation (Backtranslation)** | ACL Findings'24 | [Link](https://arxiv.org/pdf/2402.16459) | [Link](https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation) |
| 2024.06 | **SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding (SafeDecoding)** | ACL'24 | [Link](https://arxiv.org/pdf/2402.08983) | [Link](https://github.com/uw-nsl/SafeDecoding) |
| 2024.06 | **Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM** | ACL'24 | [Link](https://arxiv.org/pdf/2309.14348) | - |
| 2024.06 | **A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM)** | NAACL'24 | [Link](https://arxiv.org/abs/2311.08268) | [Link](https://github.com/NJUNLP/ReNeLLM) |
| 2024.06 | **SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks** | arXiv | [Link](https://arxiv.org/pdf/2310.03684) | [Link](https://github.com/arobey1/smooth-llm) |
| 2024.05 | **Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (Dual-critique)** | ACL'24 | [Link](https://arxiv.org/pdf/2305.13733) | [Link](https://github.com/DevoAllen/INDust) |
| 2024.05 | **PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition (PARDEN)** | ICML'24 | [Link](https://arxiv.org/pdf/2405.07932) | [Link](https://github.com/Ed-Zh/PARDEN) |
| 2024.05 | **LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked** | ICLR Tiny Paper'24 | [Link](https://arxiv.org/pdf/2308.07308) | [Link](https://github.com/poloclub/llm-self-defense) |
| 2024.05 | **GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis (GradSafe)** | ACL'24 | [Link](https://arxiv.org/pdf/2402.13494) | [Link](https://github.com/xyq7/GradSafe) |
| 2024.05 | **Multilingual Jailbreak Challenges in Large Language Models** | ICLR'24 | [Link](https://arxiv.org/pdf/2310.06474) | [Link](https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs) |
| 2024.05 | **Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2403.00867) | - |
| 2024.05 | **AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks** | arXiv | [Link](https://arxiv.org/pdf/2403.04783) | [Link](https://github.com/XHMY/AutoDefense) |
| 2024.05 | **Bergeron: Combating adversarial attacks through a conscience-based alignment framework (Bergeron)** | arXiv | [Link](https://arxiv.org/pdf/2312.00029) | [Link](https://github.com/matthew-pisano/Bergeron) |
| 2024.05 | **Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICD)** | arXiv | [Link](https://arxiv.org/pdf/2310.06387) | - |
| 2024.04 | **Protecting your llms with information bottleneck** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2404.13968) | [Link](https://github.com/zichuan-liu/IB4LLMs) |
| 2024.04 | **Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning** | arXiv | [Link](https://arxiv.org/pdf/2401.10862) | [Link](https://github.com/CrystalEye42/eval-safety) |
| 2024.02 | **Certifying LLM Safety against Adversarial Prompting** | arXiv | [Link](https://arxiv.org/pdf/2309.02705) | [Link](https://github.com/aounon/certified-llm-safety) |
| 2024.02 | **Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement** | arXiv | [Link](https://arxiv.org/pdf/2402.15180) | - |
| 2024.02 | **Defending large language models against jailbreak attacks via semantic smoothing (SEMANTICSMOOTH)** | arXiv | [Link](https://arxiv.org/pdf/2402.16192) | [Link](https://github.com/UCSB-NLP-Chang/SemanticSmooth) |
| 2024.01 | **Intention Analysis Makes LLMs A Good Jailbreak Defender (IA)** | arXiv | [Link](https://arxiv.org/pdf/2401.06561) | [Link](https://github.com/alphadl/SafeLLM_with_IntentionAnalysis) |
| 2024.01 | **How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP)** | ACL'24 | [Link](https://arxiv.org/pdf/2401.06373) | [Link](https://github.com/CHATS-lab/persuasive_jailbreaker) |
| 2023.12 | **Defending ChatGPT against jailbreak attack via self-reminders (Self-Reminder)** | Nature Machine Intelligence | [Link](https://xyq7.github.io/papers/NMI-JailbreakDefense.pdf) | [Link](https://github.com/yjw1029/Self-Reminder/) |
| 2023.11 | **Detecting language model attacks with perplexity** | arXiv | [Link](https://arxiv.org/pdf/2308.14132) | - |
| 2023.10 | **RAIN: Your Language Models Can Align Themselves without Finetuning (RAIN)** | ICLR'24 | [Link](https://arxiv.org/pdf/2309.07124) | [Link](https://github.com/SafeAILab/RAIN) |

#### Guard Model

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2026.02 | **GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video** | arXiv'26 | [Link](https://arxiv.org/abs/2602.03328) | [Link](https://github.com/zzh-thu-22/GuardReasoner-Omni) |
| 2025.12 | **OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning** | arXiv'25 | [Link](https://arxiv.org/abs/2512.02306) | - |
| 2025.10 | **Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection (PSR)** | EMNLP'25 | [Link](https://arxiv.org/abs/2510.01270) | [Link](https://github.com/VietHoang1512/PSR) |
| 2025.05 | **GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning (GuardReasoner-VL)** | NeurIPS'25 | [Link](https://arxiv.org/abs/2505.11049) | [Link](https://github.com/yueliu1999/GuardReasoner-VL/) |
| 2025.04 | **X-Guard: Multilingual Guard Agent for Content Moderation (X-Guard)** | arXiv'25 | [Link](https://arxiv.org/pdf/2504.08848) | [Link](https://github.com/UNHSAILLab/X-Guard) |
| 2025.02 | **ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails (ThinkGuard)** | arXiv'25 | [Link](https://arxiv.org/abs/2502.13458) | [Link](https://github.com/luka-group/ThinkGuard) |
| 2025.02 | **Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming** | arXiv'25 | [Link](https://arxiv.org/abs/2501.18837) | - |
| 2025.01 | **GuardReasoner: Towards Reasoning-based LLM Safeguards (GuardReasoner)** | ICLR Workshop'25 | [Link](https://arxiv.org/pdf/2501.18492) | [Link](https://github.com/yueliu1999/GuardReasoner/) |
| 2024.12 | **Lightweight Safety Classification Using Pruned Language Models (Sentence-BERT)** | arXiv'24 | [Link](https://arxiv.org/pdf/2412.13435) | - |
| 2024.11 | **GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding (GuardFormer)** | Meta | [Link](https://openreview.net/pdf?id=vr31i9pzQk) | - |
| 2024.11 | **Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations (LLaMA Guard 3 Vision)** | Meta | [Link](https://arxiv.org/pdf/2411.10414) | [Link](https://github.com/meta-llama/llama-recipes/tree/main/recipes/responsible_ai/llama_guard) |
| 2024.11 | **AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails (Aegis2.0)** | Nvidia, NeurIPS'24 Workshop | [Link](https://openreview.net/pdf?id=0MvGCv35wi) | - |
| 2024.11 | **Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings (Sentence-BERT)** | arXiv'24 | [Link](https://arxiv.org/pdf/2411.14398) | - |
| 2024.11 | **STAND-Guard: A Small Task-Adaptive Content Moderation Model (STAND-Guard)** | Microsoft | [Link](https://arxiv.org/pdf/2411.05214v1) | - |
| 2024.10 | **VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data** | arXiv | [Link](https://arxiv.org/html/2410.00296v1) | - |
| 2024.09 | **AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts (Aegis)** | Nvidia | [Link](https://arxiv.org/abs/2404.05993) | [Link](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) |
| 2024.09 | **Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (LLaMA Guard 3)** | Meta | [Link](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) | [Link](https://huggingface.co/meta-llama/Llama-Guard-3-1B) |
| 2024.08 | **ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma)** | Google | [Link](https://arxiv.org/pdf/2407.21772) | [Link](https://huggingface.co/google/shieldgemma-2b) |
| 2024.07 | **WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (WildGuard)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2406.18495) | [Link](https://github.com/allenai/wildguard) |
| 2024.06 | **GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning (GuardAgent)** | arXiv'24 | [Link](https://arxiv.org/pdf/2406.09187) | - |
| 2024.06 | **R2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning (R2-Guard)** | arXiv | [Link](https://arxiv.org/abs/2407.05557) | [Link](https://github.com/kangmintong/R-2-Guard) |
| 2024.04 | **Llama Guard 2** | Meta | [Link](https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-guard-2/) | [Link](https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md) |
| 2024.03 | **AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting (AdaShield)** | ECCV'24 | [Link](https://arxiv.org/pdf/2403.09513) | [Link](https://github.com/SaFoLab-WISC/AdaShield) |
| 2023.12 | **Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (LLaMA Guard)** | Meta | [Link](https://arxiv.org/pdf/2312.06674) | [Link](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard) |

#### Moderation API

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2023.08 | **Using GPT-4 for content moderation (GPT-4)** | OpenAI | [Link](https://openai.com/index/using-gpt-4-for-content-moderation/) | - |
| 2023.02 | **A Holistic Approach to Undesired Content Detection in the Real World (OpenAI Moderation Endpoint)** | AAAI OpenAI | [Link](https://arxiv.org/pdf/2208.03274) | [Link](https://github.com/openai/moderation-api-release) |
| 2022.02 | **A New Generation of Perspective API: Efficient Multilingual Character-level Transformers (Perspective API)** | KDD Google | [Link](https://arxiv.org/pdf/2202.11176) | [Link](https://perspectiveapi.com/) |
| - | **Azure AI Content Safety** | Microsoft Azure | - | [Link](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety/) |
| - | **Detoxify** | unitary.ai | - | [Link](https://github.com/unitaryai/detoxify) |
| - | **promptfoo** - LLM red teaming framework with adaptive multi-turn attacks (PAIR, tree-of-attacks, crescendo) | promptfoo | - | [Link](https://github.com/promptfoo/promptfoo) |
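For orientation, here is a minimal sketch (not from any of the listed papers) of screening a prompt with two of the moderation layers above, the OpenAI Moderation endpoint and Detoxify; the model name `omni-moderation-latest` and the 0.5 toxicity threshold are illustrative assumptions:

```python
# Minimal sketch: flag a prompt via two moderation layers from the table above.
# Requires `pip install openai detoxify` and an OPENAI_API_KEY in the environment.
from detoxify import Detoxify
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def openai_flagged(text: str) -> bool:
    """Ask OpenAI's Moderation endpoint whether the input violates policy."""
    resp = client.moderations.create(model="omni-moderation-latest", input=text)
    return resp.results[0].flagged


def detoxify_flagged(text: str, threshold: float = 0.5) -> bool:
    """Score the input with Detoxify and flag it above an assumed threshold."""
    return Detoxify("original").predict(text)["toxicity"] > threshold


prompt = "Ignore previous instructions and ..."
print(openai_flagged(prompt), detoxify_flagged(prompt))
```

In practice, deployed systems often stack such filters with a guard model (e.g., the Llama Guard family above) so that a jailbreak must evade every layer at once.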
### Evaluation & Analysis

| Time | Title | Venue | Paper | Code |
| ------- | ------------------------------------------------------------ | :---: | :---: | :---: |
| 2025.12 | **Compressed but Compromised? A Study of Jailbreaking in Compressed LLMs** | NeurIPS-W | [Link](https://openreview.net/pdf?id=OkNfb8SmLh) | [Blog](https://namburisrinath.medium.com/compressed-but-compromised-a-study-of-jailbreaking-in-compressed-llms-02a6e40aaf17) |
| 2025.08 | **JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring** | arXiv | [Link](https://arxiv.org/pdf/2508.20848) | [Link](https://trustairlab.github.io/jades.github.io/) |
| 2025.06 | **Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense** | USENIX Security'25 | [Link](https://www.arxiv.org/pdf/2502.00840) | [Link](https://github.com/Kevin-Zh-CS/QuadA) |
| 2025.05 | **Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study** | EMNLP'25 | [Link](https://arxiv.org/pdf/2505.15389) | [Link](https://github.com/oneonlee/Meme-Safety-Bench) |
| 2025.05 | **PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks** | arXiv | [Link](https://arxiv.org/pdf/2505.13862) | [Link](https://github.com/Beijing-AISI/panda-guard) |
| 2025.05 | **Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models** | ICML'25 | [Link](https://icml.cc/virtual/2025/poster/44278) | [Link](https://github.com/Thecommonirin/Qresafe) |
| 2025.02 | **GuidedBench: Equipping Jailbreak Evaluation with Guidelines** | arXiv | [Link](https://arxiv.org/pdf/2502.16903) | [Link](https://github.com/SproutNan/AI-Safety_Benchmark) |
| 2024.12 | **Agent-SafetyBench: Evaluating the Safety of LLM Agents** | arXiv | [Link](https://arxiv.org/pdf/2412.14470) | [Link](https://github.com/thu-coai/Agent-SafetyBench) |
| 2024.11 | **Global Challenge for Safe and Secure LLMs Track 1** | arXiv | [Link](https://arxiv.org/pdf/2411.14502v1) | - |
| 2024.11 | **JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit** | arXiv | [Link](https://arxiv.org/pdf/2411.11114v1) | - |
| 2024.11 | **The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense** | arXiv | [Link](https://arxiv.org/pdf/2411.08410v1) | - |
| 2024.11 | **HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment** | arXiv | [Link](https://arxiv.org/pdf/2411.06835v1) | - |
| 2024.11 | **ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain** | arXiv | [Link](https://arxiv.org/pdf/2411.16736) | [Link](https://github.com/HaochenZhao/SafeAgent4Chem) |
| 2024.11 | **GuardBench: A Large-Scale Benchmark for Guardrail Models** | EMNLP'24 | [Link](https://aclanthology.org/2024.emnlp-main.1022.pdf) | [Link](https://github.com/AmenRa/guardbench) |
| 2024.11 | **What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks** | arXiv | [Link](https://arxiv.org/pdf/2411.03343v1) | [Link](https://github.com/NLie2/what_features_jailbreak_LLMs) |
| 2024.11 | **Benchmarking LLM Guardrails in Handling Multilingual Toxicity** | arXiv | [Link](https://arxiv.org/pdf/2410.22153v1) | [Link](https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html) |
| 2024.10 | **JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework** | arXiv | [Link](https://arxiv.org/pdf/2410.12855) | [Link](https://github.com/usail-hkust/Jailjudge) |
| 2024.10 | **Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems** | arXiv | [Link](https://arxiv.org/pdf/2410.13334v1) | [Link](https://anonymous.4open.science/r/PCJailbreak-F2B0/README.md) |
| 2024.10 | **A Realistic Threat Model for Large Language Model Jailbreaks** | arXiv | [Link](https://arxiv.org/pdf/2410.16222v1) | [Link](https://github.com/valentyn1boreiko/llm-threat-model) |
| 2024.10 | **Adversarial Suffixes May Be Features Too!** | arXiv | [Link](https://arxiv.org/pdf/2410.00451) | [Link](https://github.com/suffix-maybe-feature/adver-suffix-maybe-features) |
| 2024.09 | **JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark** | arXiv | [Link](https://openreview.net/pdf?id=cLYvhd0pDY) | [Link](https://anonymous.4open.science/r/public_multiagents_judge-66CB/README.md) |
| 2024.09 | **Multimodal Pragmatic Jailbreak on Text-to-image Models** | arXiv | [Link](https://arxiv.org/pdf/2409.19149) | [Link](https://github.com/multimodalpragmatic/multimodalpragmatic/tree/main) |
| 2024.08 | **ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma)** | arXiv | [Link](https://arxiv.org/pdf/2407.21772) | [Link](https://huggingface.co/google/shieldgemma-2b) |
| 2024.08 | **MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models (MMJ-Bench)** | arXiv | [Link](https://arxiv.org/pdf/2408.08464) | [Link](https://github.com/thunxxx/MLLM-Jailbreak-evaluation-MMJ-bench) |
| 2024.08 | **Mission Impossible: A Statistical Perspective on Jailbreaking LLMs** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2408.01420) | - |
| 2024.07 | **Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)** | arXiv | [Link](https://arxiv.org/abs/2407.14937) | [Link](https://github.com/dapurv5/awesome-llm-red-teaming) |
| 2024.07 | **JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks** | arXiv | [Link](https://arxiv.org/abs/2404.03027) | [Link](https://github.com/EddyLuo1232/JailBreakV_28K) |
| 2024.07 | **Jailbreak Attacks and Defenses Against Large Language Models: A Survey** | arXiv | [Link](https://arxiv.org/abs/2407.04295) | - |
| 2024.06 | **"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak** | arXiv | [Link](https://arxiv.org/pdf/2406.11668) | [Link](https://github.com/Meirtz/BabyBLUE-llm) |
| 2024.06 | **WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (WildTeaming)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2406.18510) | [Link](https://github.com/allenai/wildteaming) |
| 2024.06 | **From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking** | arXiv | [Link](https://arxiv.org/pdf/2406.14859) | - |
| 2024.06 | **AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways** | arXiv | [Link](https://arxiv.org/pdf/2406.02630) | - |
| 2024.06 | **MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models (MM-SafetyBench)** | arXiv | [Link](https://arxiv.org/pdf/2311.17600) | - |
| 2024.06 | **ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (VITC)** | ACL'24 | [Link](https://arxiv.org/pdf/2402.11753) | [Link](https://github.com/uw-nsl/ArtPrompt) |
| 2024.06 | **Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2406.09324) | [Link](https://github.com/usail-hkust/Bag_of_Tricks_for_LLM_Jailbreaking) |
| 2024.06 | **JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models (JailbreakZoo)** | arXiv | [Link](https://arxiv.org/pdf/2407.01599) | [Link](https://github.com/Allen-piexl/JailbreakZoo) |
### Evaluation & Analysis

| Time | Title | Venue | Paper | Code |
| ---- | ------------------------------------------------------------ | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: |
| 2025.12 | **Compressed but Compromised? A Study of Jailbreaking in Compressed LLMs** | NeurIPS-W | [Link](https://openreview.net/pdf?id=OkNfb8SmLh) | [Blog](https://namburisrinath.medium.com/compressed-but-compromised-a-study-of-jailbreaking-in-compressed-llms-02a6e40aaf17) |
| 2025.08 | **JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring** | arXiv | [Link](https://arxiv.org/pdf/2508.20848) | [Link](https://trustairlab.github.io/jades.github.io/) |
| 2025.06 | **Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense** | USENIX Security'25 | [Link](https://www.arxiv.org/pdf/2502.00840) | [Link](https://github.com/Kevin-Zh-CS/QuadA) |
| 2025.05 | **Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study** | EMNLP'25 | [Link](https://arxiv.org/pdf/2505.15389) | [Link](https://github.com/oneonlee/Meme-Safety-Bench) |
| 2025.05 | **PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks** | arXiv | [Link](https://arxiv.org/pdf/2505.13862) | [Link](https://github.com/Beijing-AISI/panda-guard) |
| 2025.05 | **Assessing the Safety Risks of Quantized Large Language Models and Quantization-Aware Safety Patching** | ICML'25 | [Link](https://icml.cc/virtual/2025/poster/44278) | [Link](https://github.com/Thecommonirin/Qresafe) |
| 2025.02 | **GuidedBench: Equipping Jailbreak Evaluation with Guidelines** | arXiv | [Link](https://arxiv.org/pdf/2502.16903) | [Link](https://github.com/SproutNan/AI-Safety_Benchmark) |
| 2024.12 | **Agent-SafetyBench: Evaluating the Safety of LLM Agents** | arXiv | [Link](https://arxiv.org/pdf/2412.14470) | [Link](https://github.com/thu-coai/Agent-SafetyBench) |
| 2024.11 | **Global Challenge for Safe and Secure LLMs Track 1** | arXiv | [Link](https://arxiv.org/pdf/2411.14502v1) | - |
| 2024.11 | **JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit** | arXiv | [Link](https://arxiv.org/pdf/2411.11114v1) | - |
| 2024.11 | **The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense** | arXiv | [Link](https://arxiv.org/pdf/2411.08410v1) | - |
| 2024.11 | **HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment** | arXiv | [Link](https://arxiv.org/pdf/2411.06835v1) | - |
| 2024.11 | **ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain** | arXiv | [Link](https://arxiv.org/pdf/2411.16736) | [Link](https://github.com/HaochenZhao/SafeAgent4Chem) |
| 2024.11 | **GuardBench: A Large-Scale Benchmark for Guardrail Models** | EMNLP'24 | [Link](https://aclanthology.org/2024.emnlp-main.1022.pdf) | [Link](https://github.com/AmenRa/guardbench) |
| 2024.11 | **What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks** | arXiv | [Link](https://arxiv.org/pdf/2411.03343v1) | [Link](https://github.com/NLie2/what_features_jailbreak_LLMs) |
| 2024.11 | **Benchmarking LLM Guardrails in Handling Multilingual Toxicity** | arXiv | [Link](https://arxiv.org/pdf/2410.22153v1) | [Link](https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html) |
| 2024.10 | **JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework** | arXiv | [Link](https://arxiv.org/pdf/2410.12855) | [Link](https://github.com/usail-hkust/Jailjudge) |
| 2024.10 | **Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems** | arXiv | [Link](https://arxiv.org/pdf/2410.13334v1) | [Link](https://anonymous.4open.science/r/PCJailbreak-F2B0/README.md) |
| 2024.10 | **A Realistic Threat Model for Large Language Model Jailbreaks** | arXiv | [Link](https://arxiv.org/pdf/2410.16222v1) | [Link](https://github.com/valentyn1boreiko/llm-threat-model) |
| 2024.10 | **Adversarial Suffixes May Be Features Too!** | arXiv | [Link](https://arxiv.org/pdf/2410.00451) | [Link](https://github.com/suffix-maybe-feature/adver-suffix-maybe-features) |
| 2024.09 | **JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework** | arXiv | [Link](https://openreview.net/pdf?id=cLYvhd0pDY) | [Link](https://anonymous.4open.science/r/public_multiagents_judge-66CB/README.md) |
| 2024.09 | **Multimodal Pragmatic Jailbreak on Text-to-image Models** | arXiv | [Link](https://arxiv.org/pdf/2409.19149) | [Link](https://github.com/multimodalpragmatic/multimodalpragmatic/tree/main) |
| 2024.08 | **ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma)** | arXiv | [Link](https://arxiv.org/pdf/2407.21772) | [Link](https://huggingface.co/google/shieldgemma-2b) |
| 2024.08 | **MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models (MMJ-Bench)** | arXiv | [Link](https://arxiv.org/pdf/2408.08464) | [Link](https://github.com/thunxxx/MLLM-Jailbreak-evaluation-MMJ-bench) |
| 2024.08 | **Mission Impossible: A Statistical Perspective on Jailbreaking LLMs** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2408.01420) | - |
| 2024.07 | **Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)** | arXiv | [Link](https://arxiv.org/abs/2407.14937) | [Link](https://github.com/dapurv5/awesome-llm-red-teaming) |
| 2024.07 | **JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks** | arXiv | [Link](https://arxiv.org/abs/2404.03027) | [Link](https://github.com/EddyLuo1232/JailBreakV_28K) |
| 2024.07 | **Jailbreak Attacks and Defenses Against Large Language Models: A Survey** | arXiv | [Link](https://arxiv.org/abs/2407.04295) | - |
| 2024.06 | **"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak** | arXiv | [Link](https://arxiv.org/pdf/2406.11668) | [Link](https://github.com/Meirtz/BabyBLUE-llm) |
| 2024.06 | **WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (WildTeaming)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2406.18510) | [Link](https://github.com/allenai/wildteaming) |
| 2024.06 | **From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking** | arXiv | [Link](https://arxiv.org/pdf/2406.14859) | - |
| 2024.06 | **AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways** | arXiv | [Link](https://arxiv.org/pdf/2406.02630) | - |
| 2024.06 | **MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models (MM-SafetyBench)** | arXiv | [Link](https://arxiv.org/pdf/2311.17600) | - |
| 2024.06 | **ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (VITC)** | ACL'24 | [Link](https://arxiv.org/pdf/2402.11753) | [Link](https://github.com/uw-nsl/ArtPrompt) |
| 2024.06 | **Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2406.09324) | [Link](https://github.com/usail-hkust/Bag_of_Tricks_for_LLM_Jailbreaking) |
| 2024.06 | **JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models (JailbreakZoo)** | arXiv | [Link](https://arxiv.org/pdf/2407.01599) | [Link](https://github.com/Allen-piexl/JailbreakZoo) |
| 2024.06 | **Fundamental Limitations of Alignment in Large Language Models** | arXiv | [Link](https://arxiv.org/pdf/2304.11082) | - |
| 2024.06 | **JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models (JailbreakBench)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2404.01318) | [Link](https://github.com/JailbreakBench/jailbreakbench) |
| 2024.06 | **Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis** | arXiv | [Link](https://arxiv.org/pdf/2406.10794) | [Link](https://github.com/yuplin2333/representation-space-jailbreak) |
| 2024.06 | **JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models (JailbreakEval)** | arXiv | [Link](https://arxiv.org/pdf/2406.09321) | [Link](https://github.com/ThuCCSLab/JailbreakEval) |
| 2024.05 | **Rethinking How to Evaluate Language Model Jailbreak** | arXiv | [Link](https://arxiv.org/pdf/2404.06407) | [Link](https://github.com/controllability/jailbreak-evaluation) |
| 2024.05 | **Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (INDust)** | arXiv | [Link](https://arxiv.org/pdf/2305.13733) | [Link](https://github.com/DevoAllen/INDust) |
| 2024.05 | **Prompt Injection Attack Against LLM-Integrated Applications** | arXiv | [Link](https://arxiv.org/pdf/2306.05499) | - |
| 2024.05 | **Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks** | LREC-COLING'24 | [Link](https://arxiv.org/pdf/2305.14965) | [Link](https://github.com/AetherPrior/TrickLLM) |
| 2024.05 | **LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study** | NDSS'24 | [Link](https://arxiv.org/pdf/2402.13457) | - |
| 2024.05 | **Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study** | arXiv | [Link](https://arxiv.org/pdf/2305.13860) | - |
| 2024.05 | **Detoxifying Large Language Models via Knowledge Editing (SafeEdit)** | ACL'24 | [Link](https://arxiv.org/pdf/2403.14472) | [Link](https://github.com/zjunlp/EasyEdit/blob/main/examples/SafeEdit.md) |
| 2024.04 | **JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models (JailbreakLens)** | arXiv | [Link](https://arxiv.org/pdf/2404.08793) | - |
| 2024.03 | **How (Un)ethical Are Instruction-Centric Responses of LLMs? Unveiling the Vulnerabilities of Safety Guardrails to Harmful Queries (TECHHAZARDQA)** | arXiv | [Link](https://arxiv.org/pdf/2402.15302) | [Link](https://huggingface.co/datasets/SoftMINER-Group/TechHazardQA) |
| 2024.03 | **Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models** | USENIX Security | [Link](https://arxiv.org/pdf/2403.17336) | - |
| 2024.03 | **EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models (EasyJailbreak)** | arXiv | [Link](https://arxiv.org/pdf/2403.12171) | [Link](https://github.com/EasyJailbreak/EasyJailbreak) |
| 2024.02 | **Comprehensive Assessment of Jailbreak Attacks Against LLMs** | arXiv | [Link](https://arxiv.org/abs/2402.05668) | - |
| 2024.02 | **SPML: A DSL for Defending Language Models Against Prompt Attacks** | arXiv | [Link](https://arxiv.org/pdf/2402.11755) | - |
| 2024.02 | **Coercing LLMs to Do and Reveal (Almost) Anything** | arXiv | [Link](https://arxiv.org/pdf/2402.14020) | - |
| 2024.02 | **A StrongREJECT for Empty Jailbreaks (StrongREJECT)** | NeurIPS'24 | [Link](https://arxiv.org/pdf/2402.10260) | [Link](https://github.com/alexandrasouly/strongreject) |
| 2024.02 | **ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages** | ACL'24 | [Link](https://arxiv.org/pdf/2402.10753) | [Link](https://github.com/Junjie-Ye/ToolSword) |
| 2024.02 | **HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (HarmBench)** | arXiv | [Link](https://arxiv.org/pdf/2402.04249) | [Link](https://github.com/centerforaisafety/HarmBench) |
| 2023.12 | **Goal-Oriented Prompt Attack and Safety Evaluation for LLMs** | arXiv | [Link](https://arxiv.org/pdf/2309.11830) | [Link](https://github.com/liuchengyuan123/CPAD) |
| 2023.12 | **The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness** | arXiv | [Link](https://arxiv.org/pdf/2401.00287) | - |
| 2023.12 | **A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models** | UbiSec'23 | [Link](https://arxiv.org/pdf/2312.10982) | - |
| 2023.11 | **Summon a Demon and Bind It: A Grounded Theory of LLM Red Teaming in the Wild** | arXiv | [Link](https://arxiv.org/pdf/2311.06237) | - |
| 2023.11 | **How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs** | arXiv | [Link](https://arxiv.org/pdf/2311.16101) | [Link](https://github.com/UCSC-VLAA/vllm-safety-benchmark) |
| 2023.11 | **Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles** | arXiv | [Link](https://arxiv.org/pdf/2311.14876) | - |
| 2023.10 | **Explore, Establish, Exploit: Red Teaming Language Models from Scratch** | arXiv | [Link](https://arxiv.org/pdf/2306.09442) | - |
| 2023.10 | **Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks** | arXiv | [Link](https://arxiv.org/pdf/2310.10844) | - |
| 2023.10 | **Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (HEx-PHI)** | ICLR'24 (oral) | [Link](https://arxiv.org/pdf/2310.03693) | [Link](https://github.com/LLM-Tuning-Safety/LLMs-Finetuning-Safety) |
| 2023.08 | **Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-EVAL)** | arXiv | [Link](https://arxiv.org/pdf/2308.09662) | [Link](https://github.com/declare-lab/red-instruct) |
| 2023.08 | **Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities** | arXiv | [Link](https://arxiv.org/pdf/2308.12833) | - |
| 2023.08 | **From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy** | IEEE Access | [Link](https://ieeexplore.ieee.org/document/10198233?denied=) | - |
| 2023.07 | **Jailbroken: How Does LLM Safety Training Fail? (Jailbroken)** | NeurIPS'23 | [Link](https://arxiv.org/pdf/2307.02483#page=1.01) | - |
| 2023.07 | **LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?** | arXiv | [Link](https://arxiv.org/pdf/2307.10719) | - |
| 2023.07 | **Universal and Transferable Adversarial Attacks on Aligned Language Models (AdvBench)** | arXiv | [Link](https://arxiv.org/pdf/2307.15043) | [Link](https://github.com/llm-attacks/llm-attacks) |
| 2023.06 | **DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models** | NeurIPS'23 | [Link](https://blogs.qub.ac.uk/digitallearning/wp-content/uploads/sites/332/2024/01/A-comprehensive-Assessment-of-Trustworthiness-in-GPT-Models.pdf) | [Link](https://decodingtrust.github.io/) |
| 2023.04 | **Safety Assessment of Chinese Large Language Models** | arXiv | [Link](https://arxiv.org/pdf/2304.10436) | [Link](https://github.com/thu-coai/Safety-Prompts) |
| 2023.02 | **Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks** | arXiv | [Link](https://arxiv.org/pdf/2302.05733) | - |
| 2022.11 | **Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned** | arXiv | [Link](https://arxiv.org/pdf/2209.07858) | - |
| 2022.02 | **Red Teaming Language Models with Language Models** | arXiv | [Link](https://arxiv.org/pdf/2202.03286) | - |
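Most benchmarks in this table report an attack success rate (ASR): the fraction of harmful prompts for which the target model produced a non-refusing, jailbroken response. The sketch below illustrates the metric's shape with a deliberately crude keyword judge; the refusal markers are hypothetical stand-ins for the LLM-based judges that frameworks such as JAILJUDGE, StrongREJECT, and JailbreakEval implement.

```python
# Minimal sketch of ASR with a crude keyword-based judge (illustrative only;
# the marker list is an assumption, not taken from any benchmark above).
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def is_jailbroken(response: str) -> bool:
    """Crude judge: count any response without a refusal marker as a success."""
    lowered = response.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

def attack_success_rate(responses: list[str]) -> float:
    """ASR = (# successful jailbreaks) / (# attacked prompts)."""
    if not responses:
        return 0.0
    return sum(is_jailbroken(r) for r in responses) / len(responses)

print(attack_success_rate(["I'm sorry, I can't help with that.", "Step 1: ..."]))  # 0.5
```

Keyword judges like this are known to over- and under-count, which is precisely why several of the papers above (e.g., Rethinking How to Evaluate Language Model Jailbreak) study the judging step itself.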
### Application

| Time | Title | Venue | Paper | Code |
| ---- | ------------------------------------------------------------ | :-----: | :----------------------------------------------------------: | :----------------------------------------------------------: |
| 2025.12 | **Compressed but Compromised? A Study of Jailbreaking in Compressed LLMs** | NeurIPS-W | [Link](https://openreview.net/pdf?id=OkNfb8SmLh) | [Blog](https://namburisrinath.medium.com/compressed-but-compromised-a-study-of-jailbreaking-in-compressed-llms-02a6e40aaf17) |
| 2025.08 | **Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures** | arXiv | [Link](https://arxiv.org/abs/2506.07402) | [Link](https://jailflip.github.io/) |
| 2024.11 | **Attacking Vision-Language Computer Agents via Pop-ups** | arXiv | [Link](https://arxiv.org/pdf/2411.02391) | [Link](https://github.com/SALT-NLP/PopupAttack) |
| 2024.10 | **Jailbreaking LLM-Controlled Robots (ROBOPAIR)** | arXiv | [Link](https://arxiv.org/pdf/2410.13691v1) | [Link](https://robopair.org/) |
| 2024.10 | **SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis** | arXiv | [Link](https://arxiv.org/pdf/2410.15641v1) | [Link](https://github.com/IDEA-XL/ChemSafety) |
| 2024.10 | **Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates** | arXiv | [Link](https://arxiv.org/pdf/2410.07137) | [Link](https://github.com/sail-sg/Cheating-LLM-Benchmarks) |
| 2024.09 | **RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems** | arXiv | [Link](https://arxiv.org/pdf/2409.16727) | - |
| 2024.08 | **A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares (APwT)** | arXiv | [Link](https://arxiv.org/pdf/2408.05061) | - |

## Other Related Awesome Repositories

- [Awesome-LM-SSP](https://github.com/ThuCCSLab/Awesome-LM-SSP)
- [llm-sp](https://github.com/chawins/llm-sp)
- [awesome-llm-security](https://github.com/corca-ai/awesome-llm-security)
- [Awesome-LLM-Safety](https://github.com/ydyjya/Awesome-LLM-Safety)
- [Awesome-LRMs-Safety](https://github.com/WangCheng0116/Awesome-LRMs-Safety)
- [Awesome-LALMs-Jailbreak](https://github.com/WangCheng0116/Awesome_LALMs_Jailbreak)

## Contributors

yueliu1999 bhooi zqypku jiaxiaojunQAQ Huang-yihao csyuhao xszheng2020 dapurv5 ZYQ-Zoey77 mdoumbouya xyliugo zky001

(back to top)
