XHMY/AutoDefense

GitHub: XHMY/AutoDefense

# AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

[**Blog**](https://microsoft.github.io/autogen/0.2/blog/2024/03/11/AutoDefense/Defending%20LLMs%20Against%20Jailbreak%20Attacks%20with%20AutoDefense/)

## Installation

```bash
pip install vllm autogen pandas retry openai
```

## Prepare Inference Service Using [vLLM](https://docs.vllm.ai/)

vLLM provides an OpenAI-compatible API server with efficient inference and built-in load balancing across multiple GPUs.

### Start vLLM Server

Start the vLLM server with your desired model. For multi-GPU setups, use `--data-parallel-size` to enable automatic load balancing:

**Single GPU:**

```bash
vllm serve Qwen/Qwen3-1.7B --port 8000
```

**Multiple GPUs (e.g., 2 GPUs with data parallelism):**

```bash
vllm serve Qwen/Qwen3-1.7B --port 8000 --data-parallel-size 2
```

**With tensor parallelism for larger models:**

```bash
vllm serve --port 8000 --tensor-parallel-size 4
```

**Combined tensor and data parallelism (8 GPUs, 2-way TP × 4-way DP):**

```bash
vllm serve --port 8000 --tensor-parallel-size 2 --data-parallel-size 4
```

For more details on data parallel deployment with internal load balancing, see the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/).

### Verify the Server

You can verify the server is running by checking the models endpoint:

```bash
curl http://localhost:8000/v1/models
```

## Response Generation

The responses are generated by the target model served by vLLM (default: `Qwen/Qwen3-1.7B`). Make sure your vLLM server is running before executing the following command.

### Attack Prompts (Harmful)

```bash
python attack/attack.py --model Qwen/Qwen3-1.7B --host 127.0.0.1 --port 8000
```

This command will generate responses using an attack prompt template (default: `--template v1`) loaded from `data/prompt/attack_prompt_template.json`.

To run multiple repetitions, invoke the script multiple times and vary `--output-suffix` and/or `--cache-seed`.
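The repetition advice above can be scripted. The following is a minimal sketch and not part of the repository: it only builds the command lines, assuming `--output-suffix` takes a string and `--cache-seed` takes an integer. The helper name `build_attack_commands`, the `_run{i}` suffix format, and the seed values are hypothetical choices.

```python
import shlex


def build_attack_commands(n_runs, model="Qwen/Qwen3-1.7B", host="127.0.0.1", port=8000):
    """Return one argv list per repetition of attack/attack.py."""
    commands = []
    for i in range(n_runs):
        commands.append([
            "python", "attack/attack.py",
            "--model", model,
            "--host", host,
            "--port", str(port),
            # Assumed value formats: a string suffix and an integer seed.
            "--output-suffix", f"_run{i}",
            "--cache-seed", str(i),
        ])
    return commands


# Print the commands so they can be reviewed before anything is executed.
for argv in build_attack_commands(3):
    print(shlex.join(argv))
```

Once the vLLM server is up, each list could be passed to `subprocess.run(argv, check=True)` to launch the runs one after another.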
### Safe Prompts (Benign)

To generate responses for safe/benign prompts (used for false positive evaluation):

```bash
python attack/attack.py \
    --model Qwen/Qwen3-1.7B \
    --template placeholder \
    --prompts data/prompt/safe_prompts.json \
    --output-prefix safe
```

The `placeholder` template passes prompts through without any attack framing, while `v1` wraps prompts with jailbreak instructions.

## Run Defense Experiments

The following command runs the experiments of 1-Agent, 2-Agent, and 3-Agent defense. The `--chat-file` should point to the harmful outputs generated by `attack/attack.py` (by default saved under `data/harmful_output//`, e.g. `data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json`).

```bash
export AUTOGEN_USE_DOCKER=0
python defense/run_defense_exp.py \
    --model Qwen/Qwen3-1.7B \
    --chat-file data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json
```

### Command Line Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--model` | Target model served by vLLM | `Qwen/Qwen3-1.7B` |
| `--chat-file` | Path to the chat file with harmful outputs | Required |
| `--port` | Port where vLLM server is running | `8000` |
| `--host` | Hostname of the vLLM server | `127.0.0.1` |
| `--output-dir` | Output directory | `data/defense_output/` |
| `--output-suffix` | Suffix for output directory | `""` |
| `--strategies` | Defense strategies to run | `ex-2 ex-3 ex-cot` |
| `--workers` | Number of parallel workers | `128` |
| `--frequency_penalty` | Frequency penalty for generation | `0.0` |
| `--presence_penalty` | Presence penalty for generation | `0.0` |
| `--temperature` | Temperature for generation | `0.7` |

After finishing the defense experiment, the output will appear in `data/defense_output//` (e.g. `data/defense_output/Qwen-Qwen3-1.7B/`).
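The example paths above suggest that the per-model directory name is the model name with `/` replaced by `-`. The sketch below encodes that observation as an assumption; `model_dir_name` and `defense_output_dir` are hypothetical helpers, and the scripts themselves may derive the name differently.

```python
from pathlib import PurePosixPath


def model_dir_name(model):
    # Assumption inferred from the example paths: "/" becomes "-".
    return model.replace("/", "-")


def defense_output_dir(model, root="data/defense_output"):
    # Hypothetical helper that joins the output root and the model directory.
    return PurePosixPath(root) / model_dir_name(model)


print(defense_output_dir("Qwen/Qwen3-1.7B"))  # data/defense_output/Qwen-Qwen3-1.7B
```

This may be handy when passing the directory on to the evaluation step without typing the path by hand.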
## GPT Evaluation (paper uses GPT-4)

Evaluating harmful output defense:

```bash
python evaluator/gpt4_evaluator.py \
    --defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
    --ori_prompt_file_name prompt_dan.json
```

After the evaluation finishes, the results appear in `data/defense_output/Qwen-Qwen3-1.7B/asr.csv`. A `score` value is also added to each defense output in the output `json` file.

`evaluator/gpt4_evaluator.py` uses a GPT model as the evaluator (the original paper uses GPT-4). Set your OpenAI credentials via environment variables (or CLI flags). You can swap the evaluator to a newer GPT model (e.g., GPT-5) via `--model`.

```bash
export OPENAI_API_KEY=...
# optional (only if you use an OpenAI-compatible endpoint):
# export OPENAI_BASE_URL=...
python evaluator/gpt4_evaluator.py \
    --defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
    --ori_prompt_file_name prompt_dan.json \
    --model gpt-4-1106-preview
```

GPT-based evaluation can be costly, so caching is enabled to avoid repeated evaluation.

Safe responses can be evaluated more cheaply, without GPT-4. If you know that all the prompts in your dataset are regular user prompts that should not be rejected, use the following command to evaluate the false positive rate (FPR) of the defense output.

```bash
python evaluator/evaluate_safe.py
```

This finds all output folders in `data/defense_output` whose names contain the keyword `-safe` and evaluates the false positive rate (FPR) for each. The FPR is saved in `data/defense_output/defense_fp.csv`.
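For reference, the FPR described above is the fraction of benign prompts that the defense rejected. The sketch below illustrates that arithmetic and the `-safe` folder filter. It is not the code in `evaluator/evaluate_safe.py`: the function names are hypothetical, and how a response is judged as rejected is not described here, so the flags are taken as given.

```python
def false_positive_rate(rejected_flags):
    """Fraction of benign prompts that the defense rejected."""
    flags = list(rejected_flags)
    if not flags:
        raise ValueError("need at least one benign prompt")
    return sum(1 for flag in flags if flag) / len(flags)


def safe_output_dirs(dir_names, keyword="-safe"):
    """Keep only the directory names that contain the keyword."""
    return [name for name in dir_names if keyword in name]


# One of four benign prompts was rejected, so the FPR is 0.25.
print(false_positive_rate([True, False, False, False]))
```

A lower value means fewer benign prompts were wrongly blocked by the defense.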