wensaqt/marlowe

GitHub: wensaqt/marlowe

面向 LLM 提示词注入的自动化红队测试工具，通过多种对抗性攻击插件自动评估模型安全性并生成风险报告。

Stars: 0 | Forks: 0

# Marlowe **用于 LLM 提示词注入测试的自动化红队工具。** Marlowe 针对目标 LLM 运行对抗性攻击活动，使用 LLM-as-Judge 评估响应，并生成结构化的安全报告。 ``` $ marlowe scan -t http://localhost:11434 -m mistral -S prompt_system/medical_triage.txt ╭─────────────────────────────────────────────╮ │ Marlowe — Automated LLM Red-Team Agent │ ╰─────────────────────────────────────────────╯ ┌──────────────────────────────────────────────────────────────────────┐ │ Attacks: 30 Successes: 12 Success rate: 40.0% Risk score: 7.4/10 │ └──────────────────────────────────────────────────────────────────────┘ Plugin Severity Score Evidence direct_override HIGH 7.4 "...MARLOWE_PWNED..." many_shot MEDIUM 5.2 "...MARLOWE_PWNED..." ``` ## 测试内容 Marlowe 实现了基于插件的攻击引擎，涵盖 [OWASP LLM01](https://owasp.org/www-project-top-10-for-large-language-model-applications/)： | Plugin | 技术 | OWASP | |---|---|---| | `direct_override` | “忽略所有之前的指令” — 12 种变体 | LLM01 | | `many_shot` | N-shot 合规条件化（5 / 10 / 20 个示例） — 12 种变体 | LLM01 | | `obfuscation` | Base64、ROT13、leetspeak、同形字、摩斯密码、十六进制、反转 — 12 种变体 | LLM01 | 每项发现结果均使用 **受 CVSS 启发的指标** (0–10) 进行评分，并映射到相应的严重性级别（严重 / 高 / 中 / 低 / 信息）。 ## 检测 pipeline 对于每个攻击结果，Marlowe 会按顺序运行三个检测层： 1. **Plugin marker** — 确定性探针 (`MARLOWE_PWNED`)，模型必须回显该探针以确认被绕过 2. **LLM-as-Judge** — 第二个模型评估响应是否存在角色偏离 (`SHIFTED` / `NORMAL`)。包含一个 `on_topic` 预过滤器，当模型明确忽略注入时，该过滤器会跳过完整评估 3. **拒绝绕过**** — 交叉检查通常拒绝敏感请求的模型在遭受攻击时是否停止了拒绝 ## 快速开始 **前置要求：** Python 3.11+，本地运行 [Ollama](https://ollama.com)。 ``` git clone https://github.com/wensaqt/marlowe cd marlowe pip install -e . ollama pull mistral marlowe scan -t http://localhost:11434 -m mistral ``` ## 使用方法 ``` # 使用 system prompt 文件进行扫描 marlowe scan -t http://localhost:11434 -m mistral \ -S prompt_system/medical_triage.txt # Inline system prompt marlowe scan -t http://localhost:11434 -m mistral \ -s "You are a customer support agent for Acme Corp." # 仅运行特定 plugins marlowe scan -t http://localhost:11434 -m mistral \ -p direct_override -p obfuscation # 使用 Claude 作为 judge（需要 ANTHROPIC_API_KEY） pip install marlowe[claude] marlowe scan -t http://localhost:11434 -m mistral \ -S prompt_system/medical_triage.txt --judge claude # 禁用 judge（仅 plugin marker + refusal bypass） marlowe scan -t http://localhost:11434 -m mistral --judge none # 更多 variants，更高 concurrency marlowe scan -t http://localhost:11434 -m mistral -v 20 -w 10 ``` ### 所有标志 | 标志 | 简写 | 默认值 | 描述 | |---|---|---|---| | `--target` | `-t` | — | 目标 URL（例如 `http://localhost:11434`） | | `--model` | `-m` | — | 模型名称（例如 `mistral`, `llama3`） | | `--system-prompt` | `-s` | — | 作为内联字符串的系统提示词 | | `--system-prompt-file` | `-S` | — | 从 `.md` 或 `.txt` 文件读取的系统提示词 | | `--plugin` | `-p` | all | 要运行的 Plugin ID（可重复使用） | | `--judge` | `-j` | `ollama` | Judge 后端：`ollama` / `claude` / `none` | | `--variants` | `-v` | `10` | 每个 plugin 的提示词变体数量 | | `--workers` | `-w` | `5` | 最大并发请求数 | | `--output` | `-o` | auto | 报告路径（默认：`reports/`） | | `--name` | `-n` | `marlowe-scan` | 活动名称 | ## Claude Code 集成 (MCP) Marlowe 暴露了一个 [Model Context Protocol](https://modelcontextprotocol.io) 服务器，以便您可以直接从 Claude Code 运行扫描。 ``` pip install marlowe[mcp] claude mcp add marlowe /path/to/marlowe/.venv/bin/marlowe-mcp ``` 然后请求 Claude：*“使用医疗分诊系统提示词，针对 http://localhost:11434 的 mistral 运行 Marlowe 扫描”* — Claude 将调用 `marlowe_scan`，读取报告并充当 judge。可用的 MCP 工具：`marlowe_scan`、`marlowe_list_plugins`、`marlowe_get_report`。 ## 示例系统提示词 `prompt_system/` 目录包含用于测试的即用型系统提示词： | 文件 | 场景 | |---|---| | `medical_triage.txt` | 医疗分诊助手 | | `customer_support.md` | 客户支持代理 | | `coding_assistant.md` | 代码审查助手 | | `finance_advisor.md` | 财务顾问 | ## 架构 ``` marlowe/ ├── core/ Domain models · exceptions · plugin registry ├── targets/ Ollama adapter (OpenAI-compatible) ├── attacks/ Plugin base class + attack plugins ├── engine/ Campaign orchestrator · async runner · baseline profiler ├── analysis/ Vulnerability detector · LLM judge · CVSS scorer · heuristics └── reporting/ JSON + Markdown report generators ``` ## 编写 plugin ``` from marlowe.attacks.base import AnalysisResult, AttackContext, BaseAttackPlugin from marlowe.core.constants import ImpactCategory from marlowe.core.models import AttackPrompt, OWASPCategory, TargetResponse class MyPlugin(BaseAttackPlugin): plugin_id = "my_attack" display_name = "My Custom Attack" description = "What this attack does." category = OWASPCategory.LLM01_PROMPT_INJECTION base_score = 6.0 impact_category = ImpactCategory.INSTRUCTION_BYPASS tags = ("custom",) async def generate_variants(self, ctx: AttackContext) -> list[AttackPrompt]: return [AttackPrompt( plugin_id=self.plugin_id, variant_name="v1", content="my injected prompt", )] def analyze_response( self, response: TargetResponse, prompt: AttackPrompt, ctx: AttackContext ) -> AnalysisResult: success = "MARKER" in response.content return AnalysisResult(success=success, confidence=0.95 if success else 0.0, evidence=None) ``` 在 `pyproject.toml` 中注册： ``` [project.entry-points."marlowe.attacks"] my_attack = "my_package.my_plugin:MyPlugin" ``` ## 报告输出 Marlowe 会将 JSON 报告和由 AI 生成的 Markdown 分析保存在 `reports/` 目录下。 ``` { "summary": { "total_attacks": 30, "successful_attacks": 12, "success_rate": 0.4, "overall_risk_score": 7.4 }, "vulnerabilities": [ { "plugin_id": "direct_override", "severity": "high", "score": { "final": 7.4 }, "evidence": ["...MARLOWE_PWNED..."], "remediation": "Implement input validation and sanitisation..." } ] } ``` ## 参考 - [OWASP LLM Top 10](https://owasp.org/www-project-top-10-for-large-language-model-applications/) - [Many-shot Jailbreaking — Anthropic (2024)](https://www.anthropic.com/research/many-shot-jailbreaking) - [Universal and Transferable Adversarial Attacks on Aligned Language Models — Zou et al. 2023](https://arxiv.org/abs/2307.15043) ## 许可证 MIT

标签：AI风险缓解, DLL 劫持, 大语言模型, 安全测试, 攻击性安全, 红队评估, 网络调试, 自动化, 逆向工具