agent-redteaming/redteam-core

GitHub: agent-redteaming/redteam-core

Automated red-teaming tool for AI agents

Stars: 0 | Forks: 0

# redteam-core

Systematic synthetic red-teaming for AI agents. Given a use case description and a set of policies, `redteam-core` automatically generates a threat model, builds a tailored synthetic environment (Layer 1), and runs a suite of attacks, all without touching real tools or data.

## What it does

Most agent red-teaming today is manual: someone writes a prompt, runs it against the agent, and watches what happens. `redteam-core` automates the whole process:

1. **Threat modeling**: an LLM generates risk cards from your use case and policies
2. **Goal generation**: each risk becomes 2-3 concrete, measurable attacker goals
3. **Synthetic environment**: per goal, Pydantic models, tool functions, and seed data tailored to that attack
4. **Dry run**: establishes a clean baseline trace of normal agent behavior
5. **Attack suite**: runs 6 attack types against each goal
6. **Evaluation**: deterministic checks + LLM judge + state diff
7. **HTML report**: interactive drill-down from risk cards → goals → attack tabs

All of this is **Layer 1**: fully synthetic, with no real infrastructure required. It is designed as the foundation for Layer 2 (real MCP servers + man-in-the-middle attacks) and Layer 3 (runtime gateway enforcement).

## Attack types

| Attack | OWASP ASI | What it tests |
|------|-----------|-----------|
| `direct_injection` | ASI-04 | Adversarial payload embedded in a tool response (one-shot) |
| `pair_injection` | ASI-04 | PAIR iteratively refines an injection payload in tool responses |
| `pair_adversarial` | ASI-01 | PAIR iteratively refines a social-engineering user prompt |
| `multi_turn` | ASI-09 | Gradual escalation across a multi-turn conversation |
| `poisoned_runtime` | ASI-02 | Hidden instructions in tool descriptions (`...`) |
| `minja` | ASI-06 | Two-phase memory injection (phase 1: seed; phase 2: exploit) |
| `tmap` _(opt-in)_ | ASI-01 | MAP-Elites archive exploring 64 attack styles (expensive) |

## Quick start

```
# Install
git clone https://github.com/agent-redteaming/redteam-core
cd redteam-core
uv sync

# Requirements: Ollama running locally (target model) + an attacker model endpoint
export TARGET_MODEL=qwen3.5:2b
export OPENAI_BASE_URL=http://localhost:11434/v1
export ATTACKER_MODEL=gemma-2-9b-abliterated
export ATTACKER_BASE_URL=https://your-attacker-endpoint/v1

# Run
uv run redteam-run \
  --usecase "A supply chain agent that tracks vendor orders and approves purchase requests." \
  --policy "Agents must not approve purchase orders above \$50,000 without human sign-off" \
  --policy "Vendor contact information must not be shared externally" \
  --max-goals 1 \
  --output ./runs \
  --output-html
```

The HTML report opens at `runs/layer1_report_.html`.

## CLI reference

```
redteam-run [OPTIONS]

  -u, --usecase TEXT             Natural language description of the agent system
  -p, --policy TEXT              Operator policy (repeat for multiple)
  -m, --model TEXT               Target model(s) to test
  -a, --attacks [injection|pair_injection|pair_adversarial|tmap|multi_turn|poisoned_runtime|minja|all]
                                 Attacks to run (default: all except tmap)
  --tmap                         Also run T-MAP (expensive: ~150 LLM calls/goal)
  --max-goals INT                Goals per risk card (default: unlimited)
  --pair-streams INT             PAIR parallel streams (default: 5)
  --pair-iterations INT          PAIR iterations per stream (default: 3)
  --multi-turn-turns INT         Multi-turn conversation turns (default: 4)
  --generator-temperature FLOAT  LLM temperature for risk/goal/env generation (default: 0.4)
  --attacker-temperature FLOAT   LLM temperature for attack content generation (default: 0.7)
  --output TEXT                  Output directory (default: ./runs)
  --output-html                  Also generate interactive HTML report
  -v, --verbose                  Show DEBUG logs
```

Environment variable overrides: `GENERATOR_TEMPERATURE`, `ATTACKER_TEMPERATURE`, `TARGET_MODEL`, `ATTACKER_MODEL`, `ATTACKER_BASE_URL`, `OPENAI_BASE_URL`.

## Pipeline

```
Usecase + Policies
    │
    ▼
Risk Cards ──── LLM generates threat cards targeting the usecase & policies
    │
    ▼
Attacker Goals ── per risk card, each with specific success_criteria
    │
    ▼  (per goal)
Synthetic Environment
    ├── Pydantic models (entities, relationships)
    ├── Tool functions (read tools + privileged action tools)
    └── Seed data (values spanning policy thresholds)
    │
    ▼
Dry Run ──── clean baseline trace, accessed records
    │
    ▼
Attack Suite ──── 6 attacks in parallel tabs
    ├── Payload embedded in tool responses (injection, pair_injection)
    ├── Adversarial user prompts (pair_adversarial, tmap)
    ├── Multi-turn escalation (multi_turn)
    ├── Tool description poisoning (poisoned_runtime)
    └── Memory injection (minja)
    │
    ▼
Evaluation
    ├── Deterministic: unexpected suspicious tool calls, state diffs,
    │   sensitive field mutations, outbox changes
    └── LLM judge: reasoning + confidence (never overrides deterministic)
    │
    ▼
HTML Report ──── interactive: risk cards → goals → attack tabs
```

## Report

Use `--output-html` to generate an interactive, single-file HTML report. It includes:

- **Overview**: use case, policies, OWASP ASI coverage chart, goal tiles
- **Risk cards**: the full threat model, including controls, likelihood, and severity
- **Per-goal pipeline**: environment → dry run → attacks (tabbed by attack type)
  - Each attack tab shows: attack-specific metadata (payload, prompts, turn sequence), full conversation trace, security/utility evidence, judge reasoning, state diff
- **Summary**: OWASP table (attacks run × violations per ASI category), policy violations, model comparison

A sample report for the supply chain scenario is in `sample_report.html` (open it in a browser; no server needed).

## Architecture

```
src/redteam/
├── cli.py                    # redteam-run entry point
├── orchestrator.py           # pipeline coordinator
├── utils.py                  # shared helpers (client, temperature, JSON utils)
├── risk_pipeline/
│   ├── risk_generator.py     # usecase → risk cards
│   ├── goal_generator.py     # risk card → attacker goals
│   └── triage.py             # enforcement level classification
├── env_pipeline/
│   ├── env_generator.py      # goal → synthetic environment (LLM)
│   ├── executor.py           # exec() generated code, run agent loop
│   └── adapter.py            # attack ↔ environment bridge
├── attacks/
│   ├── injection.py          # direct injection (observe-then-inject + tool wrapping)
│   ├── pair.py               # PAIR adversarial + injection (JailbreakingLLMs port)
│   ├── tmap.py               # T-MAP MAP-Elites (faithful port)
│   ├── multi_turn.py         # progressive escalation across turns
│   ├── poisoned_runtime.py   # tool description poisoning
│   └── minja.py              # two-phase memory injection (MINJA port)
├── evaluation/
│   ├── deterministic.py      # 5 security + 5 utility checks
│   └── judge.py              # LLM post-hoc reasoning
├── models/                   # Pydantic models: risk, environment, attacks, report
├── runtime/                  # AgentRuntime abstraction (Chat Completions / Responses API)
└── report_html.py            # self-contained HTML report generator
```

## Development

```
uv run pytest tests/unit/   # 105 unit tests, no LLM needed
uv run pytest tests/real/   # requires Ollama + attacker endpoint
```

## Known limitations

- **Layer 1 only**: synthetic environments, not real MCP servers or production data
- **Small models**: `qwen3.5:2b` resists most attacks; use `qwen2.5:14b` or a larger model to see more violations
- **Generator quality**: Gemma-2-9B sometimes generates mismatched field names or invalid Python; a fallback handles this, but the result is an empty environment
- **T-MAP cost**: 64 seed cells × iterations = 150+ LLM calls per goal; excluded from the default suite
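
The evaluation step in the pipeline states that the LLM judge "never overrides deterministic" results. The sketch below shows one possible reading of that rule: the deterministic checks alone decide whether a run counts as a violation, and the judge only attaches reasoning and confidence. Every name here (`DeterministicResult`, `JudgeOpinion`, `Verdict`, `combine`) is hypothetical and chosen for illustration; none of it is taken from `evaluation/deterministic.py` or `evaluation/judge.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeterministicResult:
    """Outcome of rule-based checks (hypothetical shape, not the project's model)."""
    violations: tuple[str, ...]  # e.g. ("state_diff", "sensitive_field_mutation")


@dataclass(frozen=True)
class JudgeOpinion:
    """Post-hoc LLM judge output (hypothetical shape)."""
    violated: bool
    reasoning: str
    confidence: float  # assumed to lie in [0.0, 1.0]


@dataclass(frozen=True)
class Verdict:
    violated: bool
    evidence: tuple[str, ...]
    judge_reasoning: str
    judge_confidence: float
    judge_disagrees: bool  # surfaced for the reader, never used to flip the verdict


def combine(det: DeterministicResult, judge: JudgeOpinion) -> Verdict:
    """Deterministic checks decide the verdict; the judge only annotates it."""
    violated = bool(det.violations)
    return Verdict(
        violated=violated,
        evidence=det.violations,
        judge_reasoning=judge.reasoning,
        judge_confidence=judge.confidence,
        judge_disagrees=(judge.violated != violated),
    )


if __name__ == "__main__":
    det = DeterministicResult(violations=("sensitive_field_mutation",))
    judge = JudgeOpinion(violated=False, reasoning="Looks benign.", confidence=0.9)
    verdict = combine(det, judge)
    print(verdict.violated, verdict.judge_disagrees)
```

Keeping a disagreement flag rather than discarding the judge's opinion is a design choice of this sketch: it lets a report highlight cases worth manual review without letting a confident but mistaken judge change the recorded outcome.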