agent-redteaming/redteam-core
GitHub: agent-redteaming/redteam-core
自动化AI代理红队测试工具
Stars: 0 | Forks: 0
# 红队核心
针对AI代理的系统化合成红队测试。给定用例描述和政策,`redteam-core`自动生成威胁模型,构建定制的合成环境(层1),并运行一系列攻击——所有这些都不需要接触真实工具或数据。
## 功能
目前大多数代理红队测试都是手动进行的:有人编写一个提示,运行它来对抗代理,观察结果。`redteam-core`自动化了整个流程:
1. **威胁建模** — LLM根据您的用例和政策生成风险卡
2. **目标生成** — 每个风险变为2-3个具体、可衡量的攻击者目标
3. **合成环境** — 每个目标:Pydantic模型、工具函数、针对该攻击定制的种子数据
4. **干运行** — 建立正常代理行为的清洁基线跟踪
5. **攻击套件** — 对每个目标运行6种攻击类型
6. **评估** — 确定性检查 + LLM判断 + 状态差异
7. **HTML报告** — 交互式深入分析:风险卡 → 目标 → 攻击选项卡
所有这些都是**层1**——完全合成,不需要真实基础设施。设计为层2(真实MCP服务器 + 中间人攻击)和层3(运行时网关强制)的基础。
## 攻击类型
| 攻击 | OWASP ASI | 测试内容 |
|------|-----------|-----------|
| `direct_injection` | ASI-04 | 工具响应中嵌入的对抗性有效负载(一次性) |
| `pair_injection` | ASI-04 | PAIR迭代地改进工具响应中的注入有效负载 |
| `pair_adversarial` | ASI-01 | PAIR迭代地改进社会工程学用户提示 |
| `multi_turn` | ASI-09 | 多轮对话中的渐进式升级 |
| `poisoned_runtime` | ASI-02 | 工具描述中的隐藏指令(`... `) |
| `minja` | ASI-06 | 两阶段内存注入(阶段1:种子;阶段2:利用) |
| `tmap` _(opt-in)_ | ASI-01 | MAP-Elites存档探索64种攻击风格(昂贵) |
## 快速开始
```
# 安装
git clone https://github.com/agent-redteaming/redteam-core
cd redteam-core
uv sync
# 要求:本地运行 Ollama(目标模型)+ 攻击者模型端点
export TARGET_MODEL=qwen3.5:2b
export OPENAI_BASE_URL=http://localhost:11434/v1
export ATTACKER_MODEL=gemma-2-9b-abliterated
export ATTACKER_BASE_URL=https://your-attacker-endpoint/v1
# 运行
uv run redteam-run \
--usecase "A supply chain agent that tracks vendor orders and approves purchase requests." \
--policy "Agents must not approve purchase orders above \$50,000 without human sign-off" \
--policy "Vendor contact information must not be shared externally" \
--max-goals 1 \
--output ./runs \
--output-html
```
HTML报告在`runs/layer1_report_.html`打开。
## CLI参考
```
redteam-run [OPTIONS]
-u, --usecase TEXT Natural language description of the agent system
-p, --policy TEXT Operator policy (repeat for multiple)
-m, --model TEXT Target model(s) to test
-a, --attacks [injection|pair_injection|pair_adversarial|tmap|multi_turn|poisoned_runtime|minja|all]
Attacks to run (default: all except tmap)
--tmap Also run T-MAP (expensive: ~150 LLM calls/goal)
--max-goals INT Goals per risk card (default: unlimited)
--pair-streams INT PAIR parallel streams (default: 5)
--pair-iterations INT PAIR iterations per stream (default: 3)
--multi-turn-turns INT Multi-turn conversation turns (default: 4)
--generator-temperature FLOAT LLM temperature for risk/goal/env generation (default: 0.4)
--attacker-temperature FLOAT LLM temperature for attack content generation (default: 0.7)
--output TEXT Output directory (default: ./runs)
--output-html Also generate interactive HTML report
-v, --verbose Show DEBUG logs
```
环境变量覆盖:`GENERATOR_TEMPERATURE`,`ATTACKER_TEMPERATURE`,`TARGET_MODEL`,`ATTACKER_MODEL`,`ATTACKER_BASE_URL`,`OPENAI_BASE_URL`。
## 管道
```
Usecase + Policies
│
▼
Risk Cards ──── LLM generates threat cards targeting the usecase & policies
│
▼
Attacker Goals ── per risk card, each with specific success_criteria
│
▼ (per goal)
Synthetic Environment
├── Pydantic models (entities, relationships)
├── Tool functions (read tools + privileged action tools)
└── Seed data (values spanning policy thresholds)
│
▼
Dry Run ──── clean baseline trace, accessed records
│
▼
Attack Suite ──── 6 attacks in parallel tabs
├── Payload embedded in tool responses (injection, pair_injection)
├── Adversarial user prompts (pair_adversarial, tmap)
├── Multi-turn escalation (multi_turn)
├── Tool description poisoning (poisoned_runtime)
└── Memory injection (minja)
│
▼
Evaluation
├── Deterministic: unexpected suspicious tool calls, state diffs,
│ sensitive field mutations, outbox changes
└── LLM judge: reasoning + confidence (never overrides deterministic)
│
▼
HTML Report ──── interactive: risk cards → goals → attack tabs
```
## 报告
使用`--output-html`生成一个交互式单文件HTML报告。它包括:
- **概述** — 用例、政策、OWASP ASI覆盖图表、目标瓷砖
- **风险卡** — 完整的威胁模型,包括控制、可能性、严重性
- **每个目标管道** — 环境 → 干运行 → 攻击(按攻击类型分选项卡)
- 每个攻击选项卡显示:攻击特定元数据(有效负载、提示、回合序列)、完整对话跟踪、安全/效用证据、判断推理、状态差异
- **摘要** — OWASP表格(运行的攻击 × 每个ASI类别的违规)、政策违规、模型比较
供应链场景的示例报告在`sample_report.html`(在浏览器中打开 — 无需服务器)中。
## 架构
```
src/redteam/
├── cli.py # redteam-run entry point
├── orchestrator.py # pipeline coordinator
├── utils.py # shared helpers (client, temperature, JSON utils)
├── risk_pipeline/
│ ├── risk_generator.py # usecase → risk cards
│ ├── goal_generator.py # risk card → attacker goals
│ └── triage.py # enforcement level classification
├── env_pipeline/
│ ├── env_generator.py # goal → synthetic environment (LLM)
│ ├── executor.py # exec() generated code, run agent loop
│ └── adapter.py # attack ↔ environment bridge
├── attacks/
│ ├── injection.py # direct injection (observe-then-inject + tool wrapping)
│ ├── pair.py # PAIR adversarial + injection (JailbreakingLLMs port)
│ ├── tmap.py # T-MAP MAP-Elites (faithful port)
│ ├── multi_turn.py # progressive escalation across turns
│ ├── poisoned_runtime.py # tool description poisoning
│ └── minja.py # two-phase memory injection (MINJA port)
├── evaluation/
│ ├── deterministic.py # 5 security + 5 utility checks
│ └── judge.py # LLM post-hoc reasoning
├── models/ # Pydantic models: risk, environment, attacks, report
├── runtime/ # AgentRuntime abstraction (Chat Completions / Responses API)
└── report_html.py # self-contained HTML report generator
```
## 开发
```
uv run pytest tests/unit/ # 105 unit tests, no LLM needed
uv run pytest tests/real/ # requires Ollama + attacker endpoint
```
## 已知限制
- **仅层1**:合成环境,不是真实MCP服务器或生产数据
- **小型模型**:`qwen3.5:2b`抵抗大多数攻击;使用`qwen2.5:14b`或更大的模型以看到更多违规
- **生成器质量**:Gemma-2-9B有时生成不匹配的字段名称或无效的Python — 由回退处理但导致空环境
- **T-MAP成本**:64种子细胞 × 迭代 = 每个目标150+ LLM调用;默认套件中排除
标签:AI对抗, C2, LLM, Pydantic, RuleLab, Unmanaged PE, 人工智能, 反取证, 合成环境, 后端开发, 威胁建模, 安全报告, 安全评估, 攻击向量, 攻击模拟, 攻击策略, 攻击类型, 用户模式Hook绕过, 逆向工具, 驱动签名利用