tsondo/givewell_redteam

GitHub: tsondo/givewell_redteam

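A proof-of-concept system that red-teams research analyses with a multi-agent pipeline, using architectural improvements to raise critique quality and eliminate hallucinations.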


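# GiveWell AI Red Teaming: Multi-Agent Pipeline

A proof-of-concept multi-agent pipeline for red-teaming GiveWell's cost-effectiveness analyses. It demonstrates that architectural improvements (decomposition, verification, quantification, adversarial testing) outperform single-pass prompting for research critique.

## Origin

GiveWell published a report on [AI red teaming](https://www.givewell.org/how-we-work/our-criteria/cost-effectiveness/ai-red-teaming-12-25) (January 2026) describing their experiments with ChatGPT 5 Pro and structured prompts. Their result: roughly 15–30% of the AI critiques were useful, with persistent problems of hallucination, lost context, and unreliable quantitative estimates.

## Thesis

GiveWell's limitations are architectural limitations, not model limitations. Decomposition, a verification pipeline, an adversarial agent structure, scoped context, and tool-augmented quantitative reasoning each improve results independently, and their effects compound.

## Results

Three interventions were analyzed using only public materials and commercial models (Claude Sonnet for most stages, Opus for decomposition and quantification):

| Intervention | Critiques | Survival rate | Novel findings | Cost |
|---|---|---|---|---|
| Water chlorination | 31 generated → 26 survived | 84% | 10 novel | ~$30 (v1) |
| Insecticide-treated nets | 30 generated → 30 survived | 100% | 9 novel | $16.00 |
| Seasonal malaria chemoprevention | 34 generated → 28 survived | 82% | 6 novel | $15.96 |

GiveWell's baseline: roughly 15–30% useful. Our target: >60%. All three runs exceeded it.

There were zero hallucinated citations across all runs: the architectural separation of hypothesis generation (Investigators) from evidence retrieval (Verifier) eliminated the fabrication problem.

For the full analysis, limitations, and cross-intervention patterns, see the [conclusions](docs/conclusions.md).

## Pipeline Architecture

```
1. DECOMPOSER (Opus)          → 11 investigation threads with scoped specs
2. INVESTIGATORS (Sonnet×11)  → ~30 candidate critiques with cited evidence
3. VERIFIER (Sonnet, batched) → Citations checked, claims grounded, or rejected
4. QUANTIFIER (Opus)          → Sensitivity analysis against actual CEA spreadsheets
5. ADVERSARIAL PAIR (Sonnet)  → Advocate + Challenger stress-test each critique
6. SYNTHESIZER (Opus)         → Ranked final report with evidence and impact estimates
```

Key design principles:

- **Public materials only as input.** This shows that the improvement comes from methodology, not privileged access.
- **Scoped context per agent.** Each agent receives a focused context document rather than the whole filing cabinet.
- **Verification is a first-class concern.** Every factual claim is independently checked via web search before it reaches a human.
- **Quantitative grounding.** Impact claims are tied to computed perturbations of the actual spreadsheet models.

## Repository Structure

```
├── README.md
├── docs/
│   ├── architecture.md        # Pipeline design document
│   ├── conclusions.md         # Results analysis, limitations, cross-intervention patterns
│   ├── analysis-givewell.md   # Critique of GiveWell's current approach
│   └── pipeline-spec.md       # Build spec with schemas and implementation details
├── prompts/                   # System prompts for each agent (7 files)
├── pipeline/
│   ├── run_pipeline.py        # Main orchestrator (--resume-from support)
│   ├── agents.py              # All agent callers (decomposer through synthesizer)
│   ├── spreadsheet.py         # CEA readers: WaterCEA, ITNCEA, MalariaCEA
│   ├── schemas.py             # Dataclasses for pipeline data flow
│   └── config.py              # API keys, model selection, cost thresholds
├── data/                      # CEA spreadsheets (.xlsx, read-only)
├── results/
│   ├── water-chlorination/    # All 6 stage outputs (.json + .md) + stats
│   ├── itns/                  # All 6 stage outputs (.json + .md) + stats
│   └── smc/                   # All 6 stage outputs (.json + .md) + stats
└── tests/                     # 62 tests (spreadsheet readers, parsers, schemas)
```

## Usage

```
# Install dependencies (anthropic, openpyxl, pandas, python-dotenv)
pip install -r requirements.txt

# Run a single intervention
python -m pipeline.run_pipeline water-chlorination
python -m pipeline.run_pipeline itns
python -m pipeline.run_pipeline smc

# Resume from a specific stage
python -m pipeline.run_pipeline itns --resume-from quantifier

# Run the tests
python -m pytest tests/ -v
```

An Anthropic API key must be configured in `.env`:

```
ANTHROPIC_API_KEY=sk-ant-...
```

## Constraints

- **No agent frameworks.** No LangChain, CrewAI, or AutoGen; only direct Anthropic SDK calls.
- **Four dependencies:** anthropic, openpyxl, pandas, python-dotenv.
- **Budget:** $50 total across all interventions. Actual spend: about $62 (the water chlorination v1 run went over budget before the verifier was optimized; later runs were within target).

## Key Files

| File | Purpose |
|---|---|
| `results/*/06-synthesizer.md` | Final red-team report for each intervention |
| `docs/conclusions.md` | Cross-intervention analysis and honest limitations |
| `docs/architecture.md` | Full pipeline specification |
| `pipeline/spreadsheet.py` | Replication of the three CEA formula chains, plus sensitivity analysis |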
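
To illustrate what sensitivity analysis against a cost-effectiveness model can look like, here is a minimal sketch. It does not reproduce the code in `pipeline/spreadsheet.py` or any GiveWell spreadsheet: the `ToyCEAInputs` fields, the formula, and the numbers are all hypothetical, chosen only to show the perturb-and-recompute idea.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyCEAInputs:
    """Hypothetical inputs; not taken from any GiveWell spreadsheet."""
    cost_per_person: float           # program cost per person reached (USD)
    baseline_mortality: float        # annual mortality risk without the program
    relative_risk_reduction: float   # fraction of that risk the program averts
    value_per_death_averted: float   # units of value assigned to one death averted


def units_of_value_per_dollar(inputs: ToyCEAInputs) -> float:
    """Toy formula chain: deaths averted per person, valued, divided by cost."""
    deaths_averted = inputs.baseline_mortality * inputs.relative_risk_reduction
    return deaths_averted * inputs.value_per_death_averted / inputs.cost_per_person


def sensitivity(inputs: ToyCEAInputs, field: str, factors: list[float]):
    """Scale one input by each factor and report the relative change in output."""
    base = units_of_value_per_dollar(inputs)
    rows = []
    for factor in factors:
        # dataclasses.replace builds a modified copy; the base case is untouched.
        perturbed = replace(inputs, **{field: getattr(inputs, field) * factor})
        result = units_of_value_per_dollar(perturbed)
        rows.append((factor, result, (result - base) / base))
    return rows


if __name__ == "__main__":
    base_inputs = ToyCEAInputs(
        cost_per_person=5.0,
        baseline_mortality=0.01,
        relative_risk_reduction=0.2,
        value_per_death_averted=100.0,
    )
    for factor, value, change in sensitivity(
        base_inputs, "relative_risk_reduction", [0.5, 1.0, 1.5]
    ):
        print(f"x{factor:.2f}: {value:.4f} units per dollar ({change:+.0%})")
```

Tying each critique to a computed change like this, rather than to a model's verbal estimate, is what the "quantitative grounding" principle refers to. A real CEA has a much longer formula chain, so the response to a perturbation is generally not proportional as it is in this toy model.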

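
The `--resume-from` option implies that the six stages run in a fixed order and that a run can restart partway through. The following is a rough sketch of that selection logic, not the code in `pipeline/run_pipeline.py`. Only the stage name `quantifier` and the file name `06-synthesizer.md` appear in this README; the other stage identifiers, the `STAGES` list, and both helper functions are assumptions made for illustration.

```python
# Assumed stage identifiers, in the order given under "Pipeline Architecture".
# Only "quantifier" is confirmed by the usage example above.
STAGES = [
    "decomposer",
    "investigators",
    "verifier",
    "quantifier",
    "adversarial",
    "synthesizer",
]


def stages_to_run(resume_from=None):
    """Return the stages to execute, starting at resume_from if it is given."""
    if resume_from is None:
        return list(STAGES)
    if resume_from not in STAGES:
        raise ValueError(f"unknown stage {resume_from!r}; expected one of {STAGES}")
    return STAGES[STAGES.index(resume_from):]


def stage_output_name(stage, extension):
    """Build a numbered output file name such as '06-synthesizer.md'."""
    position = STAGES.index(stage) + 1  # stages are numbered from 1
    return f"{position:02d}-{stage}.{extension}"


if __name__ == "__main__":
    for stage in stages_to_run("quantifier"):
        print(stage, "->", stage_output_name(stage, "json"))
```

Resuming at a later stage only makes sense if the earlier stages' outputs are already on disk, which is consistent with each results directory holding all six stage outputs.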