GitHub: dengxianghua888-ops/ecoalign-forge

An automated synthesis framework for DPO training data built on multi-agent gameplay: by simulating red-team attacks and reviewer disagreement, it generates high-quality content-moderation preference pairs with reasoning traces at low cost.

# EcoAlign-Forge

### The DPO Training-Data Factory That Never Sleeps

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/d592f81173060114.svg)](https://github.com/dengxianghua888-ops/ecoalign-forge/actions)
[![Dataset](https://img.shields.io/badge/🤗_Dataset-ecoalign--forge--dpo--zh-yellow)](https://huggingface.co/datasets/dengxianghua888-ops/ecoalign-forge-dpo-zh)
[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

**Feed in a safety policy. Get back thousands of high-quality DPO preference pairs.**

**No human annotators. No manual labeling. Just agents debating each other.**

[**中文文档**](README_zh.md) | [**Live report demo**](docs/demo_report.html)

**Try it now (no API key needed):**

```
pip install -e ".[all]" && python -m ecoalign_forge --demo
```
<details>
<summary>View demo output (click to expand)</summary>

```
============================================================
EcoAlign-Forge DEMO MODE
No API key needed — using pre-recorded agent responses
============================================================
16:14:28 INFO orchestrator: Starting pipeline run — 5 samples in batches of 10
16:14:28 INFO [ChaosCreator]  (DEMO) Generating 5 adversarial cases...
16:14:29 INFO [Moderator]     (DEMO) Reviewing 5 cases as naive junior reviewer...
16:14:30 INFO [SupremeJudge]  (DEMO) Judging 5 cases with guidelines...
16:14:30 INFO Batch 0 done: 5/5 cases, 3 DPO pairs
16:14:30 INFO IAA: kappa=0.444, alpha=0.494
============================================================
Pipeline Complete!
============================================================
Total cases:       5
Evaluations:       5
DPO pairs:         3
Avg quality:       0.40
Interception rate: 40.0%
Output: data/datasets/dpo_pairs_28eb5d03.jsonl

Sample DPO pair:
  Chosen:   {"has_stealth_marketing":true,  "reasoning_trace":"命中 A-001 + A-002..."}
  Rejected: {"has_stealth_marketing":false, "reasoning_trace":"看起来正常..."}
  Gap: 0.40
  Lineage: policy=default-v1, judge=openai/gpt-5.4
```

</details>
## The Problem

Training a content-moderation model takes thousands of **preference pairs**: "good judgment vs. bad judgment" examples for the same piece of content. Today that means:

- Hiring annotators at **$0.5–5 per label**
- Waiting **weeks** for a batch of 1,000 pairs
- Getting labels that are **inconsistent across annotators**
- Having **no idea** why an annotator chose "block" over "pass"

What if you could switch on a **factory** that produces labeled preference data 24/7, with full traceability, at **< $0.01 per pair**?

## The Solution

EcoAlign-Forge stages a **courtroom drama** in your terminal:

```
🔴 Red Team (ChaosCreator)
   "I crafted this sneaky ad disguised as a review."
         │
         ▼
🟡 Junior Reviewer (Moderator)
   "Hmm, looks fine to me... T2_Normal."
         │
         ▼
🟢 Supreme Judge
   "Nope. Rule A-002: homophone evasion for WeChat ID.
    This is stealth marketing. T1_Shadowban."
         │
         ▼
⚖️ Constitutional Reviewer
   "Let me double-check against the handbook...
    Yes, the Judge got it right."
         │
         ▼
📦 DPO Pair
   chosen         = Judge's ruling (with rule citations)
   rejected       = Moderator's naive guess
   preference_gap = 0.7
```

The **disagreement** between the Judge and the Moderator becomes your training signal. The Judge's rule-citing reasoning becomes `chosen`; the Moderator's gut reaction becomes `rejected`. Repeat, thousands of times over.

## Who Is It For?

| If you are a... | You want to... | EcoAlign-Forge helps you... |
|------------|---------------|---------------------------|
| **ML engineer** | Train a moderation model with DPO/RLHF | Generate training data that plugs straight into TRL / LLaMA-Factory |
| **Trust & Safety lead** | Scale content review without growing headcount | Generate labeled edge cases human reviewers would miss |
| **AI researcher** | Study red-teaming and adversarial robustness | Get a structured framework for generating and evaluating attacks |
| **Data scientist** | Build a data-quality flywheel | Get IAA metrics, quality scores, and adaptive sampling out of the box |

## See It for Yourself

### 1. One command to start

```
pip install -e ".[all]"
cp .env.example .env      # Add your LLM API key
python -m ecoalign_forge  # Watch the factory run
```

### 2. What you get

```
data/
├── datasets/
│   └── dpo_pairs_a1b2c3d4_20260410_120000.jsonl  # Your DPO training data
├── metrics.json          # Quality metrics
├── runs.jsonl            # Pipeline run history
├── flywheel_state.json   # Iteration tracking
└── report.html           # Visual quality report
```

### 3. Feed it to your trainer

```
from ecoalign_forge.export import export_trl

# Option A: classic TRL format
export_trl(pairs, "train.jsonl")

# Option B: TRL >= 0.8 conversational format
export_trl(pairs, "train.jsonl", conversational=True)

# Option C: LLaMA-Factory ShareGPT format
from ecoalign_forge.export import export_sharegpt
export_sharegpt(pairs, "train_sharegpt.json")
```

### 4. Monitor quality in real time

```
make dashboard  # Opens Streamlit at localhost:8501
```

## How It Works

### The pipeline: 4 stages + post-processing

```
┌─────────────────────────────────────────────────────────────────────┐
│                       AgentOrchestrator.run()                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─ AdaptiveSampler ────────────────────────────────────────────┐   │
│  │ "ai_slop is undersampled → boost T0/T1 ratio this batch"     │   │
│  └────────────────────────────────────────┬─────────────────────┘   │
│                                           ▼                         │
│  Stage 1  ┌──────────────┐  ChaosCase[]                             │
│           │ ChaosCreator │  "Here are 10 sneaky posts               │
│           │   (T=0.9)    │   targeting your policy gaps"            │
│           └──────┬───────┘                                          │
│                  ▼                                                  │
│  Stage 2  ┌──────────────┐  JudgeEvaluation[]                       │
│           │  Moderator   │  "I'm a naive reviewer,                  │
│           │   (T=0.5)    │   most of these look fine"               │
│           │  4 personas  │                                          │
│           └──────┬───────┘                                          │
│                  ▼                                                  │
│  Stage 3  ┌──────────────┐  JudgeEvaluation[] + DPO_Pair[]          │
│           │ SupremeJudge │  "Rule A-002 triggered.                  │
│           │   (T=0.2)    │   T1_Shadowban. Here's why."             │
│           └──────┬───────┘                                          │
│                  ▼                                                  │
│  Stage 4  ┌──────────────┐  Corrected evaluations                   │
│           │Constitutional│  "Double-checked against the handbook.   │
│           │   Reviewer   │   2 out of 10 judgments corrected."      │
│           └──────┬───────┘                                          │
│                  ▼                                                  │
│  Post     DataLineage injection → QualityScorer → IAA → FlyWheel    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### The secret sauce: deliberate disagreement

The Moderator **deliberately never reads the rulebook**. Its 4 personas make different kinds of mistakes:

| Persona | Behavior | What it produces |
|---------|----------|-------------------|
| `naive` | Goes with gut feeling | Balanced false positives/negatives |
| `strict_paranoid` | Blocks anything suspicious | Over-moderation training signal |
| `lax_overlooker` | Waves most content through | Under-moderation training signal |
| `keyword_matcher` | Only catches obvious keywords | Evasion-blindness training signal |

The Judge, armed with the full guidelines handbook, catches these mistakes. The **gap** between them is your DPO signal.

### Two kinds of training signal

| Signal type | When it fires | Strength | Example |
|-------------|------|----------|---------|
| **Direct disagreement** | Judge and Moderator pick different tiers | Strong (gap = severity difference) | Judge: T0_Block, Moderator: T2_Normal |
| **Reasoning quality** | Same tier, but the Judge cites 2+ rules and the Moderator cites 0 | Weak (gap = 0.3) | Both say T1, but the Judge explains *why* |

## Real-World Scenarios

### Scenario 1: Cold-start a content-moderation model

```
from ecoalign_forge.engine.orchestrator import AgentOrchestrator
from ecoalign_forge.schemas.policy import PolicyInput, PolicyDimension

policy = PolicyInput(
    policy_id="my-platform-v1",
    name="My Social Platform",
    dimensions=[
        PolicyDimension(name="stealth_marketing", description="Hidden ads and traffic diversion"),
        PolicyDimension(name="ai_slop", description="Low-effort AI-generated content"),
    ],
)

orch = AgentOrchestrator()
result = await orch.run(policy=policy, num_samples=1000)
# → 1000 cases processed, ~400 DPO pairs generated
# → exported to data/datasets/*.jsonl
```

### Scenario 2: Iterate with the data flywheel

```
from ecoalign_forge.engine.flywheel import FlyWheelOrchestrator

fw = FlyWheelOrchestrator(convergence_threshold=0.02)

# Round 1: the baseline model acts as the Moderator
result_r1 = await orch.run(policy, num_samples=500)
# → avg_quality=0.55, kappa=0.42

# Train your model on the Round 1 data...
```
```
# Then swap the trained model in as the new Moderator

# Round 2: the trained model catches more nuance
result_r2 = await orch.run(policy, num_samples=500)
# → avg_quality=0.72, kappa=0.61

fw.state.quality_improvement  # +30.9% — the flywheel is spinning
```

### Scenario 3: Audit your policy coverage

```
print(orch.metrics.uncovered_rules)
# → ['A-005', 'B-006']  ← these rules have zero test coverage

coverage = orch.sampler.analyze_coverage(orch._all_cases)
print(coverage.undersampled_combinations)
# → [('ai_slop', 'extreme')]  ← no extreme-difficulty AI-slop cases yet
```

### Scenario 4: Generate a quality report for stakeholders

```
from ecoalign_forge.reports import generate_html_report

generate_html_report(
    dataset_name="Q2 2026 Moderation Training Set",
    total_pairs=len(result.dpo_pairs),
    avg_quality=result.avg_quality_score,
    interception_rate=result.interception_rate,
    quality_distribution=[s.overall for s in quality_reports],
    output_path="q2_report.html",
)
# → self-contained HTML with KPI cards, charts, and coverage analysis
```

## Quality Assurance: Trust, but Verify

EcoAlign-Forge doesn't just generate data: it also tells you **how good** the data is:

| Metric | What it measures | Where to find it |
|--------|-----------------|------------------|
| **Cohen's Kappa** | Agreement between the Judge and each Moderator persona | `compute_batch_iaa()` |
| **Krippendorff's Alpha** | Multi-rater agreement (handles missing values) | `compute_batch_iaa()` |
| **5-dimension quality score** | Reasoning depth, information density, preference clarity, decision consistency, completeness | `QualityScorer.score()` |
| **Constitutional correction rate** | How often self-review catches mistakes | `constitutional.stats.correction_rate` |
| **Rule coverage** | Which policy rules were triggered | `metrics.rule_coverage` |
| **Data lineage** | Full provenance: which model, persona, policy version, guidelines hash | `DPO_Pair.lineage` |

## Taxonomy: Standing on the Shoulders of Giants

The attack taxonomy is not invented from scratch; it aligns with established frameworks:

| Framework | What we borrow | Where it lives |
|-----------|-----------------|----------------|
| [HarmBench](https://arxiv.org/abs/2402.04249) | 4 functional categories, 7 semantic domains | `taxonomy/harm_categories.py` |
| [OWASP LLM Top 10](https://owasp.org/www-project-top-10-for-large-language-model-applications/) | Vulnerability-to-category mapping | `HarmCategory.owasp_mapping` |
| [Evol-Instruct](https://arxiv.org/abs/2304.12244) | Depth evolution (add constraints) + breadth evolution (topic mutation) | `taxonomy/evol_strategies.py` |
| [PyRIT](https://github.com/Azure/PyRIT) | Orchestrator → Converter → Scorer pipeline pattern | `engine/orchestrator.py` |
| [Constitutional AI](https://arxiv.org/abs/2212.08073) | Self-critique → revision loop | `agents/constitutional.py` |

## Project Structure

```
ecoalign-forge/
├── src/ecoalign_forge/
│   ├── agents/                # The cast of characters
│   │   ├── chaos_creator.py     # The attacker (red team)
│   │   ├── moderator.py         # The naive reviewer (4 personas)
│   │   ├── supreme_judge.py     # The expert judge (cites rules)
│   │   └── constitutional.py    # The quality auditor (self-review)
│   ├── engine/                # The machinery
│   │   ├── orchestrator.py      # Runs the full pipeline
│   │   ├── flywheel.py          # Manages multi-round iteration
│   │   └── adaptive_sampler.py  # Adjusts sampling strategy
│   ├── schemas/               # The contracts (Pydantic v2)
│   ├── llm/                   # LLM client (LiteLLM, 100+ providers)
│   ├── storage/               # JSONL storage + metrics + IAA
│   ├── export/                # TRL / ShareGPT / HF Dataset Card
│   ├── quality/               # 5-dimension quality scorer
│   ├── taxonomy/              # HarmBench + OWASP attack taxonomy
│   └── reports/               # Self-contained HTML reports
├── dashboard/                 # Streamlit real-time monitoring
├── tests/                     # 199 tests (pytest + asyncio)
├── guidelines.md              # The "constitution" (judgment handbook)
└── examples/                  # Quick-start scripts + policy templates
```

## Tech Stack

| Layer | Technology | Why |
|-------|-----------|-----|
| **LLM** | LiteLLM | One interface for 100+ providers (OpenAI, Anthropic, local models) |
| **Data validation** | Pydantic v2 | Hard validation of rule IDs prevents hallucinated citations |
| **Async** | asyncio + Tenacity | Concurrent LLM calls with exponential-backoff retries |
| **Monitoring** | Streamlit + Plotly | Live dashboard with 5-second auto-refresh |
| **Testing** | pytest + asyncio | 199 tests covering every public API |
| **CI** | GitHub Actions | Lint (ruff) + tests on Python 3.11 & 3.12 |

## Development

```
make install    # Install dev dependencies
make test       # Run 199 tests with coverage
make lint       # Lint with ruff
make format     # Format with black + isort
make dashboard  # Launch Streamlit dashboard
```

## Acknowledgments

Built on ideas from: [TRL](https://github.com/huggingface/trl) | [UltraFeedback](https://github.com/OpenBMB/UltraFeedback) | [PyRIT](https://github.com/Azure/PyRIT) | [HarmBench](https://arxiv.org/abs/2402.04249) | [Constitutional AI](https://arxiv.org/abs/2212.08073) | [Arena Learning](https://arxiv.org/abs/2407.10627) | [Garak](https://github.com/NVIDIA/garak) | [Evol-Instruct](https://arxiv.org/abs/2304.12244)

## License

[Apache License 2.0](LICENSE)