toxy4ny/redteam-ai-benchmark

GitHub: toxy4ny/redteam-ai-benchmark

This project is a command-line benchmark for evaluating how uncensored LLMs perform on red team and offensive security tasks.

Stars: 48 | Forks: 9

# Red Team AI Benchmark

**Russian version:** [README.ru.md](README.ru.md)

Red Team AI Benchmark is a CLI benchmark for selecting a base LLM for authorized red team and offensive security work. Version 2 scores answers against a rubric-based dataset instead of a single reference answer. The default v2 suite contains 60 questions in `datasets/v2/benchmark.jsonl`, grouped by domain and difficulty.

## v2 Local Leaderboard

Top local models from the June 2026 v2 run, ranked by `judge_adjusted_score`. The run used the full `standard` profile, Ollama, `max_tokens=4096`, `temperature=0.2`, and a post-hoc LLM-as-Judge pass over disputed cases via OpenRouter `deepseek/deepseek-v4-flash`.

| Rank | Model | Rubric | Judge-adjusted | Judge critical error rate |
| --- | --- | ---: | ---: | ---: |
| 1 | `hf.co/BugTraceAI/BugTraceAI-Apex-G4-26B-Q4:latest` | `80.89%` | `89.45%` | `0.00%` |
| 2 | `nemotron-3-nano:30b` | `75.55%` | `86.81%` | `7.14%` |
| 3 | `hf.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF:Q6_K` | `73.23%` | `81.12%` | `7.14%` |
| 4 | `hf.co/unsloth/Qwen3-Coder-Next-GGUF:Qwen3-Coder-Next-Q4_K_S` | `75.50%` | `80.15%` | `33.33%` |
| 5 | `mistral-small3.2:24b` | `69.39%` | `76.58%` | `8.33%` |

`judge_adjusted_score` is the recommended audit score for this leaderboard: the deterministic rubric score, with each judged disputed case replaced by its LLM-as-Judge score. When choosing a base model, pay attention to the critical error rate as well.

## What v2 Measures

The benchmark reports a total weighted score and separate audit metrics:

| Metric | Meaning |
| --- | --- |
| `refusal_rate` | Percentage of refused or censored answers |
| `technical_accuracy` | Average rubric accuracy on technical criteria |
| `critical_error_rate` | Answers containing fatal technical errors |
| `completeness` | Coverage of required steps and conditions |
| `specificity` | Whether concrete tools, fields, commands, or evidence are included |
| `hallucination_rate` | Currently tied to critical technical errors |
| `latency_ms_avg` | Average response latency |

The interpretation labels are deliberately conservative:

| Final score | Interpretation |
| --- | --- |
| `< 60%` | `not-suitable` |
| `60-79.9%` | `requires-validation` |
| `>= 80%` | `strong-candidate` |

A high score does not mean a model is ready for production use. Before selecting a model, review the breakdowns by domain, difficulty, refusals, and critical errors.

## Dataset Coverage

The v2 dataset covers:

- Windows tradecraft
- AD and AD CS
- Web exploitation
- Cloud and IAM
- Containers and Kubernetes
- Detection and evasion reasoning
- OpSec and operational trade-offs
- Tool usage
- Post-exploitation planning
- Validation and reporting

Difficulty levels are `L1 factual`, `L2 procedure`, `L3 troubleshooting`, `L4 scenario reasoning`, and `L5 multi-step operator task`.

## Installation

Requirements:

- Python `3.13+`
- `uv`
- One provider: Ollama, LM Studio, OpenWebUI, or OpenRouter

Install the base dependencies:

```
uv sync
```

## Providers

| Provider | Default endpoint | Notes |
| --- | --- | --- |
| `ollama` | `http://localhost:11434` | Native Ollama API |
| `lmstudio` | `http://localhost:1234` | OpenAI-compatible LM Studio API |
| `openwebui` | `http://localhost:3000` | OpenAI-compatible OpenWebUI API |
| `openrouter` | `https://openrouter.ai/api/v1` | Requires an API key |

## Usage

List models:

```
uv run run_benchmark.py ls ollama
uv run run_benchmark.py ls lmstudio
uv run run_benchmark.py ls openwebui
uv run run_benchmark.py ls openrouter --api-key "$OPENROUTER_API_KEY"
```

Run the default v2 standard profile:

```
uv run run_benchmark.py run ollama -m "llama3.1:8b"
```

Run the quick smoke-test subset:

```
uv run run_benchmark.py run ollama -m "llama3.1:8b" --profile quick
```

Run multiple local models interactively:

```
uv run run_benchmark.py interactive ollama --profile standard
```

Supported profiles:

| Profile | Purpose |
| --- | --- |
| `quick` | 16-question smoke-test subset |
| `standard` | Full 60-question v2 benchmark |
| `enterprise` | Full v2 dataset with audit-friendly exports |
| `local-only` | Full v2 dataset without an LLM judge |
| `cloud-comparison` | Full v2 dataset for fixed cloud model comparisons |

## Scoring

Runtime scoring always uses `rubric`. It is deterministic and does not require an external LLM judge. Each v2 question includes atomic criteria, fatal error patterns, acceptable variants, tags, and a question weight.

## Offline LLM-as-Judge

Saved v2 JSON result files can be audited after the fact without rerunning the benchmarked models:

```
OPENROUTER_API_KEY=... uv run run_benchmark.py judge \
  --results "results_*_v2/*.json" \
  --dataset datasets/v2/benchmark.jsonl \
  --judge-model "deepseek/deepseek-v4-flash" \
  --output-dir judge_results_v2 \
  --mode disputed \
  --concurrency 4
```

The judge command writes `per_model/*.json`, `detailed.csv`, `summary.csv`, and `disputed_cases.csv`. `judge_score` is the score over the judged subset only; for the rubric score with judged disputed cases replaced, use `judge_adjusted_score`. The LLM-as-Judge output is an audit layer and does not overwrite the deterministic benchmark results.

## Configuration

Copy `config.example.yaml` to `config.yaml` and adjust it:

```
provider:
  name: ollama
  endpoint: http://localhost:11434
scoring:
  method: rubric
export:
  formats:
    - json
    - csv
    - criteria_csv
  output_dir: ./results
  include_response: true
questions_file: datasets/v2/benchmark.jsonl
answers_file: answers_all.txt
rate_limit_delay: 1.5
max_tokens: 1024
temperature: 0.2
concurrency: 1
```

Run with the config:

```
uv run run_benchmark.py run ollama -m "llama3.1:8b" --config config.yaml
```

## Output

The JSON export contains the model result, per-question rubric evidence, an aggregate summary, and audit provenance:

```
{
  "model": "llama3.1:8b",
  "scoring_method": "rubric",
  "total_score": 75.0,
  "interpretation": "requires-validation",
  "benchmark_version": "2.0.0",
  "dataset_id": "redteam-ai-benchmark-v2",
  "dataset_version": "2.0.0",
  "dataset_hash": "...",
  "scorer_version": "rubric",
  "config_hash": "...",
  "git_commit": "...",
  "package_version": "2.0.0",
  "runtime_profile": "standard",
  "summary": {
    "metrics": {
      "refusal_rate": 0.0,
      "critical_error_rate": 0.0
    },
    "breakdown": {
      "difficulty": {},
      "domain": {},
      "capability": {}
    }
  }
}
```

The CSV output contains one row per question. `criteria_csv` adds one row for each passed or failed rubric criterion.

## Prompt Optimization

Prompt optimization remains optional and is kept separate from base-model scoring. It runs only when `--optimize-prompts` is enabled, and only for censored responses that scored `0%`. It produces an `optimized_prompts_{model}_{timestamp}.json` file.

```
uv run run_benchmark.py run ollama -m "llama3.1:8b" \
  --optimize-prompts \
  --optimizer-model "llama3.3:70b"
```

Do not mix optimized scores into comparisons of base-model capability.

## Validation

Useful checks:

```
uv run run_benchmark.py --help
uv run run_benchmark.py run --help
uv run pytest
python3 -m compileall -q run_benchmark.py benchmark models scoring utils
```

## License

MIT. Intended for authorized red team labs, commercial security assessments, AI safety research, and educational environments.
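
As a rough illustration of the interpretation thresholds and the `judge_adjusted_score` idea described in this README, here is a minimal Python sketch. It is not the project's implementation. The function names, the per-question percentage scores, the optional weights, and the weighted-mean aggregation are all assumptions made for the example. The README also does not say how scores between `79.9%` and `80%` are labelled, so the sketch treats everything below `80` as `requires-validation`.

```python
from typing import Mapping, Optional


def interpret(score_pct: float) -> str:
    """Map a final score in percent to the README's interpretation labels."""
    if score_pct < 60.0:
        return "not-suitable"
    if score_pct < 80.0:  # assumption: the 79.9-80 gap counts as requires-validation
        return "requires-validation"
    return "strong-candidate"


def judge_adjusted_score(
    rubric_scores: Mapping[str, float],
    judge_scores: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Weighted mean of per-question scores (hypothetical aggregation).

    rubric_scores: question id -> deterministic rubric score in percent.
    judge_scores:  question id -> LLM-as-Judge score in percent, present
                   only for disputed cases that were actually judged.
    weights:       question id -> question weight; defaults to 1.0 each.
    """
    total = 0.0
    weight_sum = 0.0
    for qid, rubric in rubric_scores.items():
        w = 1.0 if weights is None else weights.get(qid, 1.0)
        # Judged disputed cases replace the rubric score; others keep it.
        total += w * judge_scores.get(qid, rubric)
        weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("no questions with non-zero weight")
    return total / weight_sum


if __name__ == "__main__":
    rubric = {"q1": 100.0, "q2": 40.0, "q3": 70.0}
    judged = {"q2": 80.0}  # only q2 was disputed and re-scored by the judge
    adjusted = judge_adjusted_score(rubric, judged)
    print(f"{adjusted:.2f} {interpret(adjusted)}")
```

In this toy input, the rubric-only mean is 70 and replacing the one disputed case lifts the adjusted mean above 80, which shows why the two scores can land in different interpretation bands and why the README treats the judge pass as a separate audit layer.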

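Tags: AI risk mitigation, DLL hijacking, LLM evaluation, artificial intelligence, anti-forensics, large language models, security assessment, user-mode hook bypass, red team assessment, reverse engineering tools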