openai/simple-evals
GitHub: openai/simple-evals
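A lightweight language model evaluation library open-sourced by OpenAI. It transparently evaluates mainstream models on several core benchmarks using standardized zero-shot, chain-of-thought tests.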
Stars: 4524 | Forks: 491
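# ⚠️ Deprecation Notice
**July 2025**: `simple-evals` will no longer be updated for new models or benchmark results. The repo will continue to host reference implementations for **HealthBench**, **BrowseComp**, and **SimpleQA**.
# Overview
This repository contains a lightweight library for evaluating language models.
We are open sourcing it so we can be transparent about the accuracy numbers we publish alongside our latest models.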
## Benchmark Results

| Model | Prompt | MMLU | GPQA [^8] | MATH [^6] | HumanEval | MGSM [^5] | DROP [^5] (F1, 3-shot) | SimpleQA |
|:----------------------------:|:-------------:|:------:|:------:|:--------:|:---------:|:------:|:--------------------------:|:---------:|
| **o3** | | | | | | | | |
| o3-high [^10] | n/a [^7] | 93.3 | 83.4 | 98.1 | 88.4 | 92.0 | 89.8 | 48.6 |
| o3 [^9] [^10] | n/a | 92.9 | 82.8 | 97.8 | 87.4 | 92.3 | 80.6 | 49.4 |
| o3-low [^10] | n/a | 92.8 | 78.6 | 96.9 | 87.3 | 91.9 | 82.3 | 49.4 |
| **o4-mini** | | | | | | | | |
| o4-mini-high [^9] [^10] | n/a | 90.3 | 81.3 | 98.2 | 99.3 | 93.5 | 78.1 | 19.3 |
| o4-mini [^9] [^10] | n/a | 90.0 | 77.6 | 97.5 | 97.3 | 93.7 | 77.7 | 20.2 |
| o4-mini-low [^10] | n/a | 89.5 | 73.6 | 96.2 | 95.9 | 93.0 | 76.0 | 20.2 |
| **o3-mini** | | | | | | | | |
| o3-mini-high | n/a | 86.9 | 77.2 | 97.9 | 97.6 | 92.0 | 80.6 | 13.8 |
| o3-mini | n/a | 85.9 | 74.9 | 97.3 | 96.3 | 90.8 | 79.2 | 13.4 |
| o3-mini-low | n/a | 84.9 | 67.6 | 95.8 | 94.5 | 89.4 | 77.6 | 13.0 |
| **o1** | | | | | | | | |
| o1 | n/a | 91.8 | 75.7 | 96.4 | - | 89.3 | 90.2 | 42.6 |
| o1-preview | n/a | 90.8 | 73.3 | 85.5 | 92.4 | 90.8 | 74.8 | 42.4 |
| o1-mini | n/a | 85.2 | 60.0 | 90.0 | 92.4 | 89.9 | 83.9 | 07.6 |
| **GPT-4.1** | | | | | | | | |
| gpt-4.1-2025-04-14 | assistant [^2] | 90.2 | 66.3 | 82.1 | 94.5 | 86.9 | 79.4 | 41.6 |
| gpt-4.1-mini-2025-04-14 | assistant | 87.5 | 65.0 | 81.4 | 93.8 | 88.2 | 81.0 | 16.8 |
| gpt-4.1-nano-2025-04-14 | assistant | 80.1 | 50.3 | 62.3 | 87.0 | 73.0 | 82.2 | 07.6 |
| **GPT-4o** | | | | | | | | |
| gpt-4o-2024-11-20 | assistant | 85.7 | 46.0 | 68.5 | 90.2 | 90.3 | 81.5 | 38.8 |
| gpt-4o-2024-08-06 | assistant | 88.7 | 53.1 | 75.9 | 90.2 | 90.0 | 79.8 | 40.1 |
| gpt-4o-2024-05-13 | assistant | 87.2 | 49.9 | 76.6 | 91.0 | 89.9 | 83.7 | 39.0 |
| gpt-4o-mini-2024-07-18 | assistant | 82.0 | 40.2 | 70.2 | 87.2 | 87.0 | 79.7 | 09.5 |
| **GPT-4.5-preview** | | | | | | | | |
| gpt-4.5-preview-2025-02-27 | assistant | 90.8 | 69.5 | 87.1 | 88.6 | 86.9 | 83.4 | 62.5 |
| **GPT-4 Turbo and GPT-4** | | | | | | | | |
| gpt-4-turbo-2024-04-09 | assistant | 86.7 | 49.3 | 73.4 | 88.2 | 89.6 | 86.0 | 24.2 |
| gpt-4-0125-preview | assistant | 85.4 | 41.4 | 64.5 | 86.6 | 85.1 | 81.5 | n/a |
| gpt-4-1106-preview | assistant | 84.7 | 42.5 | 64.3 | 83.7 | 87.1 | 83.2 | n/a |
| **Other Models (Reported)** | | | | | | | | |
| [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet) | unknown | 88.3 | 59.4 | 71.1 | 92.0 | 91.6 | 87.1 | 28.9 |
| [Claude 3 Opus](https://www.anthropic.com/news/claude-3-family) | unknown | 86.8 | 50.4 | 60.1 | 84.9 | 90.7 | 83.1 | 23.5 |
| [Llama 3.1 405b](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) | unknown | 88.6 | 50.7 | 73.8 | 89.0 | 91.6 | 84.8 | n/a |
| [Llama 3.1 70b](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) | unknown | 82.0 | 41.7 | 68.0 | 80.5 | 86.9 | 79.6 | n/a |
| [Llama 3.1 8b](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) | unknown | 68.4 | 30.4 | 51.9 | 72.6 | 68.9 | 59.5 | n/a |
| [Grok 2](https://x.ai/blog/grok-2) | unknown | 87.5 | 56.0 | 76.1 | 88.4 | n/a | n/a | n/a |
| [Grok 2 mini](https://x.ai/blog/grok-2) | unknown | 86.2 | 51.0 | 73.0 | 85.7 | n/a | n/a | n/a |
| [Gemini 1.0 Ultra](https://goo.gle/GeminiV1-5) | unknown | 83.7 | n/a | 53.2 | 74.4 | 79.0 | 82.4 | n/a |
| [Gemini 1.5 Pro](https://goo.gle/GeminiV1-5) | unknown | 81.9 | n/a | 58.5 | 71.9 | 88.7 | 78.9 | n/a |
| [Gemini 1.5 Flash](https://goo.gle/GeminiV1-5) | unknown | 77.9 | 38.6 | 40.9 | 71.5 | 75.5 | 78.4 | n/a |

## Background
Evals are sensitive to prompting, and the formulations used in recent publications and libraries vary significantly.
Some use few-shot prompts or role-playing prompts ("You are an expert software programmer...").
These approaches are carryovers from evaluating *base models* (rather than instruction/chat-tuned models) and from models that were worse at following instructions.

For this library, we emphasize the *zero-shot, chain-of-thought* setting, with simple instructions like "Solve the following multiple choice problem". We believe this prompting technique better reflects the models' performance in realistic usage.

**We will not be actively maintaining this repository or monitoring PRs and Issues.** In particular, we are not accepting new evals. Here are the changes we might accept.
- Bug fixes (hopefully not needed!)
- Adding adapters for new models
- Adding new rows to the benchmark results table with eval results, given new models and new system prompts.

This repository is not intended as a replacement for https://github.com/openai/evals, which is designed to be a comprehensive collection of a large number of evals.

## Evals
This repository currently contains the following evals:
- MMLU: Measuring Massive Multitask Language Understanding, reference: https://arxiv.org/abs/2009.03300, https://github.com/hendrycks/test, [MIT License](https://github.com/hendrycks/test/blob/master/LICENSE)
- MATH: Measuring Mathematical Problem Solving With the MATH Dataset, reference: https://arxiv.org/abs/2103.03874, https://github.com/hendrycks/math, [MIT License](https://github.com/idavidrein/gpqa/blob/main/LICENSE)
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark, reference: https://arxiv.org/abs/2311.12022, https://github.com/idavidrein/gpqa/, [MIT License](https://github.com/idavidrein/gpqa/blob/main/LICENSE)
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs, reference: https://arxiv.org/abs/1903.00161, https://allenai.org/data/drop, [Apache License 2.0](https://github.com/allenai/allennlp-models/blob/main/LICENSE)
- MGSM: Multilingual Grade School Math Benchmark (MGSM), Language Models are Multilingual Chain-of-Thought Reasoners, reference: https://arxiv.org/abs/2210.03057, https://github.com/google-research/url-nlp, [Creative Commons Attribution 4.0 International Public License (CC-BY)](https://github.com/google-research/url-nlp/blob/main/LICENSE)
- HumanEval: Evaluating Large Language Models Trained on Code, reference: https://arxiv.org/abs/2107.03374, https://github.com/openai/human-eval, [MIT License](https://github.com/openai/human-eval/blob/master/LICENSE)
- SimpleQA: Measuring short-form factuality in large language models, reference: https://openai.com/index/introducing-simpleqa, [MIT License](https://github.com/openai/simple-evals/blob/main/LICENSE)
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents, reference: https://openai.com/index/browsecomp, [MIT License](https://github.com/openai/simple-evals/blob/main/LICENSE)
- HealthBench: Evaluating Large Language Models Towards Improved Human Health, reference: https://openai.com/index/healthbench, [MIT License](https://github.com/openai/simple-evals/blob/main/LICENSE)

## Samplers
We have implemented sampling interfaces for the following language model APIs:
- OpenAI: https://platform.openai.com/docs/overview
- Claude: https://www.anthropic.com/api

Make sure to set the `*_API_KEY` environment variables before using these APIs.

## Setup
Because of the optional dependencies, we do not provide a unified setup mechanism. Instead, we provide instructions for each eval and sampler.

For [HumanEval](https://github.com/openai/human-eval/) (Python programming):
```
git clone https://github.com/openai/human-eval
pip install -e human-eval
```
For the [OpenAI API](https://pypi.org/project/openai/):
```
pip install openai
```
For the [Anthropic API](https://docs.anthropic.com/claude/docs/quickstart-guide):
```
pip install anthropic
```

## Running the evals
```
python -m simple-evals.simple_evals --list-models
```
This lists all the models that you can evaluate.

To run the evaluations, you can use the following command:
```
python -m simple-evals.simple_evals --model <model_name> --examples <num_examples>
```
This launches the evaluations through the OpenAI API.
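
As a pre-flight step, the API key requirement and the run command can be checked from a small script. The following is a minimal sketch, not part of this repository: the helper names `missing_api_keys` and `build_eval_command` are hypothetical, and it assumes the samplers rely on `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, the variables the official `openai` and `anthropic` Python SDKs read by default.

```python
import os
import sys

# Assumption: these are the variables the official SDKs read by default;
# the repository itself only says to set `*_API_KEY`.
REQUIRED_KEYS = {"openai": "OPENAI_API_KEY", "claude": "ANTHROPIC_API_KEY"}


def missing_api_keys(providers, env=None):
    """Return the env var names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [REQUIRED_KEYS[p] for p in providers if not env.get(REQUIRED_KEYS[p])]


def build_eval_command(model_name, num_examples, python=sys.executable):
    """Assemble the documented run command as an argument list."""
    return [
        python, "-m", "simple-evals.simple_evals",
        "--model", model_name,
        "--examples", str(num_examples),
    ]


if __name__ == "__main__":
    missing = missing_api_keys(["openai"])
    if missing:
        print("Set these environment variables first:", ", ".join(missing))
    else:
        # "some-model" is a placeholder; use a name printed by --list-models.
        print(" ".join(build_eval_command("some-model", 10)))
```

The script only prints the command instead of executing it, so it is safe to run before any API credit is spent; the resulting list can be handed to `subprocess.run` once the keys are in place.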
## Notes
[^1]: chatgpt system message: "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture.\nKnowledge cutoff: 2023-12\nCurrent date: 2024-04-01"
[^2]: assistant system message in [OpenAI API doc](https://platform.openai.com/docs/api-reference/introduction): "You are a helpful assistant."
[^3]: claude-3 empty system message: suggested by Anthropic API doc, and we have done limited experiments due to [rate limit](https://docs.anthropic.com/claude/reference/rate-limits) issues, but we welcome PRs with alternative choices.
[^4]: claude-3 lmsys system message: system message in LMSYS [Fast-chat open source code](https://github.com/lm-sys/FastChat/blob/7899355ebe32117fdae83985cf8ee476d2f4243f/fastchat/conversation.py#L894): "The assistant is Claude, created by Anthropic. The current date is {{currentDateTime}}. Claude's knowledge base was last updated ... ". We have done limited experiments due to [rate limit](https://docs.anthropic.com/claude/reference/rate-limits) issues, but we welcome PRs with alternative choices.
[^5]: We believe these evals are saturated for our newer models, but are reporting them for completeness.
[^6]: For newer models (anything on or after o1) we evaluate on [MATH-500](https://github.com/openai/prm800k/tree/main/prm800k/math_splits), which is a newer, IID version of MATH.
[^7]: o-series models do not support using a system prompt.
[^8]: Includes an answer regex tweak for GPQA benchmark.
[^9]: The default reasoning level for o3-mini is "medium".
[^10]: These results are with no tools enabled for o3 or o4-mini.
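
The zero-shot, chain-of-thought setting and the answer regex mentioned in the notes both come down to the same mechanics: give the model one plain instruction, let it reason freely, then parse a final answer out of the free-form text. The sketch below illustrates that idea under stated assumptions. The template wording, the `Answer: X` convention, the regex, and the function names are all hypothetical and are not taken from this repository's code.

```python
import re

# Hypothetical zero-shot, chain-of-thought template (not the repository's own).
PROMPT_TEMPLATE = """Solve the following multiple choice problem. Think step by step, then finish with a line of the form "Answer: X", where X is one of A, B, C, D.

{question}

A) {a}
B) {b}
C) {c}
D) {d}"""

# Case-insensitive; allows optional whitespace around the colon and requires
# the letter to stand alone so that words like "because" do not match.
ANSWER_PATTERN = re.compile(r"(?i)answer\s*:\s*([A-D])\b")


def format_prompt(question, choices):
    """Fill the template with a question and exactly four answer choices."""
    a, b, c, d = choices
    return PROMPT_TEMPLATE.format(question=question, a=a, b=b, c=c, d=d)


def extract_answer(response):
    """Return the letter from the last 'Answer: X' in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def score(response, correct_letter):
    """1.0 if the extracted letter equals the reference answer, else 0.0."""
    return 1.0 if extract_answer(response) == correct_letter else 0.0
```

Taking the last match rather than the first is a deliberate choice in this sketch: a chain-of-thought response may mention candidate answers while reasoning, and only the final line is meant to count. Small changes to such a pattern can shift measured accuracy, which is one reason reported numbers depend on the exact parsing rules.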
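## Legal Stuff
By contributing to evals, you agree to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.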