openai/mle-bench

GitHub: openai/mle-bench

A machine learning engineering benchmark from OpenAI for quantitatively evaluating how well AI agents perform ML engineering on real Kaggle competition tasks.

Stars: 1659 | Forks: 258

# MLE-bench

Code for the paper ["MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering"](https://arxiv.org/abs/2410.07095). We have released the code used to construct the dataset, the evaluation logic, and the agents we evaluated on this benchmark.

## Leaderboard

*Update* (04-24-2026): We are not currently accepting new submissions to the leaderboard while we develop an improved process to ensure that submissions are fair and comparable. We will share updates on this process in the future.

| Agent | LLM(s) used | Low == Lite (%) | Medium (%) | High (%) | All (%) | Running Time (hours) | Date | Source Code Available | Grading Reports Available |
|-------|-------------|-----------------|------------|----------|---------|----------------------|------|----------------------|---------------------------|
| [Famou-Agent 2.0](https://github.com/baidubce/FM-Agent) | Gemini-3-Pro-Preview | 80.3 ± 1.52 | 64.04 ± 2.32 | 42.22 ± 2.22 | 64.44 ± 1.18 | 24 | 2026-02-23 | X | ✓ |
| [AIBuildAI](https://github.com/aibuildai/AI-Build-AI) | Claude-Opus-4.6 | 77.27 ± 0.00 | 61.40 ± 0.88 | 46.67 ± 0.00 | 63.11 ± 0.44 | 24 | 2026-03-06 | X | ✓ |
| [CAIR](https://research.google/teams/cloud-ai-research/) [MARS+](https://arxiv.org/pdf/2602.02660) | Gemini-3-Pro-Preview | 78.79 ± 1.52 | 60.53 ± 1.52 | 44.44 ± 2.22 | 62.67 ± 0.77 | 24 | 2026-02-17 | X | ✓ |
| [MLEvolve](https://github.com/InternScience/MLEvolve) | Gemini-3-Pro-Preview | 80.30 ± 1.52 | 57.89 ± 1.52 | 42.22 ± 2.22 | 61.33 ± 1.33 | 12 | 2026-02-14 | ✓ | ✓ |
| [PiEvolve](https://github.com/FractalAIResearchLabs/PiEvolve) (Fractal AI Research) | Gemini-3-Pro-Preview[^4] | 80.30 ± 1.52[^3] | 58.77 ± 0.88[^3] | 40.0 ± 0.00[^3] | 61.33 ± 0.77[^3] | 24 | 2026-01-05 | X | ✓ |
| [Famou-Agent 2.0](https://github.com/baidubce/FM-Agent) | Gemini-2.5-Pro | 75.76 ± 1.52 | 57.89 ± 1.52 | 40.00 ± 0.00 | 59.56 ± 0.89 | 24 | 2025-12-27 | X | ✓ |
| [ML-Master 2.0](https://github.com/sjtu-sai-agents/ML-Master) | Deepseek-V3.2-Speciale | 75.76 ± 1.51 | 50.88 ± 3.51 | 42.22 ± 2.22 | 56.44 ± 2.47 | 24 | 2025-12-16 | X | ✓ |
| [CAIR](https://research.google/teams/cloud-ai-research/) [MARS](https://arxiv.org/pdf/2602.02660) | Gemini-3-Pro-Preview | 74.24 ± 1.52 | 52.63 ± 3.04 | 37.78 ± 2.22 | 56.0 ± 1.54 | 24 | 2026-01-25 | X | ✓ |
| [PiEvolve](https://github.com/FractalAIResearchLabs/PiEvolve) (Fractal AI Research) | Gemini-3-Pro-Preview[^4] | 74.24 ± 3.03[^3] | 45.61 ± 0.88[^3] | 35.55 ± 2.22[^3] | 52.0 ± 0.77[^3] | 12 | 2026-01-05 | X | ✓ |
| [Leeroo](https://github.com/Leeroo-AI/kapso) | Gemini-3-Pro-Preview[^4] | 68.18 ± 2.62[^3] | 44.74 ± 1.52[^3] | 40.00 ± 0.00[^3] | 50.67 ± 1.33[^3] | 24 | 2025-12-07 | ✓ | ✓ |
| [Thesis](https://thesislabs.ai) | gpt-5-codex | 65.15 ± 1.52 | 45.61 ± 7.18 | 31.11 ± 2.22 | 48.44 ± 3.64 | 24 | 2025-11-10 | X | ✓ |
| [CAIR](https://research.google/teams/cloud-ai-research/) MLE-STAR-Pro-1.5 | Gemini-2.5-Pro | 68.18 ± 2.62 | 34.21 ± 1.52 | 33.33 ± 0.00 | 44.00 ± 1.33 | 24 | 2025-11-25 | X | ✓ |
| [Famou-Agent](https://github.com/baidubce/FM-Agent) | Gemini-2.5-Pro | 62.12 ± 1.52 | 36.84 ± 1.52 | 33.33 ± 0.00 | 43.56 ± 0.89 | 24 | 2025-10-10 | X | ✓ |
| [Operand](https://operand.com) ensemble | gpt-5 (low verbosity/effort)[^2] | 63.64 ± 0.00 | 33.33 ± 0.88[^3] | 20.00 ± 0.00[^3] | 39.56 ± 0.44[^3] | 24 | 2025-10-06 | X | ✓ |
| [CAIR](https://research.google/teams/cloud-ai-research/) MLE-STAR-Pro-1.0 | Gemini-2.5-Pro | 66.67 ± 1.52 | 25.44 ± 0.88 | 31.11 ± 2.22 | 38.67 ± 0.77 | 12 | 2025-11-03 | X | ✓ |
| [InternAgent](https://github.com/Alpha-Innovator/InternAgent/) | deepseek-r1 | 62.12 ± 3.03 | 26.32 ± 2.63 | 24.44 ± 2.22 | 36.44 ± 1.18 | 12 | 2025-09-12 | X | ✓ |
| [R&D-Agent](https://github.com/microsoft/RD-Agent) | gpt-5 | 68.18 ± 2.62 | 21.05 ± 1.52 | 22.22 ± 2.22 | 35.11 ± 0.44 | 12 | 2025-09-26 | ✓ | ✓ |
| [Neo](https://heyneo.so/) multi-agent | undisclosed | 48.48 ± 1.52 | 29.82 ± 2.32 | 24.44 ± 2.22 | 34.22 ± 0.89 | 36 | 2025-07-28 | X | ✓ |
| [AIRA-dojo](https://github.com/facebookresearch/aira-dojo/) | o3 | 55.00 ± 1.47 | 21.97 ± 1.17 | 21.67 ± 1.07 | 31.60 ± 0.82 | 24 | 2025-05-15 | ✓ | ✓ |
| [R&D-Agent](https://github.com/microsoft/RD-Agent) | o3 + GPT-4.1 | 51.52 ± 4.01 | 19.30 ± 3.16 | 26.67 ± 0.00 | 30.22 ± 0.89 | 24 | 2025-08-15 | ✓ | ✓ |
| [ML-Master](https://github.com/zeroxleo/ML-Master) | deepseek-r1 | 48.48 ± 1.52 | 20.18 ± 2.32 | 24.44 ± 2.22 | 29.33 ± 0.77 | 12 | 2025-06-17 | ✓ | ✓ |
| [R&D-Agent](https://github.com/microsoft/RD-Agent) | o1-preview | 48.18 ± 1.11 | 8.95 ± 1.05 | 18.67 ± 1.33 | 22.40 ± 0.50 | 24 | 2025-05-14 | ✓ | ✓ |
| [AIDE](https://github.com/wecoai/aideml) | o1-preview | 35.91 ± 1.86 | 8.45 ± 0.43 | 11.67 ± 1.27 | 17.12 ± 0.61 | 24 | 2024-10-08 | ✓ | ✓ |
| [AIDE](https://github.com/wecoai/aideml) | gpt-4o-2024-08-06 | 18.55 ± 1.26 | 3.06 ± 0.33 | 8.15 ± 0.84 | 8.63 ± 0.54 | 24 | 2024-10-08 | ✓ | ✓ |
| [AIDE](https://github.com/wecoai/aideml) | claude-3-5-sonnet-20240620 | 19.70 ± 1.52 | 2.63 ± 1.52 | 2.22 ± 2.22 | 7.56 ± 1.60 | 24 | 2024-10-08 | ✓ | ✓ |
| OpenHands | gpt-4o-2024-08-06 | 12.12 ± 1.52 | 1.75 ± 0.88 | 2.22 ± 2.22 | 4.89 ± 0.44 | 24 | 2024-10-08 | ✓ | ✓ |
| [AIDE](https://github.com/wecoai/aideml) | llama-3.1-405b-instruct | 10.23 ± 1.14 | 0.66 ± 0.66 | 0.00 ± 0.00 | 3.33 ± 0.38 | 24 | 2024-10-08 | ✓ | ✓ |
| MLAB | gpt-4o-2024-08-06 | 4.55 ± 0.86 | 0.00 ± 0.00 | 0.00 ± 0.00 | 1.60 ± 0.27 | 24 | 2024-10-08 | ✓ | ✓ |

### Additional leaderboard submissions

Additional submissions that are not directly comparable to the main leaderboard (see the `Notes` column).

| Agent | LLM(s) used | Low == Lite (%) | Medium (%) | High (%) | All (%) | Running Time (hours) | Date | Notes | Source Code Available | Grading Reports Available |
|-------|-------------|-----------------|------------|----------|---------|----------------------|------|-------|----------------------|---------------------------|
| [Disarray](https://disarray.ai) | Ensemble (Claude-Opus-4.5, Claude-Sonnet-4.5, GPT-5.2-Codex, Gemini-3-Pro-Preview) | 90.91 ± 0.00 | 72.81 ± 0.88 | 71.11 ± 2.22 | 77.78 ± 0.44 | 24 | 2026-02-03 | [Test set feedback](https://github.com/openai/mle-bench/pull/118) | X | ✓ |
| [LoongFlow](https://github.com/baidu-baige/LoongFlow) | Gemini-3-Flash-Preview | 77.27 ± 0.0[^3] | 63.15 ± 1.51[^3] | 40.0 ± 0.00[^3] | 62.66 ± 0.76[^3] | 24 | 2026-02-09 | [Test set feedback](https://github.com/openai/mle-bench/pull/119) | ✓ | ✓ |

[^2]: With some assistance from an ensemble of models including Gemini-2.5-Pro, Grok-4, and Claude 4.1 Opus, distilled by Gemini-2.5-Pro.
[^3]: Computed by padding incomplete seeds with failing scores.
[^4]: The architecture is primarily driven by Gemini-3-Pro-Preview, with some modules using GPT-5 and GPT-5-mini.

### Producing leaderboard scores

To produce leaderboard scores, collate your grading reports in a `runs/` folder organized by run group, with one grading report per run group. Identify the run groups that belong to your submission with an experiment ID in `runs/run_group_experiments.csv`. Then run

```
uv run python experiments/aggregate_grading_reports.py --experiment-id <experiment-id> --split low
uv run python experiments/aggregate_grading_reports.py --experiment-id <experiment-id> --split medium
uv run python experiments/aggregate_grading_reports.py --experiment-id <experiment-id> --split high
uv run python experiments/aggregate_grading_reports.py --experiment-id <experiment-id> --split split75
```

Report the mean and standard error of the mean (SEM) for each split, as given under the `any_medal_percentage` metric in the output. The `split75` split (passed as `--split split75`) corresponds to the `All (%)` column.

## Benchmarking

This section describes a canonical setup for comparing scores on MLE-bench. We recommend the following:

- Repeat each evaluation with at least 3 seeds and report the Any Medal (%) score as the mean ± one standard error of the mean. The evaluation itself (task and grading) is deterministic, but agents and LLMs can have very high variance!
- Agent resources - this is not a strict requirement of the benchmark, but please report it if you stray from these defaults!
  - Runtime: 24 hours
  - Compute: 36 vCPUs with 440GB RAM and one 24GB A10 GPU
- Include a breakdown of your scores across the Low, Medium, High, and All complexity [splits](experiments/splits) (see *Lite evaluation* below for why this is useful).

### Lite evaluation

Evaluating agents with the above settings on the full 75 competitions of MLE-bench can be expensive. For users who prefer a "lite" version of the benchmark, we recommend the [Low complexity split](https://github.com/openai/mle-bench/blob/main/experiments/splits/low.txt) of our dataset, which contains only 22 competitions. This substantially reduces the number of runs while still allowing a fair comparison along one column of the table above.

In addition, the Low complexity competitions tend to be much more lightweight (158GB total dataset size compared to 3.3TB for the full set), so users may also consider reducing the runtime or the compute resources available to agents to cut costs further. Note, however, that doing so may degrade agent performance. For example, see [Sections 3.3 and 3.4 of our paper](https://arxiv.org/abs/2410.07095), where we experimented with varying resource allocations on the full competition set.

The Lite dataset contains the following competitions:

| Competition ID | Category | Dataset Size (GB) |
|---------------------------------------------|----------------------------|--------------------|
| aerial-cactus-identification | Image Classification | 0.0254 |
| aptos2019-blindness-detection | Image Classification | 10.22 |
| denoising-dirty-documents | Image To Image | 0.06 |
| detecting-insults-in-social-commentary | Text Classification | 0.002 |
| dog-breed-identification | Image Classification | 0.75 |
| dogs-vs-cats-redux-kernels-edition | Image Classification | 0.85 |
| histopathologic-cancer-detection | Image Regression | 7.76 |
| jigsaw-toxic-comment-classification-challenge | Text Classification | 0.06 |
| leaf-classification | Image Classification | 0.036 |
| mlsp-2013-birds | Audio Classification | 0.5851 |
| new-york-city-taxi-fare-prediction | Tabular | 5.7 |
| nomad2018-predict-transparent-conductors | Tabular | 0.00624 |
| plant-pathology-2020-fgvc7 | Image Classification | 0.8 |
| random-acts-of-pizza | Text Classification | 0.003 |
| ranzcr-clip-catheter-line-classification | Image Classification | 13.13 |
| siim-isic-melanoma-classification | Image Classification | 116.16 |
| spooky-author-identification | Text Classification | 0.0019 |
| tabular-playground-series-dec-2021 | Tabular | 0.7 |
| tabular-playground-series-may-2022 | Tabular | 0.57 |
| text-normalization-challenge-english-language | Seq->Seq | 0.01 |
| text-normalization-challenge-russian-language | Seq->Seq | 0.01 |
| the-icml-2013-whale-challenge-right-whale-redux | Audio Classification | 0.29314 |

## Setup

Some MLE-bench competition data is stored using [Git-LFS](https://git-lfs.com/). Once you have downloaded and installed LFS, run:

```
git lfs fetch --all
git lfs pull
```

You can install `mlebench` with pip:

```
pip install -e .
```

### Pre-Commit Hooks (Optional)

If you are committing code, you can install the pre-commit hooks by running:

```
pre-commit install
```

## Dataset

The MLE-bench dataset is a collection of 75 Kaggle competitions that we use to evaluate the ML engineering capabilities of AI systems.

Since Kaggle does not provide the held-out test set for each competition, we provide preparation scripts that split the publicly available training set into a new training set and test set.

For each competition, we also provide grading scripts that can be used to score a submission.

We use the [Kaggle API](https://github.com/Kaggle/kaggle-api) to download the raw datasets. Make sure you have downloaded your Kaggle credentials (`kaggle.json`) and placed them in the `~/.kaggle/` directory (the default location where the Kaggle API looks for your credentials).

To download and prepare the full MLE-bench dataset in your system's default cache directory, run the following command. Note that we have found this to take two days when running from scratch:

```
mlebench prepare --all
```

To prepare the lite dataset, run:

```
mlebench prepare --lite
```

Alternatively, you can prepare the dataset for a specific competition by running:

```
mlebench prepare -c <competition-id>
```

Run `mlebench prepare --help` to see the list of available competitions.

## Grading Submissions

Answers for competitions must be submitted in CSV format; the required format is described in each competition's description or shown in the competition's sample submission file. You can grade multiple submissions with the `mlebench grade` command. Given a JSONL file in which each line corresponds to a submission for one competition, `mlebench grade` produces a grading report for each competition. The JSONL file must contain the following fields:

- `competition_id`: the ID of the competition in our dataset.
- `submission_path`: the path to a `.csv` file containing the predictions for the specified competition.

Run `mlebench grade --help` to see more information.

You can also grade a single submission with the `mlebench grade-sample` command. For example, to grade a submission for the Spaceship Titanic competition, you can run:

```
mlebench grade-sample <PATH_TO_SUBMISSION> spaceship-titanic
```

Run `mlebench grade-sample --help` to see more information.

## Environment

We provide a base Docker image, `mlebench-env`, which is the base environment for our agents. This base image contains:

- A Conda environment used to execute our agents. We optionally (default true) install Python packages that are commonly used across our agents into this environment. If you do not want these packages installed, set the `INSTALL_HEAVY_DEPENDENCIES` build argument to `false` when building the image by adding `--build-arg INSTALL_HEAVY_DEPENDENCIES=false` to the `docker build` command below
- Instructions for agents to follow when creating their submission
- A grading server that agents can use to check that the structure of their submission is correct

Build this image by running:

```
docker build --platform=linux/amd64 -t mlebench-env -f environment/Dockerfile .
```

## Agents

We purposefully designed our benchmark to make no assumptions about the agent that produces submissions, so that agents can be evaluated on this benchmark more easily. We evaluated three open-source agents; we discuss this procedure in [agents/README.md](agents/README.md).

## Extras

We include additional features in the MLE-bench repository that may be useful for MLE-bench evaluation. These include a rule violation detector and a plagiarism detector. We refer readers to [extras/README.md](extras/README.md) for more information.

## Examples

We collect example usage of this library in the `examples/` directory; see [examples/README.md](examples/README.md) for more information.

## Experiments

We place the code specific to the experiments from our publication of the benchmark in the `experiments/` directory:

- For instance, our competition splits are available in `experiments/splits/`.
- For a completed set of runs from a given agent, you can use the provided `experiments/make_submission.py` script to compile its submission for grading.
- We release the methodology behind our "familiarity" experiments in `experiments/familiarity/`; see [experiments/familiarity/README.md](experiments/familiarity/README.md) for more information.

## Development

Note that when running `pytest` locally, you must accept the competition rules, otherwise the tests will fail.

## Known Issues

There are some known issues with certain MLE-bench competitions. Since we have already received leaderboard submissions, we are postponing fixes to avoid invalidating the leaderboard. Instead, we plan to release batched fixes in an upcoming MLE-bench v2 release in the [openai/frontier-evals](https://github.com/openai/frontier-evals) repository, which will include a version column in the leaderboard to distinguish v1 and v2 results. If you wish to submit to v1 in the meantime, please still include the following competitions in your overall score. The known issues are summarized below:

- **tensorflow2-question-answering**:
  - The `validate_submission` function in `grade.py` fails on this competition because the answers file is `test.jsonl` rather than `test.csv`. [#134](https://github.com/openai/mle-bench/issues/134)
- **tensorflow-speech-recognition-challenge**:
  - The prepare.py script prepares the test set incorrectly, so the test labels cover a much wider range than they should. [#63](https://github.com/openai/mle-bench/issues/63)
  - The prepare.py script fails to create a test set whose speaker IDs are disjoint from the speaker IDs in train/val.
- **icecube-neutrinos-in-deep-ice**: Checksum mismatch. [#58](https://github.com/openai/mle-bench/issues/58)
- **ranzcr-clip-catheter-line-classification**: The prepare.py script results in missing columns in the sample submission. [#30](https://github.com/openai/mle-bench/issues/30)
- **dog-breed-identification**: The MLE-bench test split was created by drawing images from a publicly labeled corpus derived from the Stanford Dogs Dataset, which agents may discover and exploit. [#128](https://github.com/openai/mle-bench/issues/128)
- **invasive-species-monitoring**: The prepare.py script archives the prepared `train/` and `test/` directories incorrectly, so `train.7z` and `test.7z` in the prepared public dataset may be missing their image contents. [#122](https://github.com/openai/mle-bench/issues/122)
- **tabular-playground-series-dec-2021**: The leaderboard is crowded: the difference between the top score and the median score is very small.
- **tabular-playground-series-may-2022**: The leaderboard is crowded: the difference between the top score and the median score is very small.
- **jigsaw-toxic-comment-classification-challenge**: The leaderboard is crowded: the difference between the top score and the median score is very small.
- **champs-scalar-coupling**: Test molecules are missing from structures.csv. [#70](https://github.com/openai/mle-bench/pull/70)
- **multi-modal-gesture-recognition**: The public test `.mat` files leak the test labels. [#77](https://github.com/openai/mle-bench/issues/77)
- **smartphone-decimeter-2022**: The public test `span_log.nmea` files leak information that makes a perfect score trivial to achieve. [#93](https://github.com/openai/mle-bench/issues/93)
- **hubmap-kidney-segmentation**: The public test `{image_id}.json` files leak information that makes a near-perfect score trivial to achieve. They should be removed.
- **random-acts-of-pizza**: The `giver_username_if_known` field leaks the outcome, making perfect predictions trivial. This competition should be dropped. [#108](https://github.com/openai/mle-bench/issues/108)

## Authors

Chan Jun Shern, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry

## Citation

Please cite using the following BibTeX entry:

```
@article{chan2024mle-bench,
  title={MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering},
  author={Jun Shern Chan and Neil Chowdhury and Oliver Jaffe and James Aung and Dane Sherburn and Evan Mays and Giulio Starace and Kevin Liu and Leon Maksin and Tejal Patwardhan and Lilian Weng and Aleksander Mądry},
  year={2024},
  eprint={2410.07095},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.07095}
}
```
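
The reporting convention from the benchmarking guidance (Any Medal (%) as the mean ± one standard error of the mean over at least 3 seeds) can be illustrated with a minimal sketch. It assumes the SEM is the sample standard deviation (n − 1 denominator) divided by √n; the repository's `experiments/aggregate_grading_reports.py` is the authoritative implementation and may differ. The function name `mean_and_sem` and the per-seed medal counts are hypothetical, chosen only for illustration.

```python
import math
import statistics


def mean_and_sem(scores):
    """Return (mean, standard error of the mean) for a list of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate the SEM")
    mean = statistics.mean(scores)
    # Assumption: sample standard deviation (n - 1 denominator) over sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Hypothetical example: an agent medals in 17, 18, and 18 of the
# 22 Lite competitions across three seeds.
medal_counts = [17, 18, 18]
any_medal_pct = [100.0 * count / 22 for count in medal_counts]

mean, sem = mean_and_sem(any_medal_pct)
print(f"{mean:.2f} ± {sem:.2f}")  # prints "80.30 ± 1.52"
```

Because the spread is estimated from only a handful of seeds, the SEM itself is noisy; adding seeds tightens it roughly in proportion to 1/√n.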

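The JSONL input expected by `mlebench grade` (one line per competition, with the `competition_id` and `submission_path` fields) could be assembled along these lines. This is a sketch only: the helper name `write_submission_jsonl`, the output file name, and the CSV paths are hypothetical, and `mlebench grade --help` remains the reference for how the file is passed to the command.

```python
import json
from pathlib import Path


def write_submission_jsonl(submissions, out_path):
    """Write one JSON object per line with the two required fields."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for competition_id, csv_path in submissions.items():
            record = {
                "competition_id": competition_id,
                "submission_path": str(csv_path),
            }
            f.write(json.dumps(record) + "\n")
    return out_path


# Hypothetical paths; point these at your agent's real prediction files.
submissions = {
    "spaceship-titanic": "outputs/spaceship-titanic/submission.csv",
    "leaf-classification": "outputs/leaf-classification/submission.csv",
}
jsonl_path = write_submission_jsonl(submissions, "submissions.jsonl")
print(jsonl_path.read_text(encoding="utf-8"))
```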
Tags: AI agents, Apex, LLM, Unmanaged PE, artificial intelligence, machine learning, user-mode hook bypass, evaluation tools, reverse-engineering tools