meituan-longcat/WBench

GitHub: meituan-longcat/WBench

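WBench is a comprehensive multi-turn benchmark for interactive video world models. It systematically evaluates and ranks mainstream video world models across five dimensions and 22 metrics.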

Stars: 174 | Forks: 9

# WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

[![Homepage](https://img.shields.io/badge/Homepage-blue?style=for-the-badge&logo=google-chrome&logoColor=white)](https://meituan-longcat.github.io/WBench/) [![Paper](https://img.shields.io/badge/Paper-red?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2605.25874) [![HF Daily Paper](https://img.shields.io/badge/Daily_Paper_%232-FFD21E?style=for-the-badge&logo=huggingface&logoColor=white&color=FF9D00)](https://huggingface.co/papers/2605.25874) [![Leaderboard](https://img.shields.io/badge/Leaderboard-32CD32?style=for-the-badge&logo=google-chrome&logoColor=white)](https://meituan-longcat.github.io/WBench/#leaderboard) [![Datasets](https://img.shields.io/badge/Datasets-4285F4?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/meituan-longcat/WBench) [![Weights](https://img.shields.io/badge/Weights-FF9D00?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/meituan-longcat/WBench-weights) [![Examples](https://img.shields.io/badge/Examples-FF9D00?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/meituan-longcat/WBench-examples)
[![ModelScope](https://img.shields.io/badge/ModelScope-6B4EFF?style=for-the-badge&logo=data:image/svg+xml;base64,PHN2ZyBmaWxsPSJ3aGl0ZSIgZmlsbC1ydWxlPSJldmVub2RkIiBoZWlnaHQ9IjFlbSIgc3R5bGU9ImZsZXg6bm9uZTtsaW5lLWhlaWdodDoxIiB2aWV3Qm94PSIwIDAgMjQgMjQiIHdpZHRoPSIxZW0iIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyI+PHRpdGxlPk1vZGVsU2NvcGU8L3RpdGxlPjxwYXRoIGQ9Ik0yLjY2NyA1LjNIOHYyLjY2N0g1LjMzM3YyLjY2NkgyLjY2N1Y4LjQ2N0guNXYyLjE2NmgyLjE2N1YxMy4zSDBWNy45NjdoMi42NjdWNS4zek0yLjY2NyAxMy4zaDIuNjY2djIuNjY3SDh2Mi42NjZIMi42NjdWMTMuM3pNOCAxMC42MzNoMi42NjdWMTMuM0g4di0yLjY2N3pNMTMuMzMzIDEzLjN2Mi42NjdoLTIuNjY2VjEzLjNoMi42NjZ6TTEzLjMzMyAxMy4zdi0yLjY2N0gxNlYxMy4zaC0yLjY2N3oiPjwvcGF0aD48cGF0aCBjbGlwLXJ1bGU9ImV2ZW5vZGQiIGQ9Ik0yMS4zMzMgMTMuM3YtMi42NjdoLTIuNjY2VjcuOTY3SDE2VjUuM2g1LjMzM3YyLjY2N0gyNFYxMy4zaC0yLjY2N3ptMC0yLjY2N0gyMy41VjguNDY3aC0yLjE2N3YyLjE2NnoiPjwvcGF0aD48cGF0aCBkPSJNMjEuMzMzIDEzLjN2NS4zMzNIMTZ2LTIuNjY2aDIuNjY3VjEzLjNoMi42NjZ6Ij48L3BhdGg+PC9zdmc+&logoColor=white)](https://modelscope.cn/datasets/meituan-longcat/WBench) [![中文解读](https://img.shields.io/badge/中文解读-07C160?style=for-the-badge&logo=wechat&logoColor=white)](https://mp.weixin.qq.com/s/br3RlOBGtReolLZc5YW2HA) [![微信直播](https://img.shields.io/badge/WeChat_Live-07C160?style=for-the-badge&logo=wechat&logoColor=white)](https://weixin.qq.com/sph/Aue3nWCWCx) [![TWITTER POST](https://img.shields.io/badge/TWITTER_POST-000000?style=for-the-badge&logo=x&logoColor=white)](https://x.com/Meituan_LongCat/status/2059658634829996047) [![微信群](https://img.shields.io/badge/WeChat_Group-07C160?style=for-the-badge&logo=wechat&logoColor=white)](assets/wx_qr.png)

Is your world model an all-rounder?

TL;DR: WBench evaluates 20 video world models across 5 dimensions and 22 metrics.

## 📢 News

- **[2026/06/10]** 🧭 Added [HY-World 1.5 pose exports](https://huggingface.co/datasets/meituan-longcat/WBench-examples/tree/main/hyworld1.5/poses) to [WBench-examples](https://huggingface.co/datasets/meituan-longcat/WBench-examples).
- **[2026/06/01]** WBench is now an official benchmark on [Hugging Face](https://huggingface.co/datasets/meituan-longcat/WBench) 🤗 (including the navigation and full task sets)!
- **[2026/06/01]** 📦 Released [WBench-examples](https://huggingface.co/datasets/meituan-longcat/WBench-examples): ready-to-evaluate videos from HY-World 1.5 and Kling 3.0.
- **[2026/06/01]** 🎮 Added [camera- and action-conditioned examples](#-implement-your-model) plus web automation (Genie3, Happy Oyster).
- **[2026/06/01]** Added [Claude Code skills](#-claude-code-skills) 🤖 for generation, evaluation, and submission.
- **[2026/05/29]** The paper ranked **#2** on [Hugging Face Daily Papers](https://huggingface.co/papers/2605.25874) 🏅!
- **[2026/05/28]** The paper is now available on [arXiv](https://arxiv.org/abs/2605.25874) 📄!
- **[2026/05/28]** The [homepage](https://meituan-longcat.github.io/WBench/) is live, with an interactive [leaderboard](https://meituan-longcat.github.io/WBench/#leaderboard) and a [dataset gallery](https://meituan-longcat.github.io/WBench/#gallery)! 🌐
- **[2026/05/28]** 🚀 Released the full [WBench dataset](https://huggingface.co/datasets/meituan-longcat/WBench), [evaluation code](https://github.com/meituan-longcat/WBench), and [model weights](https://huggingface.co/meituan-longcat/WBench-weights).

## 🏆 Leaderboard

**20 models, navigation split (5 dimensions, sorted by average)**

| # | Model | **Avg** | Quality | Setting | Interaction | Consistency | Physical |
|:---:|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | Kling 3.0 | **79.2 🥇** | 83.0 🥈 | 91.0 🥈 | 70.3 | 82.5 | 69.3 🥉 |
| 2 | LingBot-World | **78.8 🥈** | 81.5 | 72.6 | 79.8 | 88.9 🥇 | 71.2 🥈 |
| 3 | Wan 2.7 | **78.5 🥉** | 82.6 🥉 | 91.4 🥇 | 66.0 | 80.5 | 71.8 🥇 |
| 4 | HY-World 1.5 | **78.4** | 80.2 | 72.2 | 87.5 🥇 | 86.0 | 66.3 |
| 5 | HY-Video 1.5 | **78.2** | 79.7 | 85.6 🥉 | 71.8 | 86.7 🥉 | 67.4 |
| 6 | Happy Oyster | **77.1** | 79.3 | 74.2 | 85.1 🥈 | 83.3 | 63.5 |
| 7 | Seedance 1.5 | **76.5** | 83.2 🥇 | 82.9 | 68.0 | 80.2 | 68.4 |
| 8 | Cosmos 2.5 | **75.2** | 75.6 | 83.3 | 64.1 | 85.6 | 67.4 |
| 9 | LTX 2.3 | **74.4** | 78.7 | 85.2 | 67.6 | 75.6 | 64.9 |
| 10 | InSpatio-World | **74.3** | 74.9 | 71.4 | 72.8 | 87.4 🥈 | 65.2 |
| 11 | Fantasy-World | **74.2** | 75.5 | 71.3 | 72.1 | 85.3 | 66.8 |
| 12 | Genie 3 | **74.1** | 77.4 | 72.5 | 73.3 | 81.4 | 65.7 |
| 13 | LongCat-Video | **73.7** | 78.2 | 72.3 | 63.1 | 85.9 | 68.9 |
| 14 | YUME 1.5 | **73.5** | 79.5 | 72.4 | 72.0 | 78.6 | 65.2 |
| 15 | Infinite-World | **72.9** | 78.7 | 69.3 | 75.9 | 78.7 | 62.1 |
| 16 | MatrixGame3 | **71.2** | 76.9 | 63.6 | 83.5 🥉 | 72.9 | 59.3 |
| 17 | Kairos 3.0 | **70.7** | 76.4 | 70.3 | 65.1 | 81.4 | 60.4 |
| 18 | HY-GameCraft | **68.5** | 74.9 | 66.6 | 67.8 | 70.6 | 62.4 |
| 19 | MatrixGame2 | **68.5** | 75.7 | 67.1 | 80.6 | 62.0 | 57.2 |
| 20 | Astra | **64.0** | 69.7 | 59.6 | 67.7 | 71.6 | 51.4 |

**9 text-driven models, full split (5 dimensions, sorted by average)**

| # | Model | **Avg** | Quality | Setting | Interaction | Consistency | Physical |
|:---:|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | Kling 3.0 | **79.5 🥇** | 81.8 🥉 | 91.0 🥈 | 73.1 🥇 | 82.6 | 69.2 🥈 |
| 2 | Wan 2.7 | **78.2 🥈** | 82.2 🥈 | 91.4 🥇 | 72.1 🥈 | 73.8 | 71.6 🥇 |
| 3 | Seedance 1.5 | **76.2 🥉** | 83.0 🥇 | 82.9 | 68.3 🥉 | 78.5 | 68.2 |
| 4 | HY-Video 1.5 | **74.6** | 78.9 | 85.6 🥉 | 54.7 | 86.8 🥇 | 67.1 |
| 5 | LTX 2.3 | **71.0** | 78.8 | 85.2 | 49.4 | 76.4 | 65.1 |
| 6 | Cosmos 2.5 | **70.8** | 74.6 | 83.3 | 43.5 | 85.4 🥉 | 67.0 |
| 7 | LongCat-Video | **70.2** | 79.7 | 72.3 | 45.1 | 85.5 🥈 | 68.4 🥉 |
| 8 | YUME 1.5 | **69.0** | 79.7 | 72.4 | 48.4 | 79.3 | 65.4 |
| 9 | Kairos 3.0 | **66.0** | 75.8 | 70.3 | 41.6 | 81.9 | 60.5 |
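
The average column in the leaderboard above is numerically consistent with an unweighted mean of the five dimension scores. The sketch below checks this for three rows copied from the navigation split. The equal weighting is an inference from the published numbers, not something this README states, and `average_score` is a hypothetical helper rather than part of the WBench code.

```python
from statistics import mean

# Dimension scores copied from the navigation-split leaderboard.
# Order: quality, setting, interaction, consistency, physical.
# Each entry pairs the five scores with the reported average.
ROWS = {
    "Kling 3.0": ([83.0, 91.0, 70.3, 82.5, 69.3], 79.2),
    "LingBot-World": ([81.5, 72.6, 79.8, 88.9, 71.2], 78.8),
    "Astra": ([69.7, 59.6, 67.7, 71.6, 51.4], 64.0),
}


def average_score(dimension_scores):
    """Unweighted mean of the five dimension scores (assumed aggregation)."""
    return mean(dimension_scores)


for model, (scores, reported) in ROWS.items():
    computed = average_score(scores)
    # Reported values are rounded to one decimal, so allow a small rounding gap.
    status = "matches" if abs(computed - reported) <= 0.05 + 1e-9 else "differs"
    print(f"{model}: computed {computed:.2f}, reported {reported} ({status})")
```

Because the published dimension scores are themselves rounded, a recomputed average can differ from the reported one in the last digit.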

**20 models, navigation split (19 metrics)**

| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Cross-Model Subject Consistency | Visual Plausibility | Causal Fidelity |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| HY-Video 1.5 | 63.4 | 67.4 | 92.1 | 94.2 | 73.9 | 98.7 | 68.0 | 77.5 | 93.6 | 71.8 | 79.2 | 75.1 | 86.6 | 99.4 | 94.6 | 80.3 | 91.6 | 59.7 | 75.0 |
| Kling 3.0 | 63.0 | 68.1 | 92.3 | 93.2 | 97.5 | 97.6 | 69.1 | 89.0 | 92.9 | 70.3 | 75.2 | 75.1 | 76.8 | 93.0 | 88.9 | 79.9 | 88.5 | 60.7 | 78.0 |
| Cosmos 2.5 | 61.8 | 66.9 | 92.3 | 94.8 | 49.0 | 98.2 | 66.5 | 72.4 | 94.2 | 64.1 | 78.1 | 74.3 | 84.3 | 94.3 | 94.6 | 81.6 | 92.3 | 60.1 | 74.7 |
| LTX 2.3 | 57.9 | 61.0 | 88.3 | 93.2 | 98.1 | 96.4 | 56.1 | 81.3 | 89.2 | 67.6 | 70.2 | 70.2 | 69.8 | 75.8 | 76.9 | 79.2 | 87.2 | 55.7 | 74.0 |
| Seedance 1.5 | 61.0 | 69.3 | 89.6 | 92.4 | 99.4 | 97.5 | 73.0 | 71.6 | 94.2 | 68.0 | 72.7 | 72.4 | 70.5 | 96.2 | 82.4 | 76.8 | 90.1 | 60.7 | 76.0 |
| Wan 2.7 | 61.4 | 68.0 | 89.4 | 92.2 | 100.0 | 96.3 | 71.1 | 88.3 | 94.6 | 66.0 | 71.0 | 71.0 | 78.2 | 92.4 | 83.7 | 76.4 | 90.7 | 60.3 | 83.3 |
| Kairos 3.0 | 59.9 | 62.7 | 91.1 | 95.4 | 70.1 | 97.5 | 58.5 | 52.2 | 88.5 | 65.1 | 76.8 | 62.0 | 76.3 | 94.3 | 89.0 | 80.8 | 90.8 | 58.0 | 62.7 |
| LongCat-Video | 66.5 | 69.6 | 95.1 | 94.8 | 45.9 | 97.9 | 77.6 | 53.1 | 91.5 | 63.1 | 83.3 | 66.2 | 81.5 | 99.4 | 95.4 | 82.2 | 93.4 | 61.8 | 76.0 |
| YUME 1.5 | 58.7 | 63.3 | 90.3 | 93.0 | 96.8 | 97.0 | 57.0 | 53.1 | 91.7 | 72.0 | 71.5 | 71.4 | 48.0 | 99.4 | 88.0 | 83.3 | 88.8 | 57.7 | 72.7 |
| Astra | 48.6 | 52.5 | 85.3 | 96.0 | 79.6 | 97.7 | 28.0 | 43.4 | 75.9 | 67.7 | 64.7 | 63.3 | 30.0 | 86.6 | 85.6 | 87.5 | 83.5 | 54.6 | 48.3 |
| Fantasy-World | 63.0 | 62.8 | 94.2 | 95.8 | 49.0 | 97.9 | 65.8 | 52.4 | 90.1 | 72.1 | 80.6 | 64.2 | 79.8 | 100.0 | 95.3 | 84.8 | 92.5 | 59.7 | 74.0 |
| HY-GameCraft | 52.6 | 58.7 | 86.5 | 93.7 | 96.8 | 97.6 | 38.3 | 50.6 | 82.5 | 67.8 | 60.5 | 60.5 | 17.9 | 99.4 | 88.3 | 85.0 | 82.6 | 56.5 | 68.3 |
| Genie 3 | 51.6 | 59.3 | 90.7 | 95.0 | 92.4 | 97.8 | 55.2 | 61.1 | 83.8 | 73.3 | 79.9 | 78.4 | 54.5 | 93.6 | 88.6 | 84.5 | 90.4 | 59.7 | 71.7 |
| Happy Oyster | 56.6 | 63.9 | 91.4 | 94.0 | 94.2 | 97.0 | 58.3 | 57.4 | 91.1 | 85.1 | 77.7 | 75.8 | 75.0 | 96.2 | 87.2 | 79.8 | 91.5 | 57.6 | 69.3 |
| HY-World 1.5 | 60.1 | 65.4 | 92.7 | 93.5 | 91.1 | 98.1 | 60.5 | 53.5 | 90.8 | 87.5 | 90.6 | 84.9 | 62.5 | 100.0 | 92.0 | 83.1 | 89.1 | 58.6 | 74.0 |
| Infinite-World | 58.7 | 66.1 | 88.8 | 94.1 | 82.8 | 98.0 | 62.3 | 54.0 | 84.5 | 75.9 | 74.9 | 74.4 | 33.8 | 100.0 | 94.3 | 85.1 | 88.4 | 57.2 | 67.0 |
| InSpatio-World | 64.4 | 67.6 | 95.0 | 96.0 | 26.1 | 98.8 | 76.1 | 51.7 | 91.1 | 72.8 | 93.8 | 66.5 | 72.5 | 100.0 | 97.3 | 87.4 | 94.4 | 63.1 | 67.3 |
| LingBot-World | 66.9 | 67.9 | 96.9 | 94.1 | 66.2 | 96.9 | 81.4 | 51.6 | 93.6 | 79.8 | 92.7 | 67.1 | 90.9 | 99.4 | 95.4 | 83.3 | 93.5 | 64.8 | 77.7 |
| MatrixGame2 | 54.0 | 60.3 | 86.9 | 94.6 | 94.9 | 98.2 | 41.0 | 49.4 | 84.9 | 80.6 | 64.5 | 64.5 | 29.2 | 21.0 | 86.1 | 81.3 | 87.2 | 55.0 | 59.3 |
| MatrixGame3 | 46.4 | 70.0 | 85.7 | 86.3 | 97.5 | 95.4 | 57.1 | 48.9 | 78.4 | 83.5 | 81.0 | 80.4 | 13.3 | 89.8 | 87.6 | 75.3 | 83.0 | 54.0 | 64.7 |

**9 text-driven models, full split (22 metrics)**

| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Event Edit Adherence | Subject Action Adherence | Perspective Switch Adherence | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Cross-Model Subject Consistency | Visual Plausibility | Causal Fidelity |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| HY-Video 1.5 | 61.9 | 67.4 | 92.4 | 95.5 | 68.8 | 98.8 | 67.5 | 77.5 | 93.6 | 71.8 | 63.8 | 55.6 | 27.6 | 79.2 | 75.1 | 86.6 | 99.3 | 94.4 | 81.4 | 91.5 | 59.3 | 75.0 |
| Kling 3.0 | 61.3 | 67.7 | 92.7 | 94.5 | 89.9 | 97.9 | 68.8 | 89.0 | 92.9 | 70.3 | 81.4 | 85.6 | 55.0 | 75.2 | 75.1 | 76.8 | 92.7 | 89.4 | 80.4 | 88.5 | 60.4 | 78.0 |
| Cosmos 2.5 | 60.1 | 67.2 | 92.3 | 96.0 | 42.4 | 98.3 | 65.9 | 72.4 | 94.2 | 64.1 | 48.2 | 41.6 | 20.0 | 78.1 | 74.3 | 84.3 | 93.1 | 94.2 | 82.1 | 91.8 | 59.3 | 74.7 |
| LTX 2.3 | 56.9 | 62.3 | 89.3 | 94.1 | 94.4 | 96.8 | 57.7 | 81.3 | 89.2 | 67.6 | 53.0 | 51.8 | 25.0 | 70.2 | 70.2 | 69.8 | 77.8 | 81.1 | 79.4 | 86.7 | 56.2 | 74.0 |
| Seedance 1.5 | 59.7 | 69.8 | 89.6 | 93.4 | 98.3 | 97.6 | 72.9 | 71.6 | 94.2 | 68.0 | 80.4 | 80.0 | 45.0 | 72.7 | 72.4 | 62.7 | 92.4 | 83.5 | 76.7 | 89.3 | 60.5 | 76.0 |
| Wan 2.7 | 59.6 | 68.1 | 89.5 | 93.0 | 99.3 | 96.5 | 69.4 | 88.3 | 94.6 | 66.0 | 84.0 | 83.4 | 55.0 | 71.0 | 71.0 | 62.2 | 65.6 | 82.6 | 75.5 | 88.7 | 59.8 | 83.3 |
| Kairos 3.0 | 58.4 | 63.6 | 91.8 | 96.3 | 63.5 | 97.9 | 58.8 | 52.2 | 88.5 | 65.1 | 46.8 | 41.4 | 13.3 | 76.8 | 62.0 | 76.3 | 94.1 | 91.5 | 82.1 | 90.7 | 58.2 | 62.7 |
| LongCat-Video | 64.7 | 69.8 | 94.7 | 94.9 | 59.7 | 97.7 | 76.3 | 53.1 | 91.5 | 63.1 | 50.4 | 48.4 | 18.3 | 83.3 | 66.2 | 81.5 | 98.6 | 94.7 | 81.5 | 92.4 | 60.8 | 76.0 |
| YUME 1.5 | 59.3 | 65.7 | 92.0 | 94.8 | 86.1 | 97.7 | 62.0 | 53.1 | 91.7 | 72.0 | 57.8 | 47.0 | 16.7 | 71.5 | 71.4 | 48.0 | 99.3 | 91.1 | 84.1 | 89.4 | 58.1 | 72.7 |
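
For the Interaction, Setting, and Physical dimensions, the leaderboard scores are numerically consistent with a plain mean of the member metrics in the per-metric table above. The sketch below reproduces this for Kling 3.0 on the full split, using the metric names from this README's dimension table. The mean-of-metrics rule is inferred from the published numbers rather than stated by the authors, `dimension_scores` is a hypothetical helper, and only these three dimensions are checked here.

```python
from statistics import mean

# Metric grouping for three dimensions, as listed in this README's dimension table.
DIMENSIONS = {
    "interaction": [
        "navigation_trajectory",
        "event_edit_adherence",
        "subject_action_adherence",
        "perspective_switch_adherence",
    ],
    "setting": ["scene_adherence", "subject_adherence"],
    "physical": ["visual_plausibility", "causal_fidelity"],
}

# Kling 3.0 per-metric scores copied from the full-split table.
KLING_FULL = {
    "navigation_trajectory": 70.3,
    "event_edit_adherence": 81.4,
    "subject_action_adherence": 85.6,
    "perspective_switch_adherence": 55.0,
    "scene_adherence": 89.0,
    "subject_adherence": 92.9,
    "visual_plausibility": 60.4,
    "causal_fidelity": 78.0,
}


def dimension_scores(metrics):
    """Average each dimension's member metrics (assumed aggregation rule)."""
    return {
        dim: mean(metrics[name] for name in names)
        for dim, names in DIMENSIONS.items()
    }


for dim, score in dimension_scores(KLING_FULL).items():
    print(f"{dim}: {score:.2f}")
```

The leaderboard reports 73.1, 91.0, and 69.2 for these three dimensions, each within rounding distance of the recomputed values.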

## 🚀 Quick Start

```
# Install
git clone --recursive https://github.com/meituan-longcat/WBench.git
cd WBench

# If you already cloned without submodules
git submodule update --init --recursive

# Download data and weights
pip install huggingface_hub
hf download meituan-longcat/WBench --repo-type dataset --local-dir data/ --exclude "splits/*"
hf download meituan-longcat/WBench-weights --local-dir weights/

# Environment 1: wbench-main (all metrics except visual_plausibility)
# 2nd arg = CUDA build of PyTorch; it must match your system (check with `nvcc --version`):
#   cu124 → CUDA 12.x    cu121 → CUDA 12.1    cu118 → CUDA 11.8
# Always pass it explicitly: if omitted, auto-detection falls back to cu118 when nvcc is not on PATH,
# which makes the MegaSAM CUDA extensions fail to build on CUDA-12 machines.
bash tools/install.sh wbench-main cu124
conda activate wbench-main
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

# Verify
conda activate wbench-main
python tools/verify_install.py

# Run evaluation (automatic multi-GPU)
python main.py --model your_model
```

For detailed setup instructions, see [docs/installation.md](docs/installation.md).

## 🎮 Evaluate Your Model

First set the environment variables for the VLM metrics (we use [Doubao-Seed-2.0-lite](https://console.volcengine.com/ark/region:ark+cn-beijing/model/detail?Id=doubao-seed-2-0-lite) via [Volcengine ARK](https://www.volcengine.com/docs/82379/1099475)):

```
export VLM_API_KEY=""
# Optional (defaults shown):
# export VLM_API_URL="https://ark.cn-beijing.volces.com/api/v3"
# export VLM_MODEL_NAME="doubao-seed-2-0-lite-260215"
```

1. Generate the multi-turn videos → place them at `work_dirs//videos/case_{id}_combined.mp4`

2. Run the 4-phase pipeline:

```
# Full pipeline (precompute → GPU metrics → VLM metrics → report)
python main.py --model my_model --gpus 0,1,2,3,4,5,6,7

# Or run the phases independently:
python main.py --model my_model --phase precompute   # SAM2 + DA3 + MegaSAM
python main.py --model my_model --phase gpu          # GPU metrics (per-metric)
python main.py --model my_model --phase vlm          # VLM metrics (API)
python main.py --model my_model --phase report       # Aggregate report
```

**Note:** The pipeline above covers 21 of the 22 metrics. `visual_plausibility` is the exception: it runs in a **separate `wbench-vp` environment** (set up in [Quick Start](#-quick-start)):

```
conda activate wbench-vp
python tools/run_visual_plausibility.py --model my_model   # uses all available GPUs
```

3. Results: `work_dirs//evaluation/{metric}/case_{id}.json` + `report.json`

```
# Run specific metrics (by name or by dimension)
python main.py --model my_model --phase gpu --metrics hpsv3_quality
python main.py --model my_model --phase gpu --metrics quality       # all 6 video quality
python main.py --model my_model --phase gpu --metrics consistency   # all consistency metrics

# Skip pre-computation if it is already done
python main.py --model my_model --phase gpu --skip_megasam --skip_sam2 --skip_da3

# Single-video evaluation
python main.py --video video.mp4 --case data/cases/case_1.json
```

**Dimensions** (`--metrics` accepts the following shorthands):

| Dimension | Metrics |
|:---|:---|
| `quality` | aesthetic_quality, imaging_quality, temporal_flickering, dynamic_degree, motion_smoothness, hpsv3_quality |
| `consistency` | background_consistency, segment_continuity, perspective_consistency, subject_consistency, geometric_consistency, photometric_consistency, spatial_consistency, gated_spatial_consistency |
| `interaction` | navigation_trajectory, event_edit_adherence, subject_action_adherence, perspective_switch_adherence |
| `setting` | scene_adherence, subject_adherence |
| `physical` | visual_plausibility, causal_fidelity |

## 🔥 Implement Your Model

WBench supports 3 model types with different control interfaces:

| Type | Input | Cases | Status |
|:---|:---|:---:|:---:|
| **Text-conditioned** | Text prompt + first-frame image | 289 (full) | ✅ Implemented |
| **Camera-conditioned** | First-frame image + 6-DoF camera pose | 158 (navigation) | ✅ Implemented |
| **Action-conditioned** | First-frame image + discrete actions | 158 (navigation) | ✅ Implemented |

### Text-Conditioned Models

```
from src.models import get_model

# Available: wan, kling, seedance (or register your own)
model = get_model("wan")

# Generate a multi-turn video from a single case
# (case_dict: a parsed case, e.g. data/cases/case_1.json; assumed here to be plain JSON)
result = model.generate_multi_turn(
    case=case_dict,
    output_path="work_dirs/wan/videos/case_1_combined.mp4",
    data_root="data/",
)
```

Each turn: build the prompt from the interaction → call the I2V API → extract the last frame → move to the next turn. Set the API credentials:

```
export VIDEO_API_URL="https://your-video-api.com"
export VIDEO_API_KEY="your-key"
```

### Camera-Conditioned Models

The benchmark's navigation actions (W/A/S/D + arrow keys) are converted into a per-turn `{move, yaw, pitch}` intent, which is then converted into a 6-DoF camera trajectory. Subclass `CameraConditionedModel` and implement one hook; case parsing, action-to-pose conversion, and video writing are handled for you:

```
from src.models.camera import CameraConditionedModel

class MyWorldModel(CameraConditionedModel):
    def generate_with_poses(self, image, poses, video_length, **kw):
        # image: first-frame path; poses: {"": {"extrinsic": 4x4, "K": 3x3}, ...}
        # return: list of `video_length` BGR uint8 frames
        return my_model.infer(image, poses, video_length)

MyWorldModel("mymodel").generate_multi_turn(case_dict, "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")
```

The pose conventions (axes, speeds, intrinsics) live in `src/models/camera/poses.py`; copy and adapt them to your model. The navigation metric normalizes scale, so what matters is matching the *intent* of each action. To inspect a case quickly:

```
python -m src.models.camera.demo --case data/cases/case_1.json   # prints poses + renders a preview
```

### Action-Conditioned Models

Two flavors, both driven by the same per-turn navigation plan:

**Programmatic controllers** (e.g. Matrix-Game-3). Subclass `ActionConditionedModel` and implement `generate_with_actions`. Each action carries the raw key `token` and MG3-style `{keyboard, mouse}` tensors:

```
from src.models.action import ActionConditionedModel

class MyActionModel(ActionConditionedModel):
    def generate_with_actions(self, image, actions, video_length, **kw):
        # actions: [{"turn", "tokens", "keyboard", "mouse", "duration"}, ...]
        return my_model.infer(image, actions, video_length)

MyActionModel("mymodel").generate_multi_turn(case_dict, "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")
```

```
python -m src.models.action.demo --case data/cases/case_1.json   # prints actions + renders a preview
```

**Web products** (e.g. Project Genie, Happy Oyster): no weights or API; they are driven by browser automation plus simulated key presses. See [`src/models/action/web/`](src/models/action/web/README.md).

## 🤖 Claude Code Skills

If you use [Claude Code](https://claude.com/claude-code), this repository ships skills that drive the full workflow. Ask in natural language and Claude runs the right commands:

| Skill | Trigger phrase | What it does |
|:---|:---|:---|
| `wbench-generate` | "generate kling videos" | Runs `generate.py` on the dataset → `work_dirs//videos/` |
| `wbench-evaluate` | "evaluate kling3" | Runs the 4-phase `main.py` pipeline (precompute → gpu → vlm → report) |
| `wbench-submit` | "package my model for submission" | Builds the `meta.json` / `turns.json` bundle and uploads it to HuggingFace |
| `genie3` / `happy` | "run case_5 on genie3" | Browser automation for web products ([details](src/models/action/web/README.md)) |

Skills live in `.claude/skills/` (and `src/models/action/web/.claude/skills/`) and are discovered automatically when you open the repository in Claude Code.

## 📋 TODO

- [x] Text-conditioned model generation (Wan, Kling, Seedance)
- [x] Homepage with interactive leaderboard
- [x] Dataset and weights released on HuggingFace
- [x] Camera-conditioned model generation example
- [x] Action-conditioned model generation example
- [ ] Hosted submission and evaluation service (submit videos, get scores)
- [x] arXiv paper release

## 📝 Citation

If you find our work useful, please consider citing:

```
@article{ying2026wbenchcomprehensivemultiturnbenchmark,
  title={WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation},
  author={Ying, Kaining and Hu, Hengrui and Ren, Siyu and Li, Jiamu and Chen, Fengjiao and Wang, Ziwen and Cao, Xuezhi and Cai, Xunliang and Ding, Henghui},
  journal={arXiv preprint arXiv:2605.25874},
  year={2026}
}
```

## 🙏 Acknowledgments

This project builds on the following excellent work:

- [WorldScore](https://github.com/WorldScore/WorldScore): world model evaluation framework
- [VBench](https://github.com/Vchitect/VBench): video quality metrics
- [SAM2](https://github.com/facebookresearch/sam2): Segment Anything Model 2, used for mask tracking
- [Depth-Anything-V3](https://github.com/DepthAnything/Depth-Anything-V3): monocular depth estimation
- [MegaSAM](https://github.com/mega-sam/mega-sam): camera pose estimation
- [DreamSim](https://github.com/ssundaram21/dreamsim): perceptual similarity metric
- [HPSv3](

