# WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
[Homepage](https://meituan-longcat.github.io/WBench/) |
[arXiv](https://arxiv.org/abs/2605.25874) |
[Hugging Face Paper](https://huggingface.co/papers/2605.25874) |
[Leaderboard](https://meituan-longcat.github.io/WBench/#leaderboard) |
[Dataset (Hugging Face)](https://huggingface.co/datasets/meituan-longcat/WBench) |
[Weights](https://huggingface.co/meituan-longcat/WBench-weights) |
[Examples](https://huggingface.co/datasets/meituan-longcat/WBench-examples) |
[Dataset (ModelScope)](https://modelscope.cn/datasets/meituan-longcat/WBench) |
[WeChat article](https://mp.weixin.qq.com/s/br3RlOBGtReolLZc5YW2HA) |
[WeChat Channels](https://weixin.qq.com/sph/Aue3nWCWCx) |
[X](https://x.com/Meituan_LongCat/status/2059658634829996047) |
[WeChat QR code](assets/wx_qr.png)
Is your world model an all-rounder?
TL;DR: WBench evaluates 20 video world models across 5 dimensions and 22 metrics.
## 📢 News
- **[2026/06/10]** 🧭 Added [HY-World 1.5 pose exports](https://huggingface.co/datasets/meituan-longcat/WBench-examples/tree/main/hyworld1.5/poses) to [WBench-examples](https://huggingface.co/datasets/meituan-longcat/WBench-examples).
- **[2026/06/01]** WBench is now an official benchmark on [Hugging Face](https://huggingface.co/datasets/meituan-longcat/WBench) 🤗 (Navigation and Full tasks included)!
- **[2026/06/01]** 📦 Released [WBench-examples](https://huggingface.co/datasets/meituan-longcat/WBench-examples): ready-to-evaluate videos from HY-World 1.5 and Kling 3.0.
- **[2026/06/01]** 🎮 Added [camera- and action-conditioned examples](#-implement-your-model) plus web automation (Genie3, Happy Oyster).
- **[2026/06/01]** Added [Claude Code skills](#-claude-code-skills) for generation, evaluation, and submission 🤖.
- **[2026/05/29]** The paper ranked **#2** on [Hugging Face Daily Papers](https://huggingface.co/papers/2605.25874) 🏅!
- **[2026/05/28]** The paper is now available on [arXiv](https://arxiv.org/abs/2605.25874) 📄!
- **[2026/05/28]** The [homepage](https://meituan-longcat.github.io/WBench/) is live, with an interactive [leaderboard](https://meituan-longcat.github.io/WBench/#leaderboard) and [dataset gallery](https://meituan-longcat.github.io/WBench/#gallery)! 🌐
- **[2026/05/28]** 🚀 Released the full [WBench dataset](https://huggingface.co/datasets/meituan-longcat/WBench), [evaluation code](https://github.com/meituan-longcat/WBench), and [model weights](https://huggingface.co/meituan-longcat/WBench-weights).
## 🏆 Leaderboard
**20 models, Navigation split (5 dimensions, sorted by average)**

| # | Model | **Avg** | Quality | Setting | Interaction | Consistency | Physics |
|:---:|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | Kling 3.0 | **79.2 🥇** | 83.0 🥈 | 91.0 🥈 | 70.3 | 82.5 | 69.3 🥉 |
| 2 | LingBot-World | **78.8 🥈** | 81.5 | 72.6 | 79.8 | 88.9 🥇 | 71.2 🥈 |
| 3 | Wan 2.7 | **78.5 🥉** | 82.6 🥉 | 91.4 🥇 | 66.0 | 80.5 | 71.8 🥇 |
| 4 | HY-World 1.5 | **78.4** | 80.2 | 72.2 | 87.5 🥇 | 86.0 | 66.3 |
| 5 | HY-Video 1.5 | **78.2** | 79.7 | 85.6 🥉 | 71.8 | 86.7 🥉 | 67.4 |
| 6 | Happy Oyster | **77.1** | 79.3 | 74.2 | 85.1 🥈 | 83.3 | 63.5 |
| 7 | Seedance 1.5 | **76.5** | 83.2 🥇 | 82.9 | 68.0 | 80.2 | 68.4 |
| 8 | Cosmos 2.5 | **75.2** | 75.6 | 83.3 | 64.1 | 85.6 | 67.4 |
| 9 | LTX 2.3 | **74.4** | 78.7 | 85.2 | 67.6 | 75.6 | 64.9 |
| 10 | InSpatio-World | **74.3** | 74.9 | 71.4 | 72.8 | 87.4 🥈 | 65.2 |
| 11 | Fantasy-World | **74.2** | 75.5 | 71.3 | 72.1 | 85.3 | 66.8 |
| 12 | Genie 3 | **74.1** | 77.4 | 72.5 | 73.3 | 81.4 | 65.7 |
| 13 | LongCat-Video | **73.7** | 78.2 | 72.3 | 63.1 | 85.9 | 68.9 |
| 14 | YUME 1.5 | **73.5** | 79.5 | 72.4 | 72.0 | 78.6 | 65.2 |
| 15 | Infinite-World | **72.9** | 78.7 | 69.3 | 75.9 | 78.7 | 62.1 |
| 16 | MatrixGame3 | **71.2** | 76.9 | 63.6 | 83.5 🥉 | 72.9 | 59.3 |
| 17 | Kairos 3.0 | **70.7** | 76.4 | 70.3 | 65.1 | 81.4 | 60.4 |
| 18 | HY-GameCraft | **68.5** | 74.9 | 66.6 | 67.8 | 70.6 | 62.4 |
| 19 | MatrixGame2 | **68.5** | 75.7 | 67.1 | 80.6 | 62.0 | 57.2 |
| 20 | Astra | **64.0** | 69.7 | 59.6 | 67.7 | 71.6 | 51.4 |
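
The average column in the table above is consistent with an unweighted mean of the five dimension scores, rounded to one decimal. The sketch below reproduces that arithmetic for three rows copied from the table. The name `average_score` is illustrative and is not part of the WBench code; the official aggregation may work from unrounded per-case values.

```python
from statistics import mean

# Dimension scores copied from the Navigation-split table above:
# (quality, setting, interaction, consistency, physics)
ROWS = {
    "Astra": (69.7, 59.6, 67.7, 71.6, 51.4),
    "Kling 3.0": (83.0, 91.0, 70.3, 82.5, 69.3),
    "LingBot-World": (81.5, 72.6, 79.8, 88.9, 71.2),
}


def average_score(dims):
    """Unweighted mean of the five dimension scores, rounded to one decimal."""
    return round(mean(dims), 1)


# Sort models by their average, highest first, as the leaderboard does.
ranking = sorted(ROWS, key=lambda name: average_score(ROWS[name]), reverse=True)
for name in ranking:
    print(f"{name}: {average_score(ROWS[name])}")
```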
**9 text-driven models, Full split (5 dimensions, sorted by average)**

| # | Model | **Avg** | Quality | Setting | Interaction | Consistency | Physics |
|:---:|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | Kling 3.0 | **79.5 🥇** | 81.8 🥉 | 91.0 🥈 | 73.1 🥇 | 82.6 | 69.2 🥈 |
| 2 | Wan 2.7 | **78.2 🥈** | 82.2 🥈 | 91.4 🥇 | 72.1 🥈 | 73.8 | 71.6 🥇 |
| 3 | Seedance 1.5 | **76.2 🥉** | 83.0 🥇 | 82.9 | 68.3 🥉 | 78.5 | 68.2 |
| 4 | HY-Video 1.5 | **74.6** | 78.9 | 85.6 🥉 | 54.7 | 86.8 🥇 | 67.1 |
| 5 | LTX 2.3 | **71.0** | 78.8 | 85.2 | 49.4 | 76.4 | 65.1 |
| 6 | Cosmos 2.5 | **70.8** | 74.6 | 83.3 | 43.5 | 85.4 🥉 | 67.0 |
| 7 | LongCat-Video | **70.2** | 79.7 | 72.3 | 45.1 | 85.5 🥈 | 68.4 🥉 |
| 8 | YUME 1.5 | **69.0** | 79.7 | 72.4 | 48.4 | 79.3 | 65.4 |
| 9 | Kairos 3.0 | **66.0** | 75.8 | 70.3 | 41.6 | 81.9 | 60.5 |
**20 models, Navigation split (19 metrics)**

| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Cross-Model Subject Consistency | Visual Plausibility | Causal Fidelity |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| HY-Video 1.5 | 63.4 | 67.4 | 92.1 | 94.2 | 73.9 | 98.7 | 68.0 | 77.5 | 93.6 | 71.8 | 79.2 | 75.1 | 86.6 | 99.4 | 94.6 | 80.3 | 91.6 | 59.7 | 75.0 |
| Kling 3.0 | 63.0 | 68.1 | 92.3 | 93.2 | 97.5 | 97.6 | 69.1 | 89.0 | 92.9 | 70.3 | 75.2 | 75.1 | 76.8 | 93.0 | 88.9 | 79.9 | 88.5 | 60.7 | 78.0 |
| Cosmos 2.5 | 61.8 | 66.9 | 92.3 | 94.8 | 49.0 | 98.2 | 66.5 | 72.4 | 94.2 | 64.1 | 78.1 | 74.3 | 84.3 | 94.3 | 94.6 | 81.6 | 92.3 | 60.1 | 74.7 |
| LTX 2.3 | 57.9 | 61.0 | 88.3 | 93.2 | 98.1 | 96.4 | 56.1 | 81.3 | 89.2 | 67.6 | 70.2 | 70.2 | 69.8 | 75.8 | 76.9 | 79.2 | 87.2 | 55.7 | 74.0 |
| Seedance 1.5 | 61.0 | 69.3 | 89.6 | 92.4 | 99.4 | 97.5 | 73.0 | 71.6 | 94.2 | 68.0 | 72.7 | 72.4 | 70.5 | 96.2 | 82.4 | 76.8 | 90.1 | 60.7 | 76.0 |
| Wan 2.7 | 61.4 | 68.0 | 89.4 | 92.2 | 100.0 | 96.3 | 71.1 | 88.3 | 94.6 | 66.0 | 71.0 | 71.0 | 78.2 | 92.4 | 83.7 | 76.4 | 90.7 | 60.3 | 83.3 |
| Kairos 3.0 | 59.9 | 62.7 | 91.1 | 95.4 | 70.1 | 97.5 | 58.5 | 52.2 | 88.5 | 65.1 | 76.8 | 62.0 | 76.3 | 94.3 | 89.0 | 80.8 | 90.8 | 58.0 | 62.7 |
| LongCat-Video | 66.5 | 69.6 | 95.1 | 94.8 | 45.9 | 97.9 | 77.6 | 53.1 | 91.5 | 63.1 | 83.3 | 66.2 | 81.5 | 99.4 | 95.4 | 82.2 | 93.4 | 61.8 | 76.0 |
| YUME 1.5 | 58.7 | 63.3 | 90.3 | 93.0 | 96.8 | 97.0 | 57.0 | 53.1 | 91.7 | 72.0 | 71.5 | 71.4 | 48.0 | 99.4 | 88.0 | 83.3 | 88.8 | 57.7 | 72.7 |
| Astra | 48.6 | 52.5 | 85.3 | 96.0 | 79.6 | 97.7 | 28.0 | 43.4 | 75.9 | 67.7 | 64.7 | 63.3 | 30.0 | 86.6 | 85.6 | 87.5 | 83.5 | 54.6 | 48.3 |
| Fantasy-World | 63.0 | 62.8 | 94.2 | 95.8 | 49.0 | 97.9 | 65.8 | 52.4 | 90.1 | 72.1 | 80.6 | 64.2 | 79.8 | 100.0 | 95.3 | 84.8 | 92.5 | 59.7 | 74.0 |
| HY-GameCraft | 52.6 | 58.7 | 86.5 | 93.7 | 96.8 | 97.6 | 38.3 | 50.6 | 82.5 | 67.8 | 60.5 | 60.5 | 17.9 | 99.4 | 88.3 | 85.0 | 82.6 | 56.5 | 68.3 |
| Genie 3 | 51.6 | 59.3 | 90.7 | 95.0 | 92.4 | 97.8 | 55.2 | 61.1 | 83.8 | 73.3 | 79.9 | 78.4 | 54.5 | 93.6 | 88.6 | 84.5 | 90.4 | 59.7 | 71.7 |
| Happy Oyster | 56.6 | 63.9 | 91.4 | 94.0 | 94.2 | 97.0 | 58.3 | 57.4 | 91.1 | 85.1 | 77.7 | 75.8 | 75.0 | 96.2 | 87.2 | 79.8 | 91.5 | 57.6 | 69.3 |
| HY-World 1.5 | 60.1 | 65.4 | 92.7 | 93.5 | 91.1 | 98.1 | 60.5 | 53.5 | 90.8 | 87.5 | 90.6 | 84.9 | 62.5 | 100.0 | 92.0 | 83.1 | 89.1 | 58.6 | 74.0 |
| Infinite-World | 58.7 | 66.1 | 88.8 | 94.1 | 82.8 | 98.0 | 62.3 | 54.0 | 84.5 | 75.9 | 74.9 | 74.4 | 33.8 | 100.0 | 94.3 | 85.1 | 88.4 | 57.2 | 67.0 |
| InSpatio-World | 64.4 | 67.6 | 95.0 | 96.0 | 26.1 | 98.8 | 76.1 | 51.7 | 91.1 | 72.8 | 93.8 | 66.5 | 72.5 | 100.0 | 97.3 | 87.4 | 94.4 | 63.1 | 67.3 |
| LingBot-World | 66.9 | 67.9 | 96.9 | 94.1 | 66.2 | 96.9 | 81.4 | 51.6 | 93.6 | 79.8 | 92.7 | 67.1 | 90.9 | 99.4 | 95.4 | 83.3 | 93.5 | 64.8 | 77.7 |
| MatrixGame2 | 54.0 | 60.3 | 86.9 | 94.6 | 94.9 | 98.2 | 41.0 | 49.4 | 84.9 | 80.6 | 64.5 | 64.5 | 29.2 | 21.0 | 86.1 | 81.3 | 87.2 | 55.0 | 59.3 |
| MatrixGame3 | 46.4 | 70.0 | 85.7 | 86.3 | 97.5 | 95.4 | 57.1 | 48.9 | 78.4 | 83.5 | 81.0 | 80.4 | 13.3 | 89.8 | 87.6 | 75.3 | 83.0 | 54.0 | 64.7 |
**9 text-driven models, Full split (22 metrics)**

| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Event Edit Adherence | Subject Action Adherence | Perspective Switch Adherence | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Cross-Model Subject Consistency | Visual Plausibility | Causal Fidelity |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| HY-Video 1.5 | 61.9 | 67.4 | 92.4 | 95.5 | 68.8 | 98.8 | 67.5 | 77.5 | 93.6 | 71.8 | 63.8 | 55.6 | 27.6 | 79.2 | 75.1 | 86.6 | 99.3 | 94.4 | 81.4 | 91.5 | 59.3 | 75.0 |
| Kling 3.0 | 61.3 | 67.7 | 92.7 | 94.5 | 89.9 | 97.9 | 68.8 | 89.0 | 92.9 | 70.3 | 81.4 | 85.6 | 55.0 | 75.2 | 75.1 | 76.8 | 92.7 | 89.4 | 80.4 | 88.5 | 60.4 | 78.0 |
| Cosmos 2.5 | 60.1 | 67.2 | 92.3 | 96.0 | 42.4 | 98.3 | 65.9 | 72.4 | 94.2 | 64.1 | 48.2 | 41.6 | 20.0 | 78.1 | 74.3 | 84.3 | 93.1 | 94.2 | 82.1 | 91.8 | 59.3 | 74.7 |
| LTX 2.3 | 56.9 | 62.3 | 89.3 | 94.1 | 94.4 | 96.8 | 57.7 | 81.3 | 89.2 | 67.6 | 53.0 | 51.8 | 25.0 | 70.2 | 70.2 | 69.8 | 77.8 | 81.1 | 79.4 | 86.7 | 56.2 | 74.0 |
| Seedance 1.5 | 59.7 | 69.8 | 89.6 | 93.4 | 98.3 | 97.6 | 72.9 | 71.6 | 94.2 | 68.0 | 80.4 | 80.0 | 45.0 | 72.7 | 72.4 | 62.7 | 92.4 | 83.5 | 76.7 | 89.3 | 60.5 | 76.0 |
| Wan 2.7 | 59.6 | 68.1 | 89.5 | 93.0 | 99.3 | 96.5 | 69.4 | 88.3 | 94.6 | 66.0 | 84.0 | 83.4 | 55.0 | 71.0 | 71.0 | 62.2 | 65.6 | 82.6 | 75.5 | 88.7 | 59.8 | 83.3 |
| Kairos 3.0 | 58.4 | 63.6 | 91.8 | 96.3 | 63.5 | 97.9 | 58.8 | 52.2 | 88.5 | 65.1 | 46.8 | 41.4 | 13.3 | 76.8 | 62.0 | 76.3 | 94.1 | 91.5 | 82.1 | 90.7 | 58.2 | 62.7 |
| LongCat-Video | 64.7 | 69.8 | 94.7 | 94.9 | 59.7 | 97.7 | 76.3 | 53.1 | 91.5 | 63.1 | 50.4 | 48.4 | 18.3 | 83.3 | 66.2 | 81.5 | 98.6 | 94.7 | 81.5 | 92.4 | 60.8 | 76.0 |
| YUME 1.5 | 59.3 | 65.7 | 92.0 | 94.8 | 86.1 | 97.7 | 62.0 | 53.1 | 91.7 | 72.0 | 57.8 | 47.0 | 16.7 | 71.5 | 71.4 | 48.0 | 99.3 | 91.1 | 84.1 | 89.4 | 58.1 | 72.7 |
## 🚀 Quick Start
```
# Install
git clone --recursive https://github.com/meituan-longcat/WBench.git
cd WBench
# If you already cloned without submodules
git submodule update --init --recursive
# Download data and weights
pip install huggingface_hub
hf download meituan-longcat/WBench --repo-type dataset --local-dir data/ --exclude "splits/*"
hf download meituan-longcat/WBench-weights --local-dir weights/
# Environment 1: wbench-main (all metrics except visual_plausibility)
# The 2nd argument is the CUDA build of PyTorch; it must match your system (check with `nvcc --version`):
#   cu124 → CUDA 12.x   cu121 → CUDA 12.1   cu118 → CUDA 11.8
# Always pass it explicitly: if omitted, auto-detection falls back to cu118 when nvcc is not on PATH,
# which makes the MegaSAM CUDA extensions fail to build on CUDA 12 machines.
bash tools/install.sh wbench-main cu124
conda activate wbench-main
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
# Verify
conda activate wbench-main
python tools/verify_install.py
# Run evaluation (automatic multi-GPU)
python main.py --model your_model
```
For detailed setup instructions, see [docs/installation.md](docs/installation.md).
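
The comments in the install block ask you to pass the CUDA build tag explicitly. As a convenience, the tag could be derived from the text printed by `nvcc --version`. The following is a minimal sketch under stated assumptions: `cuda_tag_from_nvcc` and `detect_cuda_tag` are hypothetical helpers, not part of `tools/install.sh`, and the mapping only mirrors the three tags listed above (any other CUDA 12 minor version is assumed to use `cu124`).

```python
import re
import shutil
import subprocess


def cuda_tag_from_nvcc(nvcc_output):
    """Map `nvcc --version` text to one of the tags listed in Quick Start."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    if major == 12:
        # Assumption: cu121 only for CUDA 12.1, cu124 for other 12.x releases.
        return "cu121" if minor == 1 else "cu124"
    if major == 11 and minor == 8:
        return "cu118"
    return None  # version not covered by the tags in this README


def detect_cuda_tag():
    """Return a tag for the local toolkit, or None if nvcc is unavailable."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return cuda_tag_from_nvcc(result.stdout)


sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(cuda_tag_from_nvcc(sample))
```

When `detect_cuda_tag()` returns `None`, choose the tag by hand rather than relying on a fallback.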
## 🎮 Evaluate Your Model
First set the environment variables for the VLM metrics (we use [Doubao-Seed-2.0-lite](https://console.volcengine.com/ark/region:ark+cn-beijing/model/detail?Id=doubao-seed-2-0-lite) via [Volcengine ARK](https://www.volcengine.com/docs/82379/1099475)):
```
export VLM_API_KEY=""  # paste your API key between the quotes
# Optional (defaults shown):
# export VLM_API_URL="https://ark.cn-beijing.volces.com/api/v3"
# export VLM_MODEL_NAME="doubao-seed-2-0-lite-260215"
```
1. Generate multi-turn videos and place them at `work_dirs//videos/case_{id}_combined.mp4`
2. Run the pipeline (three compute phases plus a final report):
```
# Full pipeline (precompute → GPU metrics → VLM metrics → report)
python main.py --model my_model --gpus 0,1,2,3,4,5,6,7
# Or run each phase on its own:
python main.py --model my_model --phase precompute # SAM2 + DA3 + MegaSAM
python main.py --model my_model --phase gpu # GPU metrics (per-metric)
python main.py --model my_model --phase vlm # VLM metrics (API)
python main.py --model my_model --phase report # Aggregate report
```
**Note:** The pipeline above covers 21 of the 22 metrics. `visual_plausibility` is the exception: it runs in a **separate `wbench-vp` environment** (set up in [Quick Start](#-quick-start)):
```
conda activate wbench-vp
python tools/run_visual_plausibility.py --model my_model # uses all available GPUs
```
3. Results: `work_dirs//evaluation/{metric}/case_{id}.json` + `report.json`
```
# Run specific metrics (by name or by dimension)
python main.py --model my_model --phase gpu --metrics hpsv3_quality
python main.py --model my_model --phase gpu --metrics quality # all 6 video quality
python main.py --model my_model --phase gpu --metrics consistency # all consistency metrics
# Skip pre-computation if it is already done
python main.py --model my_model --phase gpu --skip_megasam --skip_sam2 --skip_da3
# Evaluate a single video
python main.py --video video.mp4 --case data/cases/case_1.json
```
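
Before launching the pipeline, it can help to confirm that every generated video follows the `case_{id}_combined.mp4` naming from step 1. The helper below is a minimal sketch and not part of WBench: `find_case_videos` and `missing_cases` are hypothetical names, and the code assumes case IDs are integers.

```python
import re
from pathlib import Path

# File naming taken from step 1: case_{id}_combined.mp4
VIDEO_PATTERN = re.compile(r"^case_(\d+)_combined\.mp4$")


def find_case_videos(videos_dir):
    """Return {case_id: path} for files named case_{id}_combined.mp4."""
    found = {}
    for path in Path(videos_dir).iterdir():
        match = VIDEO_PATTERN.match(path.name)
        if match:
            found[int(match.group(1))] = path
    return found


def missing_cases(videos_dir, expected_ids):
    """List the expected case IDs that have no matching video file."""
    found = find_case_videos(videos_dir)
    return sorted(set(expected_ids) - set(found))
```

Pass the videos directory of your model and the case IDs you expect; an empty list means every expected video is in place.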
**Dimensions** (`--metrics` accepts the following shorthands):
| Dimension | Metrics |
|:---|:---|
| `quality` | aesthetic_quality, imaging_quality, temporal_flickering, dynamic_degree, motion_smoothness, hpsv3_quality |
| `consistency` | background_consistency, segment_continuity, perspective_consistency, subject_consistency, geometric_consistency, photometric_consistency, spatial_consistency, gated_spatial_consistency |
| `interaction` | navigation_trajectory, event_edit_adherence, subject_action_adherence, perspective_switch_adherence |
| `setting` | scene_adherence, subject_adherence |
| `physical` | visual_plausibility, causal_fidelity |
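
To make the shorthand behaviour concrete, here is one way a dimension name could be expanded into its metric names. This is an illustrative sketch only: `expand_metrics` is a hypothetical function, not the actual argument handling in `main.py`, and the dictionary is copied from the table above.

```python
# Dimension → metric names, copied from the table above.
DIMENSIONS = {
    "quality": [
        "aesthetic_quality", "imaging_quality", "temporal_flickering",
        "dynamic_degree", "motion_smoothness", "hpsv3_quality",
    ],
    "consistency": [
        "background_consistency", "segment_continuity", "perspective_consistency",
        "subject_consistency", "geometric_consistency", "photometric_consistency",
        "spatial_consistency", "gated_spatial_consistency",
    ],
    "interaction": [
        "navigation_trajectory", "event_edit_adherence",
        "subject_action_adherence", "perspective_switch_adherence",
    ],
    "setting": ["scene_adherence", "subject_adherence"],
    "physical": ["visual_plausibility", "causal_fidelity"],
}


def expand_metrics(names):
    """Expand dimension shorthands into metric names, keeping order, dropping duplicates."""
    expanded = []
    for name in names:
        # A name that is not a dimension is treated as a single metric name.
        for metric in DIMENSIONS.get(name, [name]):
            if metric not in expanded:
                expanded.append(metric)
    return expanded


print(expand_metrics(["setting", "hpsv3_quality"]))
```

Expanding all five dimensions yields the 22 metric names listed in the table.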
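## 🔥 Implement Your Model
WBench supports three model types with different control interfaces:
| Type | Input | Cases | Status |
|:---|:---|:---:|:---:|
| **Text-conditioned** | Text prompt + first-frame image | 289 (all) | ✅ Implemented |
| **Camera-conditioned** | First-frame image + 6-DoF camera poses | 158 (navigation) | ✅ Implemented |
| **Action-conditioned** | First-frame image + discrete actions | 158 (navigation) | ✅ Implemented |
### Text-Conditioned Models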
```
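from src.models import get_model
# Available: wan, kling, seedance (or register your own)
model = get_model("wan")
# Generate a multi-turn video from a single case.
# case_dict is assumed to be a parsed case, e.g. loaded from data/cases/case_1.json.
result = model.generate_multi_turn(
    case=case_dict,
    output_path="work_dirs/wan/videos/case_1_combined.mp4",
    data_root="data/",
)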
```
Each turn: build a prompt from the interaction → call the I2V API → extract the last frame → move to the next turn.
Set the API credentials:
```
export VIDEO_API_URL="https://your-video-api.com"
export VIDEO_API_KEY="your-key"
```
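
The per-turn loop described above (build a prompt, call the I2V API, keep the last frame, continue) can be sketched as follows. This is not the WBench implementation: `generate_multi_turn_sketch`, `demo_prompt`, and `demo_i2v` are hypothetical names, and frames are represented as plain strings so the example runs without any video API.

```python
def generate_multi_turn_sketch(first_frame, interactions, build_prompt, call_i2v):
    """Chain I2V calls: each turn starts from the last frame of the previous one."""
    frames = []
    current = first_frame
    for interaction in interactions:
        prompt = build_prompt(interaction)
        clip = call_i2v(prompt, current)  # expected to return a list of frames
        if not clip:
            raise ValueError("I2V call returned no frames")
        frames.extend(clip)
        current = clip[-1]  # the last frame conditions the next turn
    return frames


# Stand-in callables so the sketch runs offline.
def demo_prompt(interaction):
    return f"Continue the scene: {interaction}"


def demo_i2v(prompt, image):
    # Pretend each call yields three frames derived from the input image.
    return [f"{image}+{i}" for i in range(3)]


video = generate_multi_turn_sketch("f0", ["walk forward", "turn left"], demo_prompt, demo_i2v)
print(len(video), video[-1])
```

Passing the prompt builder and the I2V call as arguments keeps the loop independent of any particular provider.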
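### Camera-Conditioned Models
The benchmark's navigation actions (W/A/S/D + arrow keys) are converted into a per-turn
`{move, yaw, pitch}` intent and then into a 6-DoF camera trajectory. Subclass
`CameraConditionedModel` and implement one hook; case parsing, action-to-pose
conversion, and video writing are handled for you: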
```
from src.models.camera import CameraConditionedModel
class MyWorldModel(CameraConditionedModel):
    def generate_with_poses(self, image, poses, video_length, **kw):
        # image: first-frame path; poses: {"": {"extrinsic": 4x4, "K": 3x3}, ...}
        # return: list of `video_length` BGR uint8 frames
        return my_model.infer(image, poses, video_length)
MyWorldModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")
```
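The pose conventions (axes, speeds, intrinsics) live in `src/models/camera/poses.py`;
copy and adapt them for your model. The navigation metrics normalize scale, so
what matters is matching the *intent* of each action. To inspect a case quickly: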
```
python -m src.models.camera.demo --case data/cases/case_1.json # prints poses + renders a preview
```
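
To illustrate the key → intent → trajectory idea from this section, the sketch below integrates per-turn intents into per-frame 6-DoF tuples. Everything here is an assumption made for illustration: the key bindings, the `(right, forward)` layout of `move`, the step size, the turn rate, and the axis convention (yaw 0 looks along +z, positive yaw turns toward +x). The actual conventions are the ones in `src/models/camera/poses.py`, which also produces extrinsic and intrinsic matrices rather than the tuples used here.

```python
import math

# Hypothetical bindings; "move" is (right, forward), yaw/pitch are signed rates.
KEY_TO_INTENT = {
    "W": {"move": (0, 1), "yaw": 0, "pitch": 0},      # forward
    "S": {"move": (0, -1), "yaw": 0, "pitch": 0},     # backward
    "A": {"move": (-1, 0), "yaw": 0, "pitch": 0},     # strafe left
    "D": {"move": (1, 0), "yaw": 0, "pitch": 0},      # strafe right
    "LEFT": {"move": (0, 0), "yaw": -1, "pitch": 0},  # turn left
    "RIGHT": {"move": (0, 0), "yaw": 1, "pitch": 0},  # turn right
    "UP": {"move": (0, 0), "yaw": 0, "pitch": 1},     # look up
    "DOWN": {"move": (0, 0), "yaw": 0, "pitch": -1},  # look down
}


def intents_to_trajectory(keys, frames_per_turn=10, step=0.1, turn_deg=1.0):
    """Integrate one key per turn into per-frame (x, y, z, yaw, pitch, roll)."""
    x = z = yaw = pitch = 0.0
    trajectory = []
    for key in keys:
        intent = KEY_TO_INTENT[key]
        right, forward = intent["move"]
        for _ in range(frames_per_turn):
            yaw += intent["yaw"] * turn_deg
            pitch += intent["pitch"] * turn_deg
            heading = math.radians(yaw)
            # Move in the ground plane relative to the current heading.
            x += step * (forward * math.sin(heading) + right * math.cos(heading))
            z += step * (forward * math.cos(heading) - right * math.sin(heading))
            trajectory.append((x, 0.0, z, yaw, pitch, 0.0))
    return trajectory


path = intents_to_trajectory(["W", "RIGHT", "W"])
print(len(path), path[-1])
```

Because the navigation metrics normalize scale, the absolute values of `step` and `turn_deg` matter less than the direction each key produces.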
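### Action-Conditioned Models
Two flavors, both driven by the same per-turn navigation plan:
**Programmatic controllers** (e.g. Matrix-Game-3). Subclass `ActionConditionedModel`
and implement `generate_with_actions`. Each action carries the raw key `token`
and MG3-style `{keyboard, mouse}` tensors: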
```
from src.models.action import ActionConditionedModel
class MyActionModel(ActionConditionedModel):
    def generate_with_actions(self, image, actions, video_length, **kw):
        # actions: [{"turn", "tokens", "keyboard", "mouse", "duration"}, ...]
        return my_model.infer(image, actions, video_length)
MyActionModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")
```
```
python -m src.models.action.demo --case data/cases/case_1.json # prints actions + renders a preview
```
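
For orientation, the sketch below builds a list shaped like the `actions` argument documented in the comment above, using the same five keys. The encoding is an assumption made purely for illustration: the keyboard channel order, the mouse `(dx, dy)` values, the zero-based `turn` index, and the meaning of `duration` are all hypothetical, and plain Python lists stand in for tensors. `build_actions` is not a WBench function.

```python
KEYBOARD_KEYS = ["W", "A", "S", "D"]  # assumed channel order
MOUSE_DELTAS = {                      # assumed (dx, dy) per arrow key
    "LEFT": (-1.0, 0.0),
    "RIGHT": (1.0, 0.0),
    "UP": (0.0, 1.0),
    "DOWN": (0.0, -1.0),
}


def build_actions(turn_tokens, duration=1.0):
    """Turn per-turn key tokens into dicts with the five documented keys."""
    actions = []
    for turn, tokens in enumerate(turn_tokens):
        # Multi-hot keyboard vector: 1.0 where the key is pressed this turn.
        keyboard = [1.0 if key in tokens else 0.0 for key in KEYBOARD_KEYS]
        dx = sum(MOUSE_DELTAS[t][0] for t in tokens if t in MOUSE_DELTAS)
        dy = sum(MOUSE_DELTAS[t][1] for t in tokens if t in MOUSE_DELTAS)
        actions.append({
            "turn": turn,
            "tokens": list(tokens),
            "keyboard": keyboard,
            "mouse": [dx, dy],
            "duration": duration,
        })
    return actions


for action in build_actions([["W"], ["D", "LEFT"]]):
    print(action)
```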
**Web products** (e.g. Project Genie, Happy Oyster): no weights or API; driven by
browser automation plus simulated key presses. See
[`src/models/action/web/`](src/models/action/web/README.md) for details.
## 🤖 Claude Code Skills
If you use [Claude Code](https://claude.com/claude-code), this repository ships with
skills that drive the full workflow. Ask in natural language and Claude
runs the right commands:
| Skill | Trigger | What it does |
|:---|:---|:---|
| `wbench-generate` | "generate kling videos" | Runs `generate.py` over the dataset → `work_dirs//videos/` |
| `wbench-evaluate` | "evaluate kling3" | Runs the 4-phase `main.py` pipeline (precompute → gpu → vlm → report) |
| `wbench-submit` | "package my model for submission" | Builds the `meta.json` / `turns.json` bundle and uploads it to HuggingFace |
| `genie3` / `happy` | "run case_5 on genie3" | Browser automation for web products ([details](src/models/action/web/README.md)) |
Skills live in `.claude/skills/` (and `src/models/action/web/.claude/skills/`) and
are discovered automatically when you open the repository in Claude Code.
## 📋 TODO
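- [x] Text-conditioned model generation (Wan, Kling, Seedance)
- [x] Homepage with interactive leaderboard
- [x] Release dataset and weights on HuggingFace
- [x] Camera-conditioned model generation example
- [x] Action-conditioned model generation example
- [ ] Hosted submission and evaluation service (submit videos, get scores)
- [x] arXiv paper release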
## 📝 Citation
If you find our work useful, please consider citing:
```
@article{ying2026wbenchcomprehensivemultiturnbenchmark,
title={WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation},
author={Ying, Kaining and Hu, Hengrui and Ren, Siyu and Li, Jiamu and Chen, Fengjiao and Wang, Ziwen and Cao, Xuezhi and Cai, Xunliang and Ding, Henghui},
journal={arXiv preprint arXiv:2605.25874},
year={2026}
}
```
## 🙏 Acknowledgements
This project builds on the following excellent work:
- [WorldScore](https://github.com/WorldScore/WorldScore): world model evaluation framework
- [VBench](https://github.com/Vchitect/VBench): video quality metrics
- [SAM2](https://github.com/facebookresearch/sam2): Segment Anything Model 2, used for mask tracking
- [Depth-Anything-V3](https://github.com/DepthAnything/Depth-Anything-V3): monocular depth estimation
- [MegaSAM](https://github.com/mega-sam/mega-sam): camera pose estimation
- [DreamSim](https://github.com/ssundaram21/dreamsim): perceptual similarity metric
- HPSv3