WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
Abstract
WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.
Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.
Community
upload
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/wbench-a-comprehensive-multi-turn-benchmark-for-interactive-video-world-model-evaluation-8328-76f35fba
Covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WorldMark: A Unified Benchmark Suite for Interactive Video World Models (2026)
- PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media (2026)
- iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework (2026)
- Do Joint Audio-Video Generation Models Understand Physics? (2026)
- MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation (2026)
- Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation (2026)
- OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2605.25874 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 1
Datasets citing this paper 1
meituan-longcat/WBench
Spaces citing this paper 0
No Space linking this paper