arXiv submission date: 2026-02-05
📄 Abstract - RISE-Video: Can Video Generators Decode Implicit World Rules?

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: *Reasoning Alignment*, *Temporal Consistency*, *Physical Rationality*, and *Visual Quality*. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.

Top-level tags: video generation benchmark model evaluation
Detailed tags: reasoning benchmark text-to-video multimodal evaluation world rules temporal consistency

RISE-Video: Can Video Generators Decode Implicit World Rules?


1️⃣ One-sentence summary

This paper proposes a benchmark called RISE-Video that evaluates whether video generation models genuinely understand and follow the implicit rules of the physical world and common sense, rather than merely producing visually appealing footage; the results show that existing models are broadly deficient in this respect.

Source: arXiv:2602.05986