Abstract - WorldMark: A Unified Benchmark Suite for Interactive Video World Models
Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet each model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark to provide such a level playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison of six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60 s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (this http URL), an online platform where anyone can pit leading world models against each other in side-by-side battles and follow the live leaderboard.
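The unified action-mapping layer is the piece that makes cross-model comparison possible: a shared WASD-style action vocabulary is translated into each model's native control format by per-model adapters. The paper does not specify the interface, so the following is a minimal hypothetical sketch; all class, function, and field names here are illustrative, not WorldMark's actual API.

```python
from dataclasses import dataclass

# Shared action vocabulary: discrete movement key plus camera deltas.
@dataclass(frozen=True)
class Action:
    move: str           # one of "W", "A", "S", "D", or "" for no movement
    yaw: float = 0.0    # camera rotation in degrees
    pitch: float = 0.0

def to_keyboard_mouse(action: Action) -> dict:
    """Adapter for a model that consumes raw key presses and mouse deltas."""
    return {
        "keys": [action.move] if action.move else [],
        "mouse_dx": action.yaw,
        "mouse_dy": action.pitch,
    }

def to_velocity(action: Action) -> dict:
    """Adapter for a model that consumes continuous velocity commands."""
    vx = {"W": 1.0, "S": -1.0}.get(action.move, 0.0)
    vy = {"D": 1.0, "A": -1.0}.get(action.move, 0.0)
    return {"vx": vx, "vy": vy, "yaw": action.yaw, "pitch": action.pitch}

# One shared trajectory drives every model through its adapter, so all
# models are tested on identical scenes and identical action sequences.
trajectory = [Action("W"), Action("W", yaw=15.0), Action("D")]
native_for_model_a = [to_keyboard_mouse(a) for a in trajectory]
native_for_model_b = [to_velocity(a) for a in trajectory]
```

The key design point is that the benchmark stores only the shared trajectory; each model's heterogeneous input format is handled entirely inside its adapter.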
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
1️⃣ One-Sentence Summary
To address the problem that current interactive video generation models (such as Genie and YUME) are each evaluated on private scenes and trajectories and therefore cannot be fairly compared, this paper introduces WorldMark, the first unified benchmark. Through a standardized action-mapping layer, a hierarchical test-case suite, and a modular evaluation toolkit, it enables different models to be compared fairly on identical scenes and identical action sequences, and it is accompanied by an online arena platform.