MMGR: Multi-Modal Generative Reasoning Evaluation and Benchmark
1️⃣ One-Sentence Summary
This paper introduces MMGR, an evaluation framework that systematically tests video and image generation models on five reasoning abilities (physical, logical, 3D spatial, 2D spatial, and temporal). It finds that current mainstream models fall seriously short on abstract reasoning and long-horizon spatial planning, pointing the way toward generative models with genuine reasoning ability.
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Fréchet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing substantial performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations of current models, including overreliance on perceptual data, weak global state consistency, and training objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
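The abstract emphasizes that MMGR's metrics demand holistic correctness rather than partial credit. As a concrete illustration only (the paper's actual scoring code is not reproduced here), the Python sketch below shows what an all-or-nothing, per-ability accuracy of this kind might look like for ARC-AGI-style grid outputs; all names (`Task`, `score_holistic`, `benchmark`, `REASONING_ABILITIES`) are hypothetical, not the paper's API.

```python
# Hypothetical sketch of an MMGR-style holistic-correctness metric.
# Illustrative only: names and data layout are assumptions, not the paper's code.
from dataclasses import dataclass
from collections import defaultdict

# The five reasoning abilities named in the abstract.
REASONING_ABILITIES = ["physical", "logical", "3d_spatial", "2d_spatial", "temporal"]

@dataclass
class Task:
    ability: str                 # one of REASONING_ABILITIES
    target: list[list[int]]      # ground-truth grid (e.g., an ARC-AGI output grid)

def score_holistic(predicted: list[list[int]], target: list[list[int]]) -> float:
    """All-or-nothing score: 1.0 only if every cell matches, else 0.0.

    Partial credit would reward visually plausible but causally wrong outputs,
    which is exactly the failure mode MMGR's metrics are meant to expose.
    """
    return float(predicted == target)

def benchmark(tasks: list[Task], predictions: list[list[list[int]]]) -> dict[str, float]:
    """Aggregate holistic-correctness accuracy per reasoning ability."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, float] = defaultdict(float)
    for task, pred in zip(tasks, predictions):
        totals[task.ability] += 1
        correct[task.ability] += score_holistic(pred, task.target)
    return {ability: correct[ability] / totals[ability] for ability in totals}

if __name__ == "__main__":
    tasks = [Task("logical", [[1, 0], [0, 1]]), Task("logical", [[2, 2], [2, 2]])]
    preds = [[[1, 0], [0, 1]], [[2, 2], [0, 2]]]  # second prediction has one wrong cell
    print(benchmark(tasks, preds))                 # {'logical': 0.5}
```

Under a holistic metric like this, a prediction with a single wrong cell scores zero, which helps explain why the benchmarked models fall below 10 percent accuracy on ARC-AGI despite producing plausible-looking outputs.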
Source: arXiv:2512.14691