World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning
1️⃣ One-Sentence Summary
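This paper proposes a framework that combines a world model (for generating concrete visual predictions) with a multimodal large language model (for abstract reasoning). Through self-training, the model learns to decide on its own when to invoke visual simulation and to verify the result, which significantly improves accuracy and robustness on spatial reasoning and open-domain physical prediction tasks.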
World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at this https URL.
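The invoke, verify, and integrate steps named in the abstract can be pictured as a small control loop. What follows is a minimal sketch under stated assumptions, not the authors' implementation: every function name, the callable interface, and the fixed `min_credibility` threshold are hypothetical. According to the abstract these decisions are learned by the model, whereas here they are passed in as plain callables to make the control flow visible.

```python
from typing import Any, Callable, Tuple


def controlled_concrete_reasoning(
    observation: Any,
    question: str,
    answer_abstractly: Callable[[Any, str], str],
    wants_simulation: Callable[[Any, str], bool],
    simulate_future: Callable[[Any, str], Any],
    rollout_credibility: Callable[[Any, str, Any], float],
    answer_with_rollout: Callable[[Any, str, Any], str],
    min_credibility: float = 0.5,  # hypothetical fixed threshold
) -> Tuple[str, str]:
    """Return (answer, route); route records which path produced the answer.

    All callables are hypothetical stand-ins for learned components.
    """
    # Invoke: decide whether visual simulation is likely to help at all.
    if not wants_simulation(observation, question):
        return answer_abstractly(observation, question), "abstract_only"

    # The rollout is stochastic and may look plausible yet be task-incorrect.
    rollout = simulate_future(observation, question)

    # Verify: score the rollout before trusting it.
    score = rollout_credibility(observation, question, rollout)
    if score < min_credibility:
        # Fall back to abstract reasoning when the rollout is not credible.
        return answer_abstractly(observation, question), "rollout_rejected"

    # Integrate: let the accepted rollout inform the final answer.
    return answer_with_rollout(observation, question, rollout), "rollout_used"
```

Returning the route alongside the answer is a design choice made only for this illustration; it makes it easy to check how often simulation was skipped, rejected, or used.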
Source: arXiv: 2606.03603