arXiv submission date: 2026-04-29
📄 Abstract - World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms the test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.

Top-level tags: multi-modal · model training · computer vision
Detailed tags: world model · spatial reasoning · vision-language model · distillation · egocentric motion

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning


1️⃣ One-sentence summary

This paper proposes World2VLM, a new training framework in which a generative world model "teaches" a vision-language model at training time how to predict a scene after a viewpoint change, improving the VLM's dynamic spatial reasoning without adding any inference-time computational overhead.
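To make the data-generation recipe from the abstract concrete, here is a minimal, illustrative sketch of how forward (action-to-outcome) and inverse (outcome-to-action) supervision could be derived from a world model. This is not the authors' code: `CameraTrajectory`, `rollout_world_model`, and the QA phrasing are all hypothetical placeholders, and the world-model call is stubbed out.

```python
# A minimal sketch of World2VLM-style supervision, assuming a hypothetical
# world-model interface. All names here are illustrative placeholders.
import random
from dataclasses import dataclass

@dataclass
class CameraTrajectory:
    """A parameterized egocentric motion, e.g. 'forward 1 m, turn left 30 deg'."""
    forward_m: float
    yaw_deg: float

    def describe(self) -> str:
        return f"move forward {self.forward_m} m, then turn {self.yaw_deg} degrees"

def rollout_world_model(initial_view: str, traj: CameraTrajectory) -> str:
    """Placeholder for the view-consistent world model: given the first
    observation and a camera trajectory, synthesize the future view.
    A real pipeline would call a generative novel-view/video model here."""
    return f"{initial_view}__fwd{traj.forward_m}_yaw{traj.yaw_deg}.png"

def make_forward_pair(initial_view: str, traj: CameraTrajectory,
                      distractors: list[CameraTrajectory]) -> dict:
    # Forward (action -> outcome): given the start view and a described motion,
    # the model must pick which candidate image shows the resulting view.
    options = [rollout_world_model(initial_view, t) for t in [traj] + distractors]
    random.shuffle(options)
    return {
        "images": [initial_view],
        "question": f"If the camera {traj.describe()}, which view will it see? "
                    f"Options: {options}",
        "answer": rollout_world_model(initial_view, traj),
    }

def make_inverse_pair(initial_view: str, traj: CameraTrajectory) -> dict:
    # Inverse (outcome -> action): given the start and end views, the model
    # must infer which camera motion connects them.
    future_view = rollout_world_model(initial_view, traj)
    return {
        "images": [initial_view, future_view],
        "question": "What camera motion transforms the first view into the second?",
        "answer": traj.describe(),
    }

if __name__ == "__main__":
    traj = CameraTrajectory(forward_m=1.0, yaw_deg=-30.0)
    hard_negatives = [CameraTrajectory(1.0, 30.0), CameraTrajectory(2.0, -30.0)]
    print(make_forward_pair("scene0.png", traj, hard_negatives))
    print(make_inverse_pair("scene0.png", traj))
```

In the actual system, the synthesized views would be geometrically aligned images rather than file-name stubs, and QA pairs like these would form the compact dataset used in the paper's two-stage post-training of the VLM.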

From arXiv: 2604.26934