arXiv submission date: 2026-03-19
📄 Abstract - Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at this https URL.
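The abstract describes fusing video-diffusion (geometric) features with MLLM (semantic) features through a token-level adaptive gated fusion mechanism. As a rough illustration of that idea only — the function and parameter names below are hypothetical, not from the paper's released code — a per-token scalar gate can decide how much geometric signal to mix into each semantic token:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(sem, geo, w, b):
    """Hypothetical token-level adaptive gated fusion.

    sem, geo: (num_tokens, dim) semantic and geometric token features.
    w: (2*dim,) gate projection weights; b: scalar bias.
    Each token gets its own gate in (0, 1), so the output is a per-token
    convex combination of the two feature streams.
    """
    concat = np.concatenate([sem, geo], axis=-1)   # (num_tokens, 2*dim)
    gate = sigmoid(concat @ w + b)[:, None]        # (num_tokens, 1)
    return gate * geo + (1.0 - gate) * sem

num_tokens, dim = 4, 8
sem = rng.standard_normal((num_tokens, dim))   # stand-in for MLLM features
geo = rng.standard_normal((num_tokens, dim))   # stand-in for diffusion features
w = rng.standard_normal(2 * dim) * 0.1
b = 0.0

fused = gated_fusion(sem, geo, w, b)
print(fused.shape)  # (4, 8)
```

This is only a minimal sketch of the gating pattern; the paper's actual mechanism operates on features extracted from intermediate noise levels of a pre-trained video diffusion model, and its exact parameterization is not specified in the abstract.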

Top-level tags: multi-modal, computer vision, model training
Detailed tags: 3d scene understanding, video diffusion models, spatial reasoning, latent world simulator, multimodal fusion

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding


1️⃣ One-sentence summary

This paper proposes a new method that taps the 3D structural and physical-law knowledge implicitly learned by large-scale video generation models to strengthen the spatial perception and reasoning abilities of multimodal large language models, without relying on scarce 3D annotated data.

Source: arXiv 2603.19235