4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking
1️⃣ One-sentence summary
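This paper proposes a worldline-centered approach to 4D dynamic scene understanding that links language instructions, object identity, 3D motion, and multi-view 2D projections, and it introduces a large benchmark dataset and a tracking model that substantially improve target grounding and trajectory recovery accuracy in complex dynamic scenes.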
4D dynamic scene understanding requires grounding language to a persistent worldline that binds identity, metric 3D motion, and synchronized multi-view 2D projections. Existing paradigms capture only part of this structure: large multimodal models reason over rich visual evidence but rarely preserve metric topology, while vision-language tracking remains tied to fragmented 2D or 3D outputs and local continuation. We therefore introduce \textbf{4DVLT}, a worldline-centered task for instruction-conditioned 4D dynamic scene understanding in fully observed multi-view video, and \textbf{Instruct-4D}, a benchmark with 129.4K question-answer pairs, 64.7K target entities, 851 scenes, and 9 reasoning-oriented query types. To address this setting, we present \textbf{4DTrack}, which casts instruction-conditioned tracking as graph-conditioned worldline inference through an object-centric 4D state graph, metric-guided routing, bidirectional decoding, and kinematic calibration. On Instruct-4D, 4DTrack-Qwen3.5-9B reaches 62.68 $\mathrm{TGA}_{\mathrm{Top1}}$ and surpasses the best adapted VLT baseline by 19.62 points. These results show that worldline-centered modeling improves both target grounding and recovered worldline quality. The project page is available at this https URL.
Source: arXiv: 2606.22631