Factorized Learning for Temporally Grounded Video-Language Models
1️⃣ One-Sentence Summary
This paper proposes a new framework called D²VLM, which decouples the two tasks of temporal grounding and text generation in video understanding while emphasizing their inherent dependency, and introduces a novel factorized preference optimization algorithm, significantly improving the model's ability to precisely localize events in time and answer reliably.
Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at this https URL.
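The abstract does not spell out the exact form of the FPO objective, but the idea of "explicitly incorporating probabilistic temporal grounding modeling into the optimization objective" can be illustrated with a DPO-style loss applied to a factorized log-likelihood, log p(grounding, response) = log p(grounding) + log p(response | grounding). The sketch below is only an assumed illustration of this factorization; all function and argument names are hypothetical, and the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def factorized_dpo_loss(
    policy_ground_logps_w, policy_resp_logps_w,   # chosen sample: grounding / response log-probs under the policy
    policy_ground_logps_l, policy_resp_logps_l,   # rejected sample: same quantities under the policy
    ref_ground_logps_w, ref_resp_logps_w,         # chosen sample under the frozen reference model
    ref_ground_logps_l, ref_resp_logps_l,         # rejected sample under the frozen reference model
    beta: float = 0.1,
):
    """Hypothetical DPO-style preference loss on a factorized log-likelihood:
    log p(grounding, response | video, query)
        = log p(grounding | video, query) + log p(response | grounding, video, query),
    so that preference signals supervise both temporal grounding and textual response.
    """
    # Joint log-probabilities via the factorization above.
    policy_logps_w = policy_ground_logps_w + policy_resp_logps_w
    policy_logps_l = policy_ground_logps_l + policy_resp_logps_l
    ref_logps_w = ref_ground_logps_w + ref_resp_logps_w
    ref_logps_l = ref_ground_logps_l + ref_resp_logps_l

    # Standard DPO logits: difference of policy-vs-reference log-ratios
    # between the chosen and rejected samples.
    logits = (policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l)
    return -F.logsigmoid(beta * logits).mean()
```

In this reading, the grounding term lets preferences over temporal evidence (e.g., better vs. worse segment localization) shape the implicit reward alongside preferences over the textual answer, rather than optimizing the response alone.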
Source: arXiv: 2512.24097