VideoLatent: Video-Language Learning via Latent Self-Forcing
1️⃣ One-Sentence Summary
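To improve the video understanding and reasoning abilities of multimodal large language models efficiently and at low cost, this paper proposes VideoLatent, a method that lets the model perform "latent reasoning" internally and that can be trained using only simple video-question-answer data. It outperforms existing models across the board while cutting training and inference compute to roughly one-sixth and one-sixty-eighth of the original cost, respectively.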
Recent advances in chain-of-thought (CoT) reasoning have shown promise in enhancing the video understanding and reasoning capabilities of multimodal large language models (MLLMs). However, existing CoT-based MLLMs require labor-intensive CoT annotations and incur substantial training and inference overhead. While visual latent reasoning has emerged as a more efficient alternative, existing methods primarily focus on image tasks and rely heavily on additional supervision signals for visual latent generation (e.g., CoT traces, auxiliary images, or fine-grained annotations), limiting their scalability and transferability to video tasks. To bridge this gap, we introduce VideoLatent, a novel MLLM equipped with a latent injection module tailored for video understanding and reasoning. Specifically, VideoLatent learns to perform visual latent reasoning using a new latent self-forcing training paradigm, which comprises latent alignment and latent diversity objectives and relies solely on standard video-question-answer triplets. Extensive experiments across 14 benchmarks demonstrate that our model consistently outperforms existing standard and latent MLLMs on general video understanding and complex video reasoning. Compared with Video-R1, VideoLatent is substantially more computationally efficient, reducing training and inference overhead by $\sim 6\times$ and $\sim 68\times$, respectively. Moreover, experiments demonstrate that our method generalizes well to different MLLM backbones and model scales.
Source: arXiv:2606.22870