📄 Paper Summary
VideoSSR: Video Self-Supervised Reinforcement Learning
1️⃣ One-Sentence Summary
This work proposes VideoSSR, a video self-supervised reinforcement learning framework that generates high-quality training data via three self-supervised tasks requiring no manual annotation, effectively improving the performance of multimodal large language models across a range of video understanding tasks, with an average gain of over 5%.
2️⃣ Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while manually annotating new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To answer it, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs. The code is available at this https URL.
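To make the self-supervision idea concrete, the sketch below implements the Temporal Jigsaw pretext task in plain Python: shuffle the clips of a video and keep the recovery permutation as the ground-truth answer, scored by an exact-match reward. This is a minimal sketch assuming frame-level inputs; the function names, question template, and reward rule are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the Temporal Jigsaw pretext task: split a video
# into clips, shuffle them, and keep the permutation that restores the
# original order as a verifiable, annotation-free answer.
import random

def make_temporal_jigsaw_sample(frames, num_clips=4, seed=None):
    """Build one self-supervised (question, answer) pair from decoded frames.

    frames: list of video frames in their original temporal order.
    Returns shuffled clips plus the ground-truth reordering, which a
    rule-based reward can check exactly; no human annotation is needed.
    """
    assert len(frames) >= num_clips, "need at least one frame per clip"
    rng = random.Random(seed)
    clip_len = len(frames) // num_clips
    clips = [frames[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]

    order = list(range(num_clips))
    rng.shuffle(order)  # order[j] = original index of the clip shown at position j
    shuffled = [clips[i] for i in order]

    # Ground truth: shuffled positions listed in original temporal order,
    # i.e. playing shuffled[answer[0]], shuffled[answer[1]], ... restores the video.
    answer = [order.index(i) for i in range(num_clips)]
    question = (f"The video is split into {num_clips} shuffled segments. "
                f"List the segment positions in their original temporal order.")
    return {"clips": shuffled, "question": question, "answer": answer}

def jigsaw_reward(prediction, answer):
    """Binary verifiable reward: 1.0 iff the predicted order is exactly correct."""
    return 1.0 if list(prediction) == list(answer) else 0.0
```

Anomaly Grounding and Object Counting would follow the same pattern: the question is derived from the video itself, and the construction recipe supplies the checkable answer that RLVR requires.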