MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
1️⃣ One-Sentence Summary
This paper proposes a new method called MSJoE, in which a multimodal large language model and a lightweight key-frame sampler learn and evolve together to intelligently select a small number of the most relevant frames from a long video for understanding. This significantly improves answer accuracy while enabling efficient analysis of long videos.
Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds on the key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries that describe diverse visual perspectives relevant to the question. These queries then interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames that are fed into the MLLM for answer generation. Both the MLLM and the sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query reasoning, frame sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support training. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.
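The sampling stage of the pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding shapes, the max-pooling over queries, the softmax temperature, and the function name are all assumptions, and the learned sampler is replaced here by a simple hand-written pooling rule for clarity.

```python
import numpy as np

def sample_key_frames(query_emb, frame_emb, k, temperature=0.1):
    """Select k frames whose CLIP-style embeddings best match the queries.

    query_emb: (Q, D) array, one embedding per reasoned query (hypothetical).
    frame_emb: (F, D) array, one embedding per candidate video frame.
    Returns the indices of the k selected frames in temporal order.
    """
    # Normalize so dot products become cosine similarities, as CLIP does.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    sim = q @ f.T  # (Q, F) query-frame similarity matrix

    # Stand-in for the learned sampler: pool over queries, then softmax
    # to obtain per-frame sampling weights.
    weights = np.exp(sim.max(axis=0) / temperature)
    weights /= weights.sum()

    top = np.argsort(weights)[-k:]  # k highest-weight frames
    return np.sort(top)             # keep temporal order for the MLLM

# Toy example: 3 queries, 10 frames, 64-dim embeddings.
rng = np.random.default_rng(0)
idx = sample_key_frames(rng.normal(size=(3, 64)),
                        rng.normal(size=(10, 64)), k=4)
print(idx)  # four frame indices in ascending temporal order
```

In MSJoE the sampler's weights are learned jointly with the MLLM via reinforcement learning rather than fixed as above; the sketch only shows how a query-frame similarity matrix can be reduced to a compact, temporally ordered frame subset.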
Source: arXiv: 2602.22932