arXiv submission date: 2026-06-23
📄 Abstract - SER: Learning to Ground Video Reasoning with Semantic Evidence Rewards

Video MLLMs often struggle with fine-grained spatio-temporal reasoning, sometimes generating correct answers based on irrelevant frames or objects. Although outputting spatio-temporal evidence during reasoning is a promising direction, existing RL frameworks typically rely on geometry-only (IoU) rewards, which can be sensitive to boundary perturbations and overlook semantic alignment. To address this, we propose Semantic Evidence Reward (SER), which reformulates spatio-temporal evidence grounding as a constrained verification task. Instead of computing pixel-level overlap, SER uses a referee VLM as a local checker to evaluate model-generated evidence claims across two dimensions: relevance and localization quality, combined with a temporal penalty. This design reduces the reliance on dense box annotations and enables training directly on standard video QA data. On the V-STAR benchmark, SER achieves 49.6% mLGM, improving by 3.0 points over the strong evidence-grounded baseline Open-o3-Video, demonstrating its potential in enhancing both answer accuracy and evidence grounding.

Top-level tags: multi-modal reinforcement learning video
Detailed tags: video reasoning spatio-temporal grounding semantic reward evidence evaluation video question answering

SER: Learning to Ground Video Reasoning with Semantic Evidence Rewards


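1️⃣ One-sentence summary

This paper proposes Semantic Evidence Reward (SER), a method in which a vision-language model acts as a "referee" that checks whether the key evidence generated during video reasoning is semantically relevant, accurately localized, and temporally plausible, which improves the model's ability to locate key objects and moments in complex videos and avoids the problems of conventional evaluation that relies solely on bounding-box overlap.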
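As a small illustration of why an overlap-only reward can be brittle, the sketch below computes the standard intersection-over-union (IoU) for two axis-aligned boxes. The box coordinates are made-up values chosen for the example, not data from the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth: Box = (0.0, 0.0, 10.0, 10.0)
shifted: Box = (2.0, 2.0, 12.0, 12.0)   # same size, moved 2 px right and down

score = iou(ground_truth, shifted)       # 64 / 136, roughly 0.47
```

A 2-pixel shift on a 10×10 box cuts the IoU to below 0.5 even though the prediction still covers most of the same object. A geometry-only reward would treat this as a poor localization, whereas a semantic check asks whether the boxed region shows the right thing.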

Source: arXiv: 2606.24726