arXiv submission date: 2026-04-22
📄 Abstract - Video-ToC: Video Tree-of-Cue Reasoning

Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to reason solely from the rationales inherited from pretraining, without perception-aware adaptation to the input video content. To address this, we propose Video-ToC, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) a tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) a reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on an estimate of reasoning demands, enabling on-demand incentives for more effective reasoning strategies; and (3) an automated annotation pipeline that constructs the Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for supervised fine-tuning (SFT) and RL training, respectively. Extensive evaluations on six video understanding benchmarks and a video hallucination benchmark demonstrate the superiority of Video-ToC over baselines and recent methods. Code is available at this https URL.

Top-level tags: video llm reinforcement learning
Detailed tags: video understanding hallucination visual cue localization reward mechanism tree-of-cue reasoning

Video Tree-of-Cue Reasoning: A Tree-Structured Reasoning Framework for Enhanced Video Understanding / Video-ToC: Video Tree-of-Cue Reasoning


1️⃣ One-sentence summary

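To overcome the weak reasoning and hallucination-prone behavior of existing video large language models in complex video understanding, this paper proposes the Video-ToC framework. By combining tree-structured visual cue localization, reinforcement learning with dynamically adjusted rewards, and automatically constructed training datasets, it enables the model to reason over video content in a finer-grained and more reliable way.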

Source: arXiv: 2604.20473