arXiv submission date: 2026-01-08
📄 Abstract - VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
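The inference-time gate described above can be sketched as a simple decision rule. The snippet below is an illustrative mock, not the paper's actual implementation: the names (`direct_answer`, `reason_and_review`), the threshold value, and the use of a scalar confidence score are all assumptions standing in for the model's real confidence estimate.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float  # assumption: e.g. mean token probability of the answer span

def gated_inference(model, question, threshold=0.8):
    """Confidence-gated 'reason when necessary' inference (illustrative sketch).

    The model answers once directly; only when the confidence of that initial
    answer falls below `threshold` does it run an explicit reasoning pass and
    emit a reviewed answer, mirroring the Thinking Once, Answering Twice idea.
    """
    initial = model.direct_answer(question)
    if initial.confidence >= threshold:
        return initial.text, False          # confident: skip reasoning
    reviewed = model.reason_and_review(question, initial.text)
    return reviewed.text, True              # low confidence: reasoning activated

# Minimal stub standing in for a trained video model (hypothetical):
class StubModel:
    def __init__(self, conf):
        self.conf = conf
    def direct_answer(self, q):
        return Answer("initial", self.conf)
    def reason_and_review(self, q, draft):
        return Answer("reviewed", 0.99)

print(gated_inference(StubModel(0.95), "q"))  # -> ('initial', False)
print(gated_inference(StubModel(0.40), "q"))  # -> ('reviewed', True)
```

Under this rule, perception-oriented questions with high initial confidence would return after a single short answer, which is consistent with the paper's reported ~3.3x reduction in average response length.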

Top-level tags: multi-modal, model training, model evaluation
Detailed tags: video understanding, chain-of-thought, reasoning efficiency, video QA, confidence-based inference

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice


1️⃣ One-sentence summary

This paper proposes VideoAuto-R1, a new video understanding framework that follows an "answer first, then reason only when necessary" strategy, substantially improving efficiency while maintaining high accuracy by cutting out unnecessary reasoning steps.

Source: arXiv:2601.05175