📄 Paper Summary
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
1️⃣ One-Sentence Summary
This paper proposes Video-R4, a video reasoning model that mimics how humans repeatedly inspect key regions: it iteratively zooms into and re-analyzes textual details in video frames, significantly improving accuracy and generalization on text-rich video question answering.
Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning. Project Page: this https URL
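The rumination loop described above (select a frame, zoom into a region, re-encode the retrieved pixels, update the reasoning state) can be sketched in miniature. This is a toy illustration only: in the paper the operations are executed by a 7B LMM over visual tokens, whereas here `select_frame`, `zoom`, and the mean-based "encoding" are hypothetical stand-ins chosen just to make the control flow concrete.

```python
import numpy as np

def select_frame(frames, state):
    # Hypothetical frame scorer: pick the frame whose mean intensity
    # differs most from the current focus value in the reasoning state.
    scores = [abs(f.mean() - state["focus"]) for f in frames]
    return int(np.argmax(scores))

def zoom(frame, box):
    # Crop an informative region (here a fixed placeholder box).
    y0, y1, x0, x1 = box
    return frame[y0:y1, x0:x1]

def ruminate(frames, steps=3):
    """Toy version of the iterative visual-rumination loop:
    select a frame, zoom in, re-encode the pixels, update the state."""
    state = {"focus": 0.0, "evidence": []}
    for _ in range(steps):
        idx = select_frame(frames, state)
        region = zoom(frames[idx], (0, 8, 0, 8))  # placeholder region proposal
        feat = region.mean()                      # stand-in for re-encoding
        state["evidence"].append((idx, feat))     # accumulate grounded evidence
        state["focus"] = feat                     # update reasoning state
    return state
```

The key design point the abstract emphasizes is that each iteration feeds re-encoded pixels back into the model's state, rather than reasoning once over a fixed set of frames.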