arXiv submission date: 2026-05-19
📄 Abstract - From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) compared with the base counterpart.
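
To make the contrast between staged and merged training concrete, here is a minimal Python sketch of the two data orderings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy example IDs, and the seed are hypothetical. The abstract states only that perception should be solidified before visual reasoning; placing textual reasoning last is an assumption taken from the order in which the stages are listed.

```python
import random
from typing import Dict, List

# Stage names follow the abstract. The position of "textual_reasoning"
# in this order is an assumption; the abstract only says perception
# should come before visual reasoning.
STAGES = ["visual_perception", "visual_reasoning", "textual_reasoning"]


def staged_schedule(data: Dict[str, List[str]]) -> List[str]:
    """Return examples grouped by capability, one stage after another."""
    schedule: List[str] = []
    for stage in STAGES:
        schedule.extend(data[stage])
    return schedule


def merged_schedule(data: Dict[str, List[str]], seed: int = 0) -> List[str]:
    """Return the same examples pooled and shuffled (the merged baseline)."""
    pooled = [example for stage in STAGES for example in data[stage]]
    random.Random(seed).shuffle(pooled)
    return pooled


# Hypothetical toy data: example IDs only, no real dataset.
toy_data = {
    "visual_perception": ["p1", "p2"],
    "visual_reasoning": ["r1"],
    "textual_reasoning": ["t1"],
}
staged = staged_schedule(toy_data)
merged = merged_schedule(toy_data)
```

Both schedules contain the same examples; only the order differs, which is the variable the staged-versus-merged comparison in the abstract isolates.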

Top-level tags: machine learning, computer vision, model training
Detailed tags: vision-language models, perception vs reasoning, post-training, reinforcement learning, curriculum learning

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models


1️⃣ One-sentence summary

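This paper finds that the main bottleneck of current vision-language models on visual tasks is weak visual perception rather than reasoning itself; it therefore splits post-training into three separate stages (visual perception, visual reasoning, and textual reasoning) and shows that staged training consistently improves accuracy and shortens reasoning traces relative to merged training, with further gains when combined with a difficulty-based curriculum.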

Source: arXiv: 2605.20177