📄 Abstract - Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

We study how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To evaluate this systematically, we focus on a controlled maze-solving benchmark where the reasoning rules are fully visual, difficulty can be tuned by grid size, and all intermediate steps can be generated automatically. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; that concise CoT containing only the essential grounding steps outperforms longer traces; and, strikingly, that CoT retaining only the minimal grounding results generalizes best across maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.
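The controlled setup described in the abstract, a maze-solving task whose full solution trace can be generated automatically, lends itself to a toy illustration. The Python sketch below is hypothetical and not the authors' code: it shows how Language CoT, Grounding CoT, and a minimal-grounding supervision target might each be derived from one automatically solved maze. All names (`bfs_path`, `language_cot`, etc.) are assumptions; Visual CoT is omitted because it involves image manipulation rather than text.

```python
# Hypothetical sketch: deriving three CoT supervision formats from an
# auto-solved grid maze. Illustrative only; not the paper's implementation.
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path in a 0/1 grid (0 = free, 1 = wall) via BFS."""
    n, m = len(grid), len(grid[0])
    prev = {start: None}
    q = deque([start])
    while q:
        r, c = q.popleft()
        if (r, c) == goal:
            # Reconstruct the path by walking predecessor links back to start.
            path, cur = [], (r, c)
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < m and grid[nr][nc] == 0 and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                q.append((nr, nc))
    return None  # goal unreachable

DIR = {(1, 0): "down", (-1, 0): "up", (0, 1): "right", (0, -1): "left"}

def language_cot(path):
    # Verbose natural-language trace: one sentence per move.
    return " ".join(
        f"From ({r1},{c1}) I move {DIR[(r2 - r1, c2 - c1)]} to ({r2},{c2})."
        for (r1, c1), (r2, c2) in zip(path, path[1:]))

def grounding_cot(path):
    # Spatial coordinate trajectory: the full sequence of visited cells.
    return " -> ".join(f"({r},{c})" for r, c in path)

def minimal_grounding(path):
    # "Short is long": keep only the essential grounding result,
    # a compact move string, dropping all intermediate narration.
    return "".join(DIR[(r2 - r1, c2 - c1)][0].upper()
                   for (r1, c1), (r2, c2) in zip(path, path[1:]))

if __name__ == "__main__":
    grid = [[0, 0, 1],
            [1, 0, 0],
            [1, 1, 0]]
    path = bfs_path(grid, (0, 0), (2, 2))
    print(language_cot(path))    # longest format
    print(grounding_cot(path))   # coordinate trajectory
    print(minimal_grounding(path))  # e.g. "RDRD"
```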

Top-level tags: natural language processing, multi-modal, model training
Detailed tags: chain-of-thought, vision-language models, visual reasoning, instruction tuning, generalization

Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization


1️⃣ One-Sentence Summary

This study finds that, when training vision-language models for visual reasoning, short CoT data containing only the essential grounding steps yields better generalization and final performance than lengthy CoT or complex CoT involving image manipulation.

