ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying
1️⃣ One-Sentence Summary
This paper proposes a new framework called ViThinker that lets vision-language models actively "think" and "observe" the way humans do, dynamically generating queries during reasoning to fetch key visual information on demand, which significantly improves both the accuracy and the efficiency of complex visual reasoning tasks.
Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models because premature visual-to-text conversion discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens that trigger the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls. Through a two-stage curriculum, first distilling frozen experts into model parameters and then learning task-driven querying via sparsity penalties, ViThinker discovers the minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.
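The abstract describes the mechanism but not its implementation. Below is a minimal, hypothetical PyTorch sketch of the stage-two idea: a gate scores each reasoning step for whether to trigger a visual query, and an L1 sparsity penalty on those gates pushes the model toward minimal sufficient perception. All names (`QueryGate`, `stage2_loss`, `sparsity_weight`) and tensor shapes are illustrative assumptions, not the paper's actual code.

```python
# Sketch of a stage-2 objective consistent with the abstract: per-step query
# gates plus an L1 sparsity penalty that discourages unnecessary queries.
# Everything here is an assumption for illustration, not ViThinker's code.
import torch
import torch.nn as nn

class QueryGate(nn.Module):
    """Predicts, per reasoning step, how strongly to trigger a visual query."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, step_states: torch.Tensor) -> torch.Tensor:
        # step_states: (batch, steps, hidden_dim) -> gate values in (0, 1)
        return torch.sigmoid(self.scorer(step_states)).squeeze(-1)

def stage2_loss(task_loss: torch.Tensor,
                gates: torch.Tensor,
                sparsity_weight: float = 0.01) -> torch.Tensor:
    # L1 penalty on the gates: the model pays for every query it triggers,
    # so it learns to query only when a step needs fresh visual evidence.
    return task_loss + sparsity_weight * gates.abs().mean()

# Toy usage: random hidden states standing in for a reasoning trace.
gate = QueryGate(hidden_dim=64)
states = torch.randn(2, 5, 64)            # (batch=2, steps=5, hidden=64)
gates = gate(states)                      # (2, 5) query-trigger strengths
loss = stage2_loss(torch.tensor(0.7), gates)
loss.backward()
```

Under this reading, the sparsity weight trades off perception cost against task accuracy; the abstract's "minimal sufficient perception" corresponds to the gates staying near zero except where a reasoning step genuinely requires new visual features.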
Source: arXiv:2602.02873