📄 Abstract - Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models
Reinforcement Learning from Verifiable Rewards (RLVR) has substantially enhanced the reasoning capabilities of large language models on abstract reasoning tasks. However, its application to Large Vision-Language Models (LVLMs) remains constrained by a structural representational bottleneck: existing approaches generally lack explicit modeling and effective utilization of visual information, so visual representations are never tightly coupled with the reinforcement learning optimization process, limiting further improvements in multimodal reasoning performance. To address this limitation, we propose KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-and-play reward reweighting mechanism that explicitly incorporates structured visual information into uniform reward optimization methods (e.g., GRPO and GSPO). KAWHI adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps. Extensive evaluations on diverse reasoning benchmarks substantiate KAWHI as a general-purpose enhancement module that consistently improves a variety of uniform reward optimization methods. Project page: KAWHI (this https URL)
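To make the "plug-and-play reward reweighting" idea concrete, here is a minimal sketch of how such a step could sit on top of GRPO-style group-relative advantages. The abstract does not give KAWHI's equations, so the `visual_alignment` scores and the `beta` weighting below are hypothetical stand-ins for its hierarchical aggregation and paragraph-level credit reallocation, not the paper's actual implementation.

```python
# Minimal sketch: reweighting GRPO-style advantages with a visual-alignment
# signal. KAWHI's real components (hierarchical geometric aggregation,
# attention-head attribution) are not specified in the abstract, so
# `visual_alignment` is a hypothetical per-response score in [0, 1].
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard group-relative advantage: z-score rewards within one group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def kawhi_style_reweight(advantages: np.ndarray,
                         visual_alignment: np.ndarray,
                         beta: float = 0.5) -> np.ndarray:
    """Upweight responses whose reasoning aligns with salient visual regions.

    `visual_alignment[i]` is assumed to aggregate paragraph-level alignment
    between response i's reasoning steps and key image regions.
    """
    weights = 1.0 + beta * (visual_alignment - visual_alignment.mean())
    return advantages * weights

# Toy usage: 4 sampled responses with binary verifiable rewards.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
alignment = np.array([0.9, 0.2, 0.4, 0.1])  # hypothetical alignment scores
adv = kawhi_style_reweight(grpo_advantages(rewards), alignment)
print(adv)
```

In this reading, a response that is both correct and grounded in salient image regions receives more credit than one that is correct but visually unanchored, which is exactly the coupling between visual evidence and policy optimization that the abstract describes.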
Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models
1️⃣ One-Sentence Summary
This work proposes KAWHI, a plug-and-play reward reweighting mechanism that injects key visual region information into the reinforcement learning optimization process, addressing the disconnect between visual evidence and reasoning steps in existing methods and substantially improving the multimodal reasoning performance of large vision-language models.