Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs
1️⃣ One-Sentence Summary
This paper finds that current multimodal large models attend to images only loosely during reasoning and rarely self-correct, so early visual mistakes accumulate. To address this, it proposes a new model named SAYO, which uses a reinforcement-learning reward mechanism to guide the model toward reliably attending to key image regions, yielding better performance across a range of visual reasoning tasks.
While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.
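The abstract describes a region-level, visual-attention-based reward inside an RL framework but gives no formulas. The sketch below is one plausible instantiation, not SAYO's actual design: the function names, the `alpha` weight, and the binary `region_mask` are illustrative assumptions. The idea it demonstrates is simply rewarding the fraction of attention mass that lands on annotated key regions, added on top of a task-correctness reward so the RL credit signal is tied to visually grounded steps.

```python
# Illustrative sketch only: the paper does not publish SAYO's reward formula,
# so the weights, shaping, and mask convention here are assumptions.
import numpy as np


def region_attention_reward(attn_map: np.ndarray,
                            region_mask: np.ndarray,
                            alpha: float = 0.5) -> float:
    """Score how much visual attention falls inside annotated key regions.

    attn_map:    (H, W) non-negative attention weights over image patches.
    region_mask: (H, W) binary mask marking the ground-truth key region(s).
    alpha:       weight of the attention term relative to task correctness
                 (hypothetical hyperparameter, not from the paper).
    """
    total = attn_map.sum()
    if total <= 0:
        return 0.0
    # Fraction of total attention mass that lands on the key region.
    focus = float((attn_map * region_mask).sum() / total)
    return alpha * focus


def combined_reward(answer_correct: bool,
                    attn_map: np.ndarray,
                    region_mask: np.ndarray,
                    alpha: float = 0.5) -> float:
    """Combine task-level correctness with the region-level attention term,
    mirroring the idea of aligning RL optimization with grounded attention."""
    task_reward = 1.0 if answer_correct else 0.0
    return task_reward + region_attention_reward(attn_map, region_mask, alpha)


if __name__ == "__main__":
    # Toy example: a 4x4 patch grid whose key region is the top-left 2x2 block.
    attn = np.array([[0.30, 0.20, 0.05, 0.05],
                     [0.15, 0.10, 0.05, 0.00],
                     [0.05, 0.00, 0.05, 0.00],
                     [0.00, 0.00, 0.00, 0.00]])
    mask = np.zeros((4, 4))
    mask[:2, :2] = 1.0
    print(combined_reward(answer_correct=True, attn_map=attn, region_mask=mask))
```

In this toy case 75% of the attention mass falls inside the key region, so the scalar reward becomes 1.0 + 0.5 * 0.75 = 1.375; any standard policy-gradient method could then consume this signal during rollout scoring.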
Source: arXiv: 2602.08241