Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models
1️⃣ One-Sentence Summary
This paper proposes a training-free method called "Look Twice" that analyzes the model's own attention patterns to identify and highlight key evidence in images and text, significantly improving the accuracy and reliability of multimodal large language models on knowledge-intensive questions.
Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model's attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
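The two-stage idea described above — score evidence by the model's attention, then re-prompt with the selected cues marked up — can be illustrated with a minimal sketch. This is not the paper's released implementation (the code is not yet public); the function names, the toy attention matrix, and the `<< >>` marker style are all illustrative assumptions, and a real pipeline would pull attention weights from a pretrained MLLM instead.

```python
# Hedged sketch of "Look Twice"-style evidence selection and highlighting.
# All names and the marker format are hypothetical; in practice the attention
# matrix would come from a pretrained MLLM's query-to-evidence attention.

def select_evidence(attention, evidence_tokens, top_k=2):
    """Rank evidence tokens by total attention mass received from query tokens.

    `attention` is a (num_query_tokens x num_evidence_tokens) matrix of weights.
    Returns the top-k evidence tokens, in their original order.
    """
    # Column sums: how much attention each evidence token attracts overall.
    scores = [sum(col) for col in zip(*attention)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:top_k])
    return [evidence_tokens[i] for i in sorted(keep)]

def highlight(evidence_tokens, selected, marker=("<<", ">>")):
    """Wrap selected evidence in lightweight prompt-level markers."""
    open_m, close_m = marker
    return " ".join(
        f"{open_m}{tok}{close_m}" if tok in selected else tok
        for tok in evidence_tokens
    )

# Toy example: 2 query tokens attending over 4 retrieved evidence tokens.
attn = [
    [0.1, 0.5, 0.2, 0.2],
    [0.2, 0.4, 0.1, 0.3],
]
evidence = ["Paris", "Eiffel", "river", "bridge"]

selected = select_evidence(attn, evidence, top_k=2)
print(selected)                         # ['Eiffel', 'bridge']
print(highlight(evidence, selected))    # Paris <<Eiffel>> river <<bridge>>
```

The marked-up string would then be fed back to the model for a second ("look twice") generation pass, so the answer is conditioned on the highlighted evidence. The same scoring idea applies to visual regions by treating image patches as the evidence tokens.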
Source: arXiv:2604.01280