Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
1️⃣ One-Sentence Summary
This paper finds that adding visual training to a large language model does more than enable image understanding: it also improves generalization on text-only tasks (especially long-context information retrieval), because visual training breaks the model's habit of relying on positional shortcuts and forces it to learn a more robust symbolic binding mechanism.
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
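To make the experimental setup concrete, below is a minimal sketch of the kind of controlled synthetic retrieval task the abstract describes. The paper's exact construction is not given here, so the format, vocabulary, and parameters (e.g. `num_pairs`, the key/value alphabets) are illustrative assumptions: training examples use short contexts, while out-of-distribution examples use longer contexts so that the queried key appears at positions never seen during training, which is what distinguishes positional shortcuts from genuine symbolic binding.

```python
# Hypothetical sketch of a controlled synthetic key-value retrieval task.
# All names and parameters are assumptions, not the paper's actual setup.
import random
import string


def make_example(num_pairs: int, rng: random.Random) -> tuple[str, str]:
    """Build one retrieval example: a list of key:value pairs plus a query.

    The model must return the value bound to the queried key. Varying
    num_pairs (and hence the answer's position) between train and test
    probes whether the model binds by symbol or by positional shortcut.
    """
    keys = rng.sample(string.ascii_uppercase, num_pairs)
    values = [str(rng.randint(0, 9)) for _ in range(num_pairs)]
    context = " ".join(f"{k}:{v}" for k, v in zip(keys, values))
    query_idx = rng.randrange(num_pairs)
    prompt = f"{context} | query {keys[query_idx]} ->"
    target = values[query_idx]
    return prompt, target


rng = random.Random(0)
# In-distribution: short contexts, as seen during training.
train_set = [make_example(num_pairs=5, rng=rng) for _ in range(3)]
# Out-of-distribution: longer contexts, so answers sit at unseen positions.
ood_set = [make_example(num_pairs=12, rng=rng) for _ in range(3)]

for prompt, target in train_set + ood_set:
    print(prompt, "=>", target)
```

An image-tokenized variant of the same task would render each context as an image, where spatial translation invariance in the visual encoder prevents the model from memorizing absolute answer positions.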
Source: arXiv: 2602.15183