Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
1️⃣ One-Sentence Summary
This paper finds that adding visual training to a large language model does more than enable image understanding: it also improves generalization on text-only tasks (especially long-context information retrieval), because visual training breaks the model's habit of relying on positional shortcuts and forces it to learn a more robust symbolic binding mechanism.
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
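To make the experimental setup concrete, below is a minimal sketch of the kind of controlled synthetic retrieval task the abstract describes. The paper's exact construction is not given here, so the format, vocabulary, and parameters (e.g. `num_pairs`, the key/value alphabets) are illustrative assumptions: training examples use short contexts, while out-of-distribution examples use longer contexts so that the queried key appears at positions never seen during training, which is what distinguishes positional shortcuts from genuine symbolic binding.

```python
# Hypothetical sketch of a controlled synthetic key-value retrieval task.
# All names and parameters are assumptions, not the paper's actual setup.
import random
import string


def make_example(num_pairs: int, rng: random.Random) -> tuple[str, str]:
    """Build one retrieval example: a list of key:value pairs plus a query.

    The model must return the value bound to the queried key. Varying
    num_pairs (and hence the answer's position) between train and test
    probes whether the model binds by symbol or by positional shortcut.
    """
    keys = rng.sample(string.ascii_uppercase, num_pairs)
    values = [str(rng.randint(0, 9)) for _ in range(num_pairs)]
    context = " ".join(f"{k}:{v}" for k, v in zip(keys, values))
    query_idx = rng.randrange(num_pairs)
    prompt = f"{context} | query {keys[query_idx]} ->"
    target = values[query_idx]
    return prompt, target


rng = random.Random(0)
# In-distribution: short contexts, as seen during training.
train_set = [make_example(num_pairs=5, rng=rng) for _ in range(3)]
# Out-of-distribution: longer contexts, so answers sit at unseen positions.
ood_set = [make_example(num_pairs=12, rng=rng) for _ in range(3)]

for prompt, target in train_set + ood_set:
    print(prompt, "=>", target)
```

An image-tokenized variant of the same task would render each context as an image, where spatial translation invariance in the visual encoder prevents the model from memorizing absolute answer positions.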
Source: arXiv: 2602.15183