arXiv submission date: 2026-02-16
📄 Abstract - Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.

Top-level tags: llm, natural language processing, model training
Detailed tags: vision language models, binding shortcuts, generalization, interpretability, cross-modal training

Seeing to Generalize: How Visual Data Corrects Binding Shortcuts


1️⃣ One-sentence summary

This paper finds that adding visual training to a large language model not only lets it handle images but also improves its generalization on purely text-based tasks (especially long-context information retrieval): the visual training breaks the model's habit of relying on positional shortcuts and forces it to learn a more robust symbolic binding mechanism. A toy sketch of this distinction follows.
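The abstract does not specify the synthetic retrieval task, but a minimal sketch of the kind of key-value setup it alludes to might look like the code below. All names, vocabulary sizes, and the in-distribution/OOD position split are illustrative assumptions, not the paper's actual construction. The point it illustrates: a model that memorizes "the answer sits at slot p" (a positional shortcut) solves the in-distribution split perfectly but fails once the queried pair moves, whereas a model that binds each value to its key generalizes to new positions.

```python
# Hypothetical sketch of a synthetic key-value retrieval task in the spirit of the
# paper's controlled setup (the real task specification is not given in the abstract;
# names and distribution choices below are illustrative assumptions).
import random

VOCAB_KEYS = [f"k{i}" for i in range(20)]
VOCAB_VALS = [f"v{i}" for i in range(20)]

def make_example(num_pairs: int, query_slot_range: range) -> tuple[list[str], str]:
    """Build one retrieval example: key:value pairs followed by a query.

    The correct answer is the value bound to the queried key. If training always
    draws the queried pair from a narrow `query_slot_range`, a positional shortcut
    ("copy the value at slot p") suffices; a symbolic binding rule ("copy the value
    whose key matches the query") is needed once that range shifts.
    """
    keys = random.sample(VOCAB_KEYS, num_pairs)
    vals = random.sample(VOCAB_VALS, num_pairs)
    slot = random.choice(list(query_slot_range))  # position of the queried pair
    context = [tok for k, v in zip(keys, vals) for tok in (k, ":", v)]
    query = ["QUERY", keys[slot], "->"]
    return context + query, vals[slot]

# In-distribution: the queried key always sits in the first few slots.
id_example = make_example(num_pairs=8, query_slot_range=range(0, 3))
# Out-of-distribution: the queried key appears in later, unseen slots, so a purely
# positional shortcut no longer points at the right value.
ood_example = make_example(num_pairs=8, query_slot_range=range(5, 8))

print(id_example)
print(ood_example)
```

Under this framing, the paper's claim is that image-tokenized training, via spatial translation invariance, behaves like shuffling the query slot, which destroys the positional shortcut and pushes the model toward the key-matching strategy.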

Source: arXiv:2602.15183