Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents
1️⃣ One-sentence summary
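This paper finds that using cosine similarity or mean squared error as an auxiliary loss to optimize latent visual reasoning in vision-language models does not actually improve answer accuracy: the model largely bypasses these latent representations, and the auxiliary loss instead reshapes the language model itself through shared parameters rather than the latent variable it is meant to optimize.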
Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
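The quantities named in the abstract can be sketched in a few lines. The block below is a minimal illustration, not the paper's implementation: it shows the two alignment metrics (cosine similarity and MSE), a Pearson correlation of the kind used to relate alignment to accuracy, and a toy version of a corruption test. The names `predict`, `contexts`, and `corruption_shift`, and the choice of Gaussian noise as the corruption, are assumptions made for illustration only.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def pearson_r(xs, ys):
    """Pearson correlation, e.g. between per-variant alignment and accuracy."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def accuracy(predict, latents, contexts, labels):
    """Fraction of examples where predict(latent, context) equals the label."""
    hits = sum(predict(z, c) == y for z, c, y in zip(latents, contexts, labels))
    return hits / len(labels)


def corruption_shift(predict, latents, contexts, labels, seed=0):
    """Accuracy with noise-replaced latents minus accuracy with clean latents.

    Assumption: corruption means replacing every latent with Gaussian noise.
    A shift near zero suggests the predictor does not rely on the latent.
    """
    rng = random.Random(seed)
    corrupted = [[rng.gauss(0.0, 1.0) for _ in z] for z in latents]
    clean_acc = accuracy(predict, latents, contexts, labels)
    corrupted_acc = accuracy(predict, corrupted, contexts, labels)
    return corrupted_acc - clean_acc


if __name__ == "__main__":
    # Made-up toy data: the hypothetical predictor ignores the latent
    # and reads the answer from the context alone.
    contexts = [-2.0, -1.0, 1.0, 2.0]
    labels = [c > 0 for c in contexts]
    latents = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    print(corruption_shift(lambda z, c: c > 0, latents, contexts, labels))
```

In this toy setup the predictor never reads the latent, so corrupting it leaves accuracy unchanged. That is the signature of a bypassed latent: alignment metrics computed on the latent can be high or low without affecting the answer.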
Source: arXiv: 2606.05753