Decoupling Vision and Language: Codebook Anchored Visual Adaptation
1️⃣ One-Sentence Summary
This paper proposes a lightweight method called CRAFT that uses a discrete codebook to anchor visual representations in a stable symbolic space, allowing large vision-language models to adapt efficiently to domain-specific tasks such as medical image diagnosis without modifying any other components, while significantly improving performance.
Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.
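The abstract doesn't give implementation details, but the core mechanism it describes — anchoring continuous encoder outputs to a fixed, shared discrete codebook via nearest-neighbor lookup, in the style of vector quantization — can be sketched as follows. This is a minimal illustration, not the paper's actual code; the function name, shapes, and codebook size are assumptions.

```python
import numpy as np

def quantize_to_codebook(features, codebook):
    """Snap continuous visual features to their nearest codebook entries.

    features: (N, D) array of patch embeddings from the vision encoder.
    codebook: (K, D) array of frozen code vectors; because the codebook is
        shared, any LVLM built on the same code space can consume the output.
    Returns the quantized features and their discrete code indices.
    """
    # Squared Euclidean distance from every feature to every code vector,
    # computed via broadcasting: (N, 1, D) - (1, K, D) -> (N, K, D) -> (N, K).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d2.argmin(axis=1)  # discrete "visual token" ids, shape (N,)
    return codebook[indices], indices

# Toy example: 4 patch features against a codebook of 8 entries, dim 16.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))
codes = rng.normal(size=(8, 16))
quantized, ids = quantize_to_codebook(feats, codes)
print(quantized.shape, ids.shape)  # (4, 16) (4,)
```

Under this framing, fine-tuning the encoder while keeping the codebook frozen is what decouples vision from language: the language model only ever sees entries from the stable code space, so the encoder can change freely underneath it.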
Source: arXiv:2602.19449