arXiv submission date: 2025-12-22
📄 Abstract - CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at this https URL.
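The paper itself is not reproduced on this page, so the exact CASA layer is not specified here. The sketch below is only one plausible reading of the abstract's key idea: letting text queries attend, in a single attention call, to the image tokens and to a local window of neighboring text tokens. `CasaStyleFusionLayer`, `text_window`, and the windowed causal mask are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class CasaStyleFusionLayer(nn.Module):
    """Sketch of a fusion layer in which text queries attend, in one
    attention call, to (a) all image tokens and (b) a local causal window
    of neighboring text tokens. Names and details are illustrative only."""

    def __init__(self, dim: int, num_heads: int = 8, text_window: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.text_window = text_window

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (B, T, D) language-model hidden states
        # image: (B, I, D) vision-encoder tokens projected to the LM width
        B, T, _ = text.shape
        I = image.shape[1]
        device = text.device

        # Keys/values are the image tokens concatenated with the text tokens,
        # so cross-attention is realized through an ordinary self-attention call.
        kv = torch.cat([image, text], dim=1)  # (B, I + T, D)

        # Mask (True = blocked): every text query sees all image tokens, but
        # only a local causal window of text tokens, i.e. the text-to-text
        # interaction the abstract identifies as the key ingredient.
        q_idx = torch.arange(T, device=device).unsqueeze(1)  # (T, 1)
        k_idx = torch.arange(T, device=device).unsqueeze(0)  # (1, T)
        text_mask = (k_idx > q_idx) | (k_idx < q_idx - self.text_window + 1)
        image_mask = torch.zeros(T, I, dtype=torch.bool, device=device)
        attn_mask = torch.cat([image_mask, text_mask], dim=1)  # (T, I + T)

        fused, _ = self.attn(self.norm(text), kv, kv, attn_mask=attn_mask)
        return text + fused  # residual connection back into the text stream
```

Because only the text stream is queried, such a layer leaves the language model's main sequence length unchanged, which is what gives cross-attention-style designs their scalability on long-context inputs.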

Top tags: multi-modal, model training, natural language processing
Detailed tags: vision-language models, cross-attention, efficient fusion, image understanding, video captioning

CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion


1️⃣ One-Sentence Summary

This paper proposes a new method called CASA, which introduces text self-attention into the dedicated cross-attention layers, significantly improving the performance of vision-language models on fine-grained visual details while preserving their efficiency on long videos and conversations.
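To make the cost argument concrete, here is a rough back-of-the-envelope comparison, with made-up numbers (2,000 text tokens, 576 image tokens per frame), of how many tokens the language model's KV cache must hold when image tokens are inserted into the text stream versus kept outside it, cross-attention style. The figures are illustrative only and do not come from the paper; the cross-attention variant still stores image features, but outside the LM's self-attention cache.

```python
def kv_cache_tokens(n_text: int, n_frames: int, tokens_per_frame: int,
                    insertion: bool) -> int:
    """Tokens the LM's per-layer KV cache must hold under each fusion scheme."""
    return n_text + (n_frames * tokens_per_frame if insertion else 0)


n_text, tokens_per_frame = 2_000, 576  # e.g. a 24x24 patch grid per frame
for n_frames in (1, 100, 1_000):
    ins = kv_cache_tokens(n_text, n_frames, tokens_per_frame, insertion=True)
    xattn = kv_cache_tokens(n_text, n_frames, tokens_per_frame, insertion=False)
    print(f"{n_frames:>5} frames: insertion={ins:>9,} tokens, "
          f"cross-attn LM cache={xattn:,} tokens (+ image features kept separately)")
```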

Source: arXiv:2512.19535