arXiv submission date: 2026-02-02
📄 Abstract - Enhancing Multi-Image Understanding through Delimiter Token Scaling

Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but they degrade when multiple images are provided as input. One major cause is cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This strengthens the model's ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clear separation of sources: it improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.
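The abstract describes the core operation (scaling the hidden states of image-delimiter tokens) but not the exact layers or scaling factor. A minimal illustrative sketch of that operation, with hypothetical names (`scale_delimiter_hidden_states`, `alpha`) and NumPy standing in for the model's actual tensor library:

```python
import numpy as np

def scale_delimiter_hidden_states(hidden_states, delimiter_positions, alpha=2.0):
    """Return a copy of the (seq_len, d_model) hidden states with the rows
    at image-delimiter token positions multiplied by alpha.

    This is a sketch of the paper's idea only: amplifying delimiter hidden
    states so they better separate per-image information; the real method's
    choice of layers and alpha is not specified in the abstract.
    """
    scaled = hidden_states.copy()
    # Integer-array indexing selects exactly the delimiter rows.
    scaled[delimiter_positions] = scaled[delimiter_positions] * alpha
    return scaled

# Toy usage: a 6-token sequence with delimiters at positions 0 and 3.
h = np.ones((6, 4))
out = scale_delimiter_hidden_states(h, [0, 3], alpha=3.0)
# Rows 0 and 3 are scaled; the other rows and the original array are untouched.
```

Because the change is a single elementwise multiply on a few positions per forward pass, it adds no trainable parameters and negligible compute, consistent with the paper's "no additional training or inference cost" claim.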

Top-level tags: multi-modal, natural language processing, model training
Detailed tags: vision-language models, delimiter tokens, multi-image understanding, cross-image leakage, parameter scaling

Enhancing Multi-Image Understanding through Delimiter Token Scaling


1️⃣ One-Sentence Summary

This paper proposes a method, requiring no additional training or inference cost, that scales the hidden states of delimiter tokens to effectively block information leakage between different images in large vision-language models, significantly improving the models' ability to handle multi-image, multi-document, and multi-table tasks.

From arXiv: 2602.01984