arXiv submission date: 2026-01-12
📄 Abstract - More Images, More Problems? A Controlled Analysis of VLM Failure Modes

Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core weaknesses and their causes is still lacking. In this work, we introduce MIMIC (Multi-Image Model Insights and Challenges), a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs. Using MIMIC, we conduct a series of diagnostic experiments that reveal pervasive issues: LVLMs often fail to aggregate information across images and struggle to track or attend to multiple concepts simultaneously. To address these failures, we propose two novel complementary remedies. On the data side, we present a procedural data-generation strategy that composes single-image annotations into rich, targeted multi-image training examples. On the optimization side, we analyze layer-wise attention patterns and derive an attention-masking scheme tailored for multi-image inputs. In experiments, these remedies substantially improve cross-image aggregation and enhance performance on existing multi-image benchmarks, outperforming the prior state of the art across tasks. Data and code will be made available at this https URL.
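The data-side remedy is described only at a high level in the abstract, so the sketch below is a minimal illustration of the general idea rather than the paper's actual pipeline: independently annotated single images are procedurally composed into one multi-image example whose answer can only be obtained by aggregating over all images. The SingleImageAnnotation schema, the compose_counting_example helper, and the cross-image counting template are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SingleImageAnnotation:
    """One image with its per-image object labels (hypothetical schema)."""
    image_path: str
    object_counts: Dict[str, int]  # e.g. {"dog": 2, "car": 1}


def compose_counting_example(pool: List[SingleImageAnnotation],
                             num_images: int = 3,
                             seed: int = 0) -> dict:
    """Compose single-image annotations into one multi-image training example.

    The question can only be answered by aggregating counts over every
    sampled image, which targets the cross-image aggregation failure the
    paper identifies.
    """
    rng = random.Random(seed)
    images = rng.sample(pool, num_images)
    # Pick a category that occurs in at least one sampled image.
    categories = sorted({c for img in images for c in img.object_counts})
    category = rng.choice(categories)
    total = sum(img.object_counts.get(category, 0) for img in images)
    return {
        "images": [img.image_path for img in images],
        "question": f"Across these {num_images} images, how many instances "
                    f"of '{category}' appear in total?",
        "answer": str(total),
    }
```

Because the ground-truth answer is computed from the per-image annotations, such examples can be generated at scale with controlled difficulty (number of images, category overlap) without any new labeling.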

Top-level tags: multi-modal, model evaluation, natural language processing
Detailed tags: vision-language models, multi-image reasoning, benchmark, attention mechanisms, data generation

More Images, More Problems? A Controlled Analysis of VLM Failure Modes


1️⃣ One-Sentence Summary

By constructing MIMIC, a new multi-image evaluation benchmark, this paper shows that large vision-language models commonly struggle to aggregate information when processing multiple images, and proposes two remedies, synthesizing targeted multi-image training data and optimizing the attention mechanism, that substantially improve their multi-image understanding.
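The optimization-side remedy (an attention-masking scheme derived from layer-wise attention analysis) is likewise only sketched in the abstract. The snippet below shows one plausible form such a mask could take, assuming PyTorch and an image_ids token labeling that the paper does not specify: within a masked layer, image tokens attend only to tokens of their own image and to text tokens, keeping per-image evidence separable before aggregation. This is an illustrative assumption, not the paper's scheme.

```python
import torch


def multi_image_attention_mask(image_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask for a multi-image token sequence.

    image_ids: (seq_len,) tensor where text tokens are marked -1 and tokens
    of the k-th image are marked k. Returns a (seq_len, seq_len) mask where
    True means "may attend": image tokens attend to their own image and to
    text; text tokens attend to everything.
    """
    is_text = image_ids.eq(-1)                                       # (S,)
    same_image = image_ids.unsqueeze(0).eq(image_ids.unsqueeze(1))   # (S, S)
    # Text-token queries attend anywhere; image-token queries attend to
    # same-image keys or to text keys.
    allow = is_text.unsqueeze(1) | same_image | is_text.unsqueeze(0)
    return allow


# Usage: apply scores.masked_fill(~allow, float("-inf")) before the softmax
# inside the chosen attention layers.
image_ids = torch.tensor([-1, -1, 0, 0, 0, 1, 1, -1])  # text, image 0, image 1, text
mask = multi_image_attention_mask(image_ids)
print(mask.shape)  # torch.Size([8, 8])
```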

Source: arXiv 2601.07812