arXiv submission date: 2026-06-04
📄 Abstract - Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on "incorrect shortcuts", such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is shown in this https URL.

Top-level tags: multi-modal benchmark
Detailed tags: vision-language models chronological reasoning shortcut bias evaluation

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models


1️⃣ One-sentence summary

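The paper builds a new benchmark for evaluating the chronological reasoning ability of vision-language models and, using several image datasets of varying difficulty plus a cross-modal matching task, finds that models often base their judgments on surface cues such as color rather than genuine temporal logic, revealing serious limitations in how current models understand the temporal order of images.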
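
The shortcut finding in the abstract (models leaning on grayscale versus color instead of chronological features) can be illustrated with a small probe: score a model on the original images, score it again on grayscale copies, and compare. The sketch below is not the paper's evaluation protocol. The names `to_grayscale`, `accuracy`, `shortcut_gap`, and `color_shortcut_model` are hypothetical, images are plain nested lists of RGB tuples, and the toy predictor merely stands in for a real VLM.

```python
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def to_grayscale(image: Image) -> Image:
    """Replace each RGB pixel with its luma (ITU-R BT.601 weights), kept as three equal channels."""
    gray: Image = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        gray.append(new_row)
    return gray


def accuracy(predict: Callable[[Image], str],
             images: Sequence[Image],
             labels: Sequence[str]) -> float:
    """Fraction of images whose predicted label matches the reference label."""
    correct = sum(1 for img, lab in zip(images, labels) if predict(img) == lab)
    return correct / len(labels)


def shortcut_gap(predict: Callable[[Image], str],
                 images: Sequence[Image],
                 labels: Sequence[str]) -> float:
    """Accuracy on the original images minus accuracy on grayscale copies.

    A large positive gap suggests the predictor depends on color
    rather than on image content (an assumption of this sketch).
    """
    gray_images = [to_grayscale(img) for img in images]
    return accuracy(predict, images, labels) - accuracy(predict, gray_images, labels)


def color_shortcut_model(image: Image) -> str:
    """Toy stand-in for a VLM: calls any grayscale image 'old' and any color image 'new'."""
    is_gray = all(r == g == b for row in image for r, g, b in row)
    return "old" if is_gray else "new"


if __name__ == "__main__":
    # One recent color photo and one old grayscale photo, each a single pixel.
    images = [[[(200, 30, 30)]], [[(90, 90, 90)]]]
    labels = ["new", "old"]
    print(shortcut_gap(color_shortcut_model, images, labels))
```

With a real model, `predict` would wrap the model call, and the same gap could be reported per dataset category to see where color dependence is strongest.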

Source: arXiv: 2606.05702