VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
1️⃣ One-Sentence Summary
This paper introduces a new benchmark, VISTA-Bench, and finds that current mainstream vision-language models perform markedly worse on text embedded in images than on semantically identical pure text, revealing a significant gap in unified cross-modal understanding.
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such inputs comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, multimodal reasoning, and unimodal understanding. It evaluates visualized-text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when semantically equivalent content is presented as visualized text. The gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels. The source dataset is available at this https URL.
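To make the contrastive protocol concrete, below is a minimal sketch of the underlying idea: pose each question once as plain text and once rendered into an image, then compare accuracies to measure the modality gap. This is not the authors' released code; the rendering parameters and the `query_vlm` callable are hypothetical stand-ins for whatever rendering conditions and VLM API are actually used.

```python
# Minimal sketch (assumed, not from the paper) of a paired pure-text vs.
# visualized-text evaluation. Requires Pillow.
from PIL import Image, ImageDraw, ImageFont


def render_text_to_image(text: str, width: int = 768, padding: int = 20) -> Image.Image:
    """Render a question as black text on a white canvas (one simple rendering condition)."""
    font = ImageFont.load_default()
    measure = ImageDraw.Draw(Image.new("RGB", (1, 1)))
    # Rough greedy line wrapping so long questions stay inside the canvas.
    words, lines, line = text.split(), [], ""
    for word in words:
        candidate = (line + " " + word).strip()
        if measure.textlength(candidate, font=font) > width - 2 * padding:
            lines.append(line)
            line = word
        else:
            line = candidate
    lines.append(line)
    height = padding * 2 + 14 * len(lines)  # 14 px per line for the default font (assumption)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, ln in enumerate(lines):
        draw.text((padding, padding + 14 * i), ln, fill="black", font=font)
    return img


def modality_gap(questions, answers, query_vlm) -> float:
    """Accuracy on pure-text questions minus accuracy on the same questions rendered as images.

    `query_vlm` is a hypothetical callable accepting either text=... or image=...
    and returning the model's answer string.
    """
    text_correct = image_correct = 0
    for q, gold in zip(questions, answers):
        if query_vlm(text=q) == gold:                         # pure-text query
            text_correct += 1
        if query_vlm(image=render_text_to_image(q)) == gold:  # visualized-text query
            image_correct += 1
    n = len(questions)
    return text_correct / n - image_correct / n
```

A positive return value indicates the model answers the plain-text versions more reliably than their rendered counterparts, i.e., the modality gap the benchmark is designed to expose; harder rendering conditions (smaller fonts, clutter, distortion) would be additional variants of `render_text_to_image`.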
Source: arXiv: 2602.04802