Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
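1️⃣ One-sentence summary
This paper finds that, when processing diagrams, large vision-language models pick up node information early, whereas the relationships expressed by the connections between nodes (such as arrows) become readable only much later in the pipeline, which may help explain why these models perform poorly at understanding the logical relationships in diagrams.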
Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle to understand relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representations of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens of the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations form varies with the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which requires more abstract, compositionally integrated processing.
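To make "linearly encoded" concrete, here is a minimal toy sketch of a linear probe. It is not the paper's setup: the random vectors `H` only stand in for hidden states, the two label types are invented for illustration, and the helper names `train_linear_probe` and `probe_accuracy` are hypothetical. One label is a linear function of the vectors (analogous to a property a linear probe can read out), and the other depends on an interaction between two coordinates (analogous to a property that is present but not linearly separable).

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples the linear probe classifies correctly."""
    preds = (X @ w + b > 0).astype(int)
    return float((preds == y).mean())

rng = np.random.default_rng(0)

# Assumption: random Gaussian vectors as a stand-in for model hidden states.
H = rng.normal(size=(2000, 16))

# Toy label that IS linearly encoded: one side of a hyperplane.
direction = rng.normal(size=16)
y_linear = (H @ direction > 0).astype(int)

# Toy label that is NOT linearly separable: XOR-like sign interaction.
y_xor = (H[:, 0] * H[:, 1] > 0).astype(int)

train, test = slice(0, 1500), slice(1500, 2000)

w, b = train_linear_probe(H[train], y_linear[train])
acc_linear = probe_accuracy(H[test], y_linear[test], w, b)

w, b = train_linear_probe(H[train], y_xor[train])
acc_xor = probe_accuracy(H[test], y_xor[test], w, b)

print(f"linear label probe accuracy: {acc_linear:.2f}")
print(f"XOR-like label probe accuracy: {acc_xor:.2f}")
```

In this toy setting the probe recovers the linear label well and stays near chance on the XOR-like label, even though both labels are fully determined by `H`. That is the sense in which a probing study can say information is present at some stage without yet being linearly separable there.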
Source: arXiv: 2603.02865