Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
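1️⃣ One-sentence summary
By comparing vision-language models that process raw images directly with large language models that process symbolic inputs extracted from those images, this study finds that the main bottleneck in abstract visual reasoning lies not in the models' reasoning ability but in how visual information is converted into an effective symbolic representation.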
Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
Source: arXiv: 2604.21346