📄 Abstract - Think Visually, Reason Textually: Vision-Language Synergy in ARC

Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code is released at this https URL.

Top-level tags: multi-modal natural language processing computer vision
Detailed tags: abstract reasoning vision-language synergy arc-agi benchmark modality switching rule induction

📄 Paper Summary

Think Visually, Reason Textually: Vision-Language Synergy in ARC


1️⃣ One-Sentence Summary

This paper proposes a synergistic method that combines visual abstraction with linguistic reasoning: vision assists with pattern recognition while language ensures precise rule execution. The approach significantly improves AI model performance on the abstract reasoning benchmark ARC-AGI, offering a new path toward general reasoning capabilities closer to human intelligence.
