📄 Paper Summary
SPHINX: A Synthetic Environment for Visual Perception and Reasoning
1️⃣ One-Sentence Summary
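This paper introduces SPHINX, a synthetic visual reasoning testbed that generates puzzles spanning 25 task types (including symmetry detection and spatial reasoning) to evaluate model capabilities, finds that current state-of-the-art models perform far below human level, and shows that reinforcement learning with verifiable rewards improves model accuracy on multimodal reasoning tasks.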
We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
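To make "procedurally generated puzzles paired with verifiable ground-truth solutions" concrete, here is a minimal Python sketch of a toy symmetry-detection task with an exact-match reward. It is an illustration only, not the SPHINX generator or the paper's RLVR setup: the grid representation, the yes/no answer format, and the names `make_symmetry_puzzle`, `is_mirror_symmetric`, and `verifiable_reward` are all assumptions made for this example.

```python
import random


def make_symmetry_puzzle(size=6, seed=0):
    """Hypothetical toy generator: returns a binary grid and the ground-truth
    answer to 'Is this grid mirror-symmetric about its vertical axis?'"""
    rng = random.Random(seed)
    symmetric = rng.random() < 0.5
    half = (size + 1) // 2
    grid = []
    for _ in range(size):
        left = [rng.randint(0, 1) for _ in range(half)]
        # Mirror the left half so every row starts out symmetric.
        grid.append(left + left[:size - half][::-1])
    if not symmetric:
        # Break the symmetry by flipping one cell left of the central axis.
        r = rng.randrange(size)
        c = rng.randrange(size // 2)
        grid[r][c] ^= 1
    return grid, ("yes" if symmetric else "no")


def is_mirror_symmetric(grid):
    """Independent checker used to confirm the stored ground truth."""
    return all(row == row[::-1] for row in grid)


def verifiable_reward(model_answer, ground_truth):
    """Exact-match reward: 1.0 if the normalized answer equals the label."""
    return 1.0 if model_answer.strip().lower() == ground_truth else 0.0


if __name__ == "__main__":
    grid, answer = make_symmetry_puzzle(size=6, seed=3)
    for row in grid:
        print("".join("#" if cell else "." for cell in row))
    print("ground truth:", answer)
    print("reward for 'yes':", verifiable_reward("yes", answer))
```

Because the label is produced by the generator itself, the reward can be checked programmatically without human annotation, which is the property that makes such tasks usable both for evaluation and as a reinforcement learning signal.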