V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
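1️⃣ One-sentence summary
This paper proposes V-REX, a new evaluation suite for testing AI models on complex visual reasoning tasks that require multi-step exploration. It decomposes the reasoning process into two key stages, planning a chain of questions and following a chain of questions, to evaluate current state-of-the-art models at a fine-grained level, and finds that they still have substantial room for improvement in multi-step exploratory reasoning.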
While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification, much like an AI detective, but also produce better interpretations of the final answers. However, these paths are challenging to evaluate because the exploration space of intermediate steps is large. To bridge the gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability into (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating a finite set of question and answer options per step, V-REX achieves a reliable, quantitative, and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-source VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
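To make the Planning/Following split concrete, here is a minimal Python sketch of how per-step scoring over finite options could work. It is not the paper's implementation: the `Step` dataclass, the `evaluate` function, the chooser signature, and the choice to condition every step on the curated (gold) history are all assumptions made for illustration, and the abstract does not specify the actual metrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Step:
    """One step of a hypothetical Chain-of-Questions item."""
    question_options: List[str]  # finite candidate questions (Planning)
    gold_question: int           # index of the curated question
    answer_options: List[str]    # finite candidate answers (Following)
    gold_answer: int             # index of the correct answer


# A chooser maps (task, history so far, options) to the index it selects.
# In practice this would wrap a VLM call; here it is a plain function.
Chooser = Callable[[str, List[str], Sequence[str]], int]


def evaluate(task: str, steps: List[Step],
             plan: Chooser, follow: Chooser) -> Dict[str, float]:
    """Score Planning and Following separately as per-step accuracy.

    Assumption: each step is conditioned on the curated history, so an
    early mistake does not change what later steps are asked.
    """
    if not steps:
        raise ValueError("steps must be non-empty")
    history: List[str] = []
    plan_hits = 0
    follow_hits = 0
    for step in steps:
        # Planning: pick the next exploratory question from the options.
        if plan(task, history, step.question_options) == step.gold_question:
            plan_hits += 1
        # Following: answer the curated question for this step.
        question = step.question_options[step.gold_question]
        if follow(task, history + [question], step.answer_options) == step.gold_answer:
            follow_hits += 1
        # Extend the history with the curated question and answer.
        history.append(question)
        history.append(step.answer_options[step.gold_answer])
    n = len(steps)
    return {"planning": plan_hits / n, "following": follow_hits / n}


def always_first(task: str, history: List[str], options: Sequence[str]) -> int:
    """Toy stand-in for a model: always selects option 0."""
    return 0


# Made-up toy data, purely for illustration.
toy_steps = [
    Step(["What objects are on the table?", "What color is the sky?"], 0,
         ["A cup and a key", "A book"], 0),
    Step(["Is the door open?", "Where is the key pointing?"], 1,
         ["Toward the drawer", "Toward the window"], 0),
]
scores = evaluate("Find where the letter is hidden.", toy_steps,
                  always_first, always_first)
print(scores)
```

Because both the question and the answer at each step come from a fixed option list, every intermediate decision can be checked by index comparison, which is what makes a fine-grained, quantitative breakdown of the two abilities possible in a setup like this.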
Source: arXiv: 2512.11995