📄 Abstract - Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces
Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite the rapid advancement, our understanding of how RTs support reasoning, and of the limits of this paradigm, remains incomplete. To promote greater clarity, we introduce PITA: a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine truth or falsity on statements with proofs up to a fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure, respectively, (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Overall, our findings identify fundamental benefits and limitations inherent in using reasoning traces.
Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces
1️⃣ One-Sentence Summary
Using a large-scale logical-reasoning dataset, this study finds that AI models that generate intermediate reasoning steps excel on "broad and shallow" tasks (few steps per example, many example types), but their generalization degrades markedly on "narrow and deep" tasks (many steps, few example types), revealing strengths and limitations inherent to this class of models.