LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
1️⃣ One-Sentence Summary
This paper introduces LongCoT, a new benchmark designed to measure how well AI models solve complex problems that require multi-step, long-chain reasoning; the results show that even today's most advanced models still perform poorly on it.
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
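The abstract's key design idea is that each problem pairs a short input with a verifiable answer, where reaching that answer requires chaining many individually easy, interdependent steps. A minimal sketch of that idea, in Python, might look like the following. This is purely illustrative and not the paper's actual data format: the chain rule, the function names, and the exact-match scorer are all assumptions for the sake of the example.

```python
# Illustrative sketch (NOT the paper's actual format): a toy "long-horizon"
# item whose answer depends on a chain of individually trivial,
# interdependent steps, scored by exact-match verification.

def make_chain_item(seed: int, n_steps: int):
    """Build a toy problem: start at `seed` and apply `n_steps` simple rules.

    Each step is easy in isolation; only executing every step in order
    yields the verifiable final answer, so any single slip propagates.
    """
    value = seed
    for _ in range(n_steps):
        if value % 2 == 0:          # rule for even values: halve
            value //= 2
        else:                       # rule for odd values: 3x + 1
            value = 3 * value + 1
    prompt = (f"Start at {seed}. Repeat {n_steps} times: if even, halve; "
              f"if odd, compute 3x+1. What is the final value?")
    return prompt, value            # (short input, verifiable answer)

def score(model_answer: int, gold: int) -> bool:
    """Exact-match verification: the answer is either right or wrong."""
    return model_answer == gold

prompt, gold = make_chain_item(seed=27, n_steps=50)
```

The point of such a construction is that per-step difficulty stays flat while the number of interdependent steps grows, so failures isolate long-horizon reasoning rather than local knowledge gaps.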
Source: arXiv: 2604.14140