arXiv submission date: 2026-04-15
📄 Abstract - LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
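The abstract describes each problem as a short input with a single verifiable answer, scored by accuracy. As a rough illustration of what such an evaluation loop could look like, here is a minimal Python sketch; the JSONL fields (`prompt`, `answer`), the `query_model` stub, and the normalization rule are hypothetical placeholders, not details taken from the paper.

```python
import json


def normalize(ans: str) -> str:
    """Hypothetical normalization: compare answers case- and whitespace-insensitively."""
    return " ".join(ans.strip().lower().split())


def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation (e.g. an API client)."""
    raise NotImplementedError("wire up your model client here")


def evaluate(path: str) -> float:
    """Exact-match accuracy over a JSONL file of {'prompt': ..., 'answer': ...} records."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            prediction = query_model(item["prompt"])
            correct += normalize(prediction) == normalize(item["answer"])
            total += 1
    return correct / total if total else 0.0
```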

Top tags: llm benchmark model evaluation
Detailed tags: chain-of-thought reasoning, long-horizon evaluation, language models

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning


1️⃣ One-Sentence Summary

This paper introduces a new benchmark called LongCoT, designed specifically to measure AI models' ability to solve complex problems that require multi-step, long-chain reasoning; the results show that even today's most advanced models still perform poorly on this front.

Source: arXiv 2604.14140