📄 Abstract - Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. Meta-analysis of 1.6K LLM reasoning papers reveals that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) but neglects meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.
A Cognitive-Science Framework for Analyzing LLM Reasoning Capabilities /
Cognitive Foundations for Reasoning and Their Manifestation in LLMs
1️⃣ One-Sentence Summary
This paper proposes a unified framework grounded in cognitive science that systematically evaluates the reasoning capabilities of large language models by analyzing 28 cognitive elements, and develops a test-time reasoning guidance method that improves performance on complex problems by up to 66.7%.
2️⃣ Key Contributions
1. Taxonomy of Cognitive Elements
- Contribution: Synthesizes cognitive science research into a taxonomy of 28 cognitive elements spanning four dimensions: reasoning invariants, meta-cognitive controls, representations for organizing reasoning and knowledge, and transformation operations
- Difference/improvement: Provides a unified framework for systematically analyzing LLM reasoning capabilities
- Significance: Establishes a shared vocabulary between cognitive science and LLM research, supporting systematic diagnosis of reasoning failures
2. Fine-Grained Evaluation Framework
- Contribution: Introduces a fine-grained evaluation framework and conducts a large-scale empirical analysis of 192K model reasoning traces and 54 human think-aloud traces
- Difference/improvement: The first large-scale comparison of reasoning patterns between humans and LLMs
- Significance: Reveals that models under-utilize cognitive elements on ill-structured problems
3. Test-Time Reasoning Guidance
- Contribution: A method, derived from the analysis of cognitive behavior patterns, that automatically scaffolds successful reasoning structures
- Difference/improvement: Improves performance on complex problems by up to 66.7% while maintaining baseline performance on well-structured problems
- Significance: Shows that models possess reasoning capabilities they do not express spontaneously, and that understanding cognitive behavior patterns can guide more effective model interaction strategies
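
The summary above describes the guidance only at a high level. As a rough illustration, here is a minimal sketch of prompt-level scaffolding, assuming a behavior sequence correlated with success has already been identified; the element names, prompt wording, and `generate` callable are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Sequence

# Hypothetical behavior sequence correlated with success on ill-structured
# problems (the paper derives such sequences from its trace analysis; these
# names are placeholders).
DEFAULT_SCAFFOLD = (
    "problem reformulation",     # restate the problem in your own words
    "decomposition",             # break the goal into sub-goals
    "multiple representations",  # try at least two distinct framings
    "self-monitoring",           # check intermediate steps against the goal
    "evaluation",                # verify the answer before committing to it
)

def build_guided_prompt(problem: str,
                        scaffold: Sequence[str] = DEFAULT_SCAFFOLD) -> str:
    """Wrap a problem statement with explicit cognitive-element instructions."""
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(scaffold))
    return (
        "Solve the problem below. Structure your reasoning so it passes through "
        "the following stages in order, labeling each stage:\n"
        f"{steps}\n\nProblem:\n{problem}"
    )

def guided_solve(problem: str, generate: Callable[[str], str]) -> str:
    """Run any text-generation callable on the scaffolded prompt."""
    return generate(build_guided_prompt(problem))

if __name__ == "__main__":
    # Echo "model": prints the constructed prompt instead of calling an LLM.
    print(guided_solve("Plan a three-city research trip on a fixed budget.",
                       generate=lambda prompt: prompt))
```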
3️⃣ Main Results and Value
Result highlights
- On ill-structured problems, models narrow to rigid sequential processing, whereas humans show more abstraction and conceptual processing
- Analysis of 1,598 arXiv papers on LLM reasoning and 170K reasoning traces shows that models under-utilize cognitive elements on ill-structured problems
- Test-time reasoning guidance improves performance on ill-structured problems by up to 66.7%, showing that models have latent reasoning capabilities they do not fully deploy
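
The under-utilization finding rests on comparing how often each cognitive element appears in successful versus unsuccessful traces. The sketch below illustrates that kind of comparison, assuming traces have already been annotated with the taxonomy; the data layout and element names are hypothetical, not the paper's pipeline.

```python
from collections import Counter

# Hypothetical annotated traces: each has a success flag and the set of
# cognitive elements detected in it (a toy stand-in for the paper's annotation).
def element_rates(traces):
    """Fraction of traces in which each cognitive element appears."""
    counts = Counter(e for t in traces for e in t["elements"])
    return {e: c / len(traces) for e, c in counts.items()}

def success_gap(traces):
    """Per-element usage difference between successful and failed traces.

    A large positive gap marks an element correlated with success; if models
    rarely deploy it spontaneously, it is a candidate for test-time scaffolding.
    """
    ok = element_rates([t for t in traces if t["success"]])
    bad = element_rates([t for t in traces if not t["success"]])
    return {e: ok.get(e, 0.0) - bad.get(e, 0.0) for e in set(ok) | set(bad)}

if __name__ == "__main__":
    toy_traces = [
        {"success": True,  "elements": {"decomposition", "self-monitoring"}},
        {"success": True,  "elements": {"decomposition", "abstraction"}},
        {"success": False, "elements": {"sequential organization"}},
    ]
    print(success_gap(toy_traces))
```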
Practical value
- Provides tools and methods for systematically diagnosing LLM reasoning failures
- The interventions developed can significantly improve model performance on complex problems
- Lays a foundation for evaluating and improving general reasoning in LLMs, helping avoid optimizing what is measured rather than what matters
4️⃣ Glossary
- Cognitive elements: 28 concrete reasoning-capability elements drawn from cognitive science, spanning executive functions, reasoning representations, and reasoning operations
- Meta-cognitive controls: Executive functions that select and monitor reasoning strategies, including higher-order monitoring mechanisms such as self-awareness, strategy selection, and evaluation
- Reasoning invariants: Fundamental properties that valid reasoning must preserve, including logical coherence and compositionality
- Test-time reasoning guidance: A targeted intervention that improves reasoning performance by prompting models to follow cognitive behavior sequences associated with success
- Marr's levels of analysis: A framework for understanding complex information-processing systems, including the computational level (which defines the system's goals) and the algorithmic/representational level (which specifies the processes and representations used)
- Compositionality: The ability to construct complex thoughts from simpler components through rule-governed combination
- Productivity: The generative capacity to produce unboundedly many new thoughts from a finite set of primitives
- Structural organization: The architecture specifying how elements relate to one another, including forms such as sequential and hierarchical organization
- Hierarchical organization: Organization that nests concepts through parent-child relationships, allowing complex wholes to be decomposed into manageable parts
- Goal management: The process of setting, prioritizing, and dynamically adjusting goals during reasoning
- Cognitive load theory: The theory that working-memory limits create a severe bottleneck; poorly structured information exceeds capacity, while well-structured information aids processing