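Characterizing, Evaluating, and Optimizing Complex Reasoning
1️⃣ One-Sentence Summary
The paper proposes a unified framework for characterizing, evaluating, and optimizing complex reasoning in large reasoning models. It introduces macro- and micro-level evaluation principles, models reasoning traces as directed acyclic graphs, and builds a corresponding reward model, which significantly improves model performance across a variety of tasks.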
Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals to optimize reasoning. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle, which characterizes reasoning quality at both the macro and micro levels in terms of efficiency and effectiveness. (2) Building on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method that captures complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.
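To make the DAG view and the test-time selection step more concrete, here is a minimal Python sketch. It is not the paper's method: `ReasoningDAG`, `toy_efficiency`, and `best_of_n` are hypothetical names, and the score (the fraction of steps that the final answer actually depends on) is only a toy stand-in for a trained TRM. It assumes each reasoning step declares the earlier steps it builds on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Set


@dataclass
class ReasoningDAG:
    """Hypothetical container: maps each step id to the step ids it depends on."""
    parents: Dict[str, List[str]] = field(default_factory=dict)

    def add_step(self, step_id: str, depends_on: Sequence[str] = ()) -> None:
        # Parents must already exist and ids are unique, so no cycle can form.
        if step_id in self.parents:
            raise ValueError(f"duplicate step: {step_id}")
        for parent in depends_on:
            if parent not in self.parents:
                raise KeyError(f"unknown parent: {parent}")
        self.parents[step_id] = list(depends_on)

    def contributing_steps(self, final_step: str) -> Set[str]:
        """Return the final step plus every step it transitively depends on."""
        seen: Set[str] = set()
        stack = [final_step]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.parents[node])
        return seen


def toy_efficiency(dag: ReasoningDAG, final_step: str) -> float:
    """Toy score (an assumption, not the TRM): share of steps that feed the answer."""
    return len(dag.contributing_steps(final_step)) / len(dag.parents)


def best_of_n(candidates: Sequence[ReasoningDAG],
              score: Callable[[ReasoningDAG], float]) -> ReasoningDAG:
    """Test-time selection: keep the candidate trace with the highest score."""
    return max(candidates, key=score)


# Trace A explores a dead end (s3) that the answer never uses.
trace_a = ReasoningDAG()
trace_a.add_step("s1")
trace_a.add_step("s2", depends_on=["s1"])
trace_a.add_step("s3", depends_on=["s1"])
trace_a.add_step("answer", depends_on=["s2"])

# Trace B goes straight from the first step to the answer.
trace_b = ReasoningDAG()
trace_b.add_step("s1")
trace_b.add_step("answer", depends_on=["s1"])

chosen = best_of_n([trace_a, trace_b], lambda d: toy_efficiency(d, "answer"))
```

In this sketch, trace A scores 0.75 because one of its four steps is a dead end, trace B scores 1.0, and `best_of_n` therefore returns trace B. In a real pipeline the `score` callable would be replaced by a learned reward model rather than a structural heuristic.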
Source: arXiv:2602.08498