arXiv submission date: 2026-03-02
📄 Abstract - Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at Hugging Face (this https URL), and the code is released at GitHub (this https URL).
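To make the "Verifiable Rewards" part of RLVR concrete, here is a minimal, hypothetical sketch of the kind of reward signal the abstract describes: a check that a GRM rollout ends in a parseable verdict matching the gold preference label. The `[[A]]`/`[[B]]` verdict format, the function name, and the example rationales are illustrative assumptions, not taken from the paper.

```python
# Hypothetical RLVR-style verifiable reward for a generative reward model.
# Assumption: the GRM ends its chain-of-thought with a verdict tag such as
# '[[A]]' or '[[B]]' naming the preferred response.
import re

def verifiable_reward(rollout: str, gold_label: str) -> float:
    """Return 1.0 if the rollout's final verdict matches the gold label, else 0.0."""
    match = re.search(r"\[\[([AB])\]\]$", rollout.strip())
    if match is None:
        return 0.0  # unparseable verdicts earn no reward
    return 1.0 if match.group(1) == gold_label else 0.0

# Toy rollouts: a breadth-style (multi-principle) and a depth-style
# (single substantive issue) rationale, each ending in a verdict.
b_cot = "Helpfulness: A>B. Safety: A=B. Style: A>B. Verdict: [[A]]"
d_cot = "Response B miscomputes 17*24 as 398 (correct: 408). [[A]]"

print(verifiable_reward(b_cot, "A"))  # matching verdict -> 1.0
print(verifiable_reward(d_cot, "B"))  # mismatched verdict -> 0.0
```

Because the reward depends only on an objectively checkable final label, it can be computed automatically during RL without a learned judge, which is the defining property of RLVR.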

Top-level tags: llm, model training, model evaluation
Detailed tags: generative reward models, chain-of-thought, reasoning mechanisms, reinforcement learning, benchmark

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models


1️⃣ One-sentence summary

This paper proposes a new framework called Mix-GRM that substantially improves the evaluation performance of generative reward models, not by simply lengthening reasoning, but by structurally combining breadth reasoning (covering multiple evaluation principles) with depth reasoning (ensuring substantive judgment soundness), achieving new state-of-the-art results across diverse tasks.

Source: arXiv: 2603.01571