arXiv submission date: 2026-03-02
📄 Abstract - Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at Hugging Face (this https URL), and the code is released at GitHub (this https URL).
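To make the "Verifiable Rewards" part of RLVR concrete, here is a minimal, hypothetical sketch of the kind of reward signal the abstract describes: a check that a GRM rollout ends in a parseable verdict matching the gold preference label. The `[[A]]`/`[[B]]` verdict format, the function name, and the example rationales are illustrative assumptions, not taken from the paper.

```python
# Hypothetical RLVR-style verifiable reward for a generative reward model.
# Assumption: the GRM ends its chain-of-thought with a verdict tag such as
# '[[A]]' or '[[B]]' naming the preferred response.
import re

def verifiable_reward(rollout: str, gold_label: str) -> float:
    """Return 1.0 if the rollout's final verdict matches the gold label, else 0.0."""
    match = re.search(r"\[\[([AB])\]\]$", rollout.strip())
    if match is None:
        return 0.0  # unparseable verdicts earn no reward
    return 1.0 if match.group(1) == gold_label else 0.0

# Toy rollouts: a breadth-style (multi-principle) and a depth-style
# (single substantive issue) rationale, each ending in a verdict.
b_cot = "Helpfulness: A>B. Safety: A=B. Style: A>B. Verdict: [[A]]"
d_cot = "Response B miscomputes 17*24 as 398 (correct: 408). [[A]]"

print(verifiable_reward(b_cot, "A"))  # matching verdict -> 1.0
print(verifiable_reward(d_cot, "B"))  # mismatched verdict -> 0.0
```

Because the reward depends only on an objectively checkable final label, it can be computed automatically during RL without a learned judge, which is the defining property of RLVR.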

Top-level tags: llm, model training, model evaluation
Detailed tags: generative reward models, chain-of-thought, reasoning mechanisms, reinforcement learning, benchmark

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models


1️⃣ One-sentence summary

This paper proposes a new framework called Mix-GRM that substantially improves the evaluation performance of generative reward models, not by simply lengthening reasoning, but by structurally combining breadth reasoning (covering multiple evaluation principles) with depth reasoning (ensuring substantive judgment soundness), achieving new state-of-the-art results across diverse tasks.

Source: arXiv: 2603.01571