LAD: Learning Advantage Distribution for Reasoning
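1️⃣ One-sentence summary
This paper proposes LAD, a new method that has the model learn and match an "advantage distribution" rather than simply pursue the highest reward. It targets the tendency of current large models to collapse onto a single line of reasoning and lose diversity in math and code reasoning, and it improves accuracy while also increasing the diversity of answers.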
Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.
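The contrast between advantage maximization and distribution matching can be illustrated on a toy bandit, in the spirit of the controlled bandit setting mentioned above. The abstract gives neither the exact form of the advantage-induced distribution nor the choice of $f$-divergence, so this minimal sketch makes two assumptions that may differ from the paper: the target is a softmax of the advantages with a hypothetical temperature `tau`, and the divergence is the forward KL. The function names are illustrative only.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def advantage_target(advantages, tau=0.5):
    # ASSUMPTION: the advantage-induced distribution is softmax(A / tau).
    # The abstract does not specify this form.
    return softmax(np.asarray(advantages, dtype=float) / tau)


def train_matching(advantages, tau=0.5, lr=1.0, steps=5000):
    """Fit softmax-policy logits by minimizing KL(target || policy).

    ASSUMPTION: forward KL stands in for the paper's f-divergence.
    """
    target = advantage_target(advantages, tau)
    theta = np.zeros(len(advantages))
    for _ in range(steps):
        pi = softmax(theta)
        # Gradient of KL(target || softmax(theta)) w.r.t. theta is (pi - target).
        theta -= lr * (pi - target)
    return softmax(theta)


def train_maximizing(advantages, lr=1.0, steps=5000):
    """Baseline: gradient ascent on the expected advantage E_pi[A]."""
    adv = np.asarray(advantages, dtype=float)
    theta = np.zeros(len(adv))
    for _ in range(steps):
        pi = softmax(theta)
        # Exact gradient of E_pi[A] w.r.t. theta_i is pi_i * (A_i - E_pi[A]).
        theta += lr * pi * (adv - pi @ adv)
    return softmax(theta)


# Two good arms with nearly equal advantage, two bad arms.
advantages = [1.0, 0.9, -1.0, -1.0]
target = advantage_target(advantages)
matched = train_matching(advantages)
maximized = train_maximizing(advantages)

print("target    :", np.round(target, 3))
print("matched   :", np.round(matched, 3))
print("maximized :", np.round(maximized, 3))
```

In this toy setup, the maximizing policy concentrates almost all of its mass on the single best arm, while the matching policy keeps substantial probability on both good arms. That is the qualitative behavior the abstract attributes to LAD, although the actual objective and gradient are defined in the paper itself.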
Source: arXiv: 2602.20132