arXiv submission date: 2026-05-19
📄 Abstract - LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for forgoing an explicit value critic by normalizing rewards across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this information-theoretic bottleneck, we introduce Lambda Policy Optimization (LambdaPO), a framework that re-conceptualizes advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optimum. Experimental results across challenging math reasoning and question-answering tasks demonstrate that LambdaPO improves performance compared to the baseline methods.
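
The pairwise advantage described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything concrete here is an assumption: the function name `pairwise_advantages`, the use of a per-rollout policy score (for example a sequence log-probability), the sigmoid as the confidence model, and the `temperature` parameter. The sketch sums the reward differences of one rollout against every peer and shrinks each term by how confident the policy already is in the reward-implied ordering.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_advantages(rewards, scores, temperature=1.0):
    """Hypothetical pairwise advantage for one group of rollouts.

    rewards: one scalar reward per rollout.
    scores:  one policy score per rollout (assumed: sequence log-probability).
    Assumed form: A_i = sum over j != i of (r_i - r_j) * (1 - confidence_ij),
    where confidence_ij is the policy's probability of already ranking
    the higher-reward rollout of the pair above the lower-reward one.
    """
    if len(rewards) != len(scores):
        raise ValueError("rewards and scores must have the same length")
    n = len(rewards)
    advantages = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue
            diff = rewards[i] - rewards[j]
            if diff == 0.0:
                continue  # tied rewards express no preference
            sign = 1.0 if diff > 0 else -1.0
            # Confidence that the policy agrees with the reward ordering.
            confidence = sigmoid(sign * (scores[i] - scores[j]) / temperature)
            # Pairs the policy already orders correctly contribute less.
            total += diff * (1.0 - confidence)
        advantages.append(total)
    return advantages


# With identical scores every confidence is 0.5, so the result is the
# mean-centered reward scaled by n / 2.
print(pairwise_advantages([1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
# → [1.0, -1.0, -1.0, 1.0]
```

Under these assumptions each pair contributes equal and opposite terms to its two members, so the advantages in a group sum to zero, and a pair the policy already ranks correctly with high confidence contributes almost nothing.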

Top-level tags: llm, reinforcement learning, model training
Detailed tags: policy optimization, advantage estimation, pairwise preference, mathematical reasoning, reinforcement learning from human feedback

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models


1️⃣ One-sentence summary

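This paper proposes LambdaPO, a method that replaces the simple group-mean reward baseline with fine-grained pairwise comparisons between trajectories and adds a semantic density reward, so that large language models are optimized more effectively on mathematical reasoning and question-answering tasks and perform better than existing methods.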
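
The abstract describes the semantic density reward only as derived from the precision-recall alignment between a generated reasoning trace and the ground-truth solution. One simple way to read that is a token-overlap F1 score, sketched below. The whitespace tokenization, the lowercasing, the harmonic-mean combination, and the function name `density_reward` are all assumptions made for illustration, not the paper's definition.

```python
from collections import Counter


def density_reward(trace: str, reference: str) -> float:
    """Hypothetical dense reward: token-level F1 between trace and reference.

    Assumes whitespace tokenization and multiset (bag-of-tokens) overlap.
    """
    generated = Counter(trace.lower().split())
    target = Counter(reference.lower().split())
    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((generated & target).values())
    if overlap == 0:
        return 0.0  # also covers an empty trace or an empty reference
    precision = overlap / sum(generated.values())  # share of trace tokens that match
    recall = overlap / sum(target.values())        # share of reference tokens covered
    return 2.0 * precision * recall / (precision + recall)


# A trace that covers the whole reference but adds extra tokens has
# recall 1 and precision below 1, so its reward falls between 0 and 1.
print(density_reward("x = 2 so x + 1 = 3", "x + 1 = 3"))
```

Unlike a binary correct/incorrect outcome, a score of this kind varies continuously between 0 and 1, which is how it could give a learning signal even when every rollout in a group gets the final answer wrong.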

Source: arXiv: 2605.19416