Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
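1️⃣ One-sentence summary
This paper proposes PPR-GDE, a reinforcement learning method that requires no scalar rewards. It captures subjective preferences through pairwise comparisons and adds a group-level diversity measure to the reward signal, which improves alignment quality in open-ended generation tasks such as role-playing while avoiding monotonous, stereotyped outputs.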
Current reinforcement learning (RL) methods are powerful and broadly applicable in verifiable settings where scalar rewards can be provided. In open-ended generation tasks, however, verifying the correctness of responses remains challenging, and training reward models incurs substantial computational and annotation costs. Moreover, reinforcement learning with verifiable rewards (RLVR) often leads to diversity collapse and produces stereotypical or rigid outputs, which are particularly undesirable in open-domain scenarios. We propose Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), an RL method better suited to open-ended generation. PPR-GDE does not require scalar rewards and incorporates group-level diversity into the reward signal. It preserves the comparative structure of subjective evaluation through a pairwise preference reward, mitigates judge position bias via repeated comparisons with swapped response order, and introduces a group-based diversity reward that explicitly encourages semantic dispersion within a response group. These reward signals are integrated into a unified group-relative policy optimization objective. We instantiate PPR-GDE on a role-playing task, and experiments show that it achieves better alignment quality and expressive diversity than strong RL baselines. Further analysis shows that the pairwise preference reward is critical for aligning with subjective preferences, while the diversity metric plays an essential role in achieving superior expressive diversity and broader semantic coverage.
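The three ingredients named in the abstract (a pairwise preference reward with swapped-order comparisons, a group-based diversity reward, and group-relative normalization) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the `judge` callable, the win-rate scoring, the mean cosine distance as the dispersion measure, the additive `diversity_weight`, and all function names are hypothetical and are not the paper's actual definitions.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# Hypothetical judge: returns True when the response shown FIRST is preferred.
Judge = Callable[[str, str], bool]


def pairwise_preference_rewards(responses: Sequence[str], judge: Judge) -> List[float]:
    """Win rate of each response against the rest of its group (needs >= 2 responses).

    Every pair is judged twice, once in each presentation order, and each
    ordering contributes half a point, so a judge that simply favors one
    position yields a tie instead of a spurious win.
    """
    n = len(responses)
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        i_wins_shown_first = judge(responses[i], responses[j])
        j_wins_shown_first = judge(responses[j], responses[i])
        score_i = 0.5 * i_wins_shown_first + 0.5 * (not j_wins_shown_first)
        wins[i] += score_i
        wins[j] += 1.0 - score_i
    return [w / (n - 1) for w in wins]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def group_diversity_rewards(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Mean cosine distance from each response embedding to the others in the group."""
    n = len(embeddings)
    return [
        mean([cosine_distance(embeddings[i], embeddings[j]) for j in range(n) if j != i])
        for i in range(n)
    ]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within the group (zero mean, roughly unit variance)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def combined_advantages(
    responses: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    judge: Judge,
    diversity_weight: float = 0.5,  # assumed mixing coefficient, not from the paper
) -> List[float]:
    """Add the two reward signals, then normalize within the group."""
    preference = pairwise_preference_rewards(responses, judge)
    diversity = group_diversity_rewards(embeddings)
    total = [p + diversity_weight * d for p, d in zip(preference, diversity)]
    return group_relative_advantages(total)


if __name__ == "__main__":
    # Toy stand-ins: a length-based judge and hand-written 2-D embeddings.
    group = ["a", "bbb", "cc"]
    toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(combined_advantages(group, toy_embeddings, lambda a, b: len(a) > len(b)))
```

In practice the judge would be an LLM or human comparison and the embeddings would come from a sentence encoder; the toy versions above only keep the sketch runnable. Scoring each ordering at half a point is one simple way to realize the swapped-order idea: a judge that always picks the first response gives every member of the group the same reward, so position bias alone contributes no learning signal.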
Source: arXiv: 2605.18191