GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
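1️⃣ One-sentence summary
This paper proposes GradAlign, a method that automatically selects high-quality training problems for LLM reinforcement learning by choosing data whose training gradients point in the same direction as the gradients of a small, trusted validation set, which yields more stable and more efficient optimization across several difficult data regimes.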
Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpora. GradAlign consistently outperforms existing baselines, yielding more stable training and improved final performance, which underscores the importance of directional gradient signals in navigating non-stationary policy optimization. We release our implementation.
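To make the selection idea concrete, here is a minimal sketch of ranking training problems by how well their gradients align with a validation gradient. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the scoring function, so cosine similarity is assumed here, and the names `cosine`, `select_aligned`, `toy_grads`, and `toy_val_grad` are hypothetical. Gradients are represented as plain lists of floats rather than real policy gradients.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_aligned(problem_grads: Dict[str, Sequence[float]],
                   val_grad: Sequence[float],
                   k: int) -> List[str]:
    """Return the k problem ids whose gradients best align with val_grad.

    Assumption: alignment is scored by cosine similarity. The source only
    says that problems with aligned policy gradients are prioritized.
    """
    scores = {pid: cosine(g, val_grad) for pid, g in problem_grads.items()}
    ranked = sorted(scores, key=lambda pid: scores[pid], reverse=True)
    return ranked[:k]


# Toy data (hypothetical): three problems with 2-D stand-in "gradients".
toy_grads = {
    "helpful": [1.0, 0.9],      # roughly parallel to the validation gradient
    "orthogonal": [1.0, -1.0],  # no component along the validation gradient
    "harmful": [-1.0, -1.0],    # points the opposite way
}
toy_val_grad = [1.0, 1.0]
selected = select_aligned(toy_grads, toy_val_grad, k=1)
```

In this toy setup the problem whose gradient is nearly parallel to the validation gradient ranks first and the opposing one ranks last. Because the policy changes during RL, such scores would presumably need to be recomputed as training proceeds to obtain the adaptive curriculum the abstract describes; how often and at what granularity is not stated there.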
Source: arXiv: 2602.21492