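Evaluating Parameter Efficient Methods for RLVR
1️⃣ One-sentence summary
This study presents the first systematic evaluation of multiple parameter-efficient fine-tuning methods under Reinforcement Learning with Verifiable Rewards. It finds that structural variants such as DoRA outperform the commonly used LoRA, explains why certain initialization strategies fail, and offers clear guidance for choosing an efficient fine-tuning method.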
We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to improve their reasoning capabilities through verifiable feedback. Although methods like LoRA are commonly used, the optimal PEFT architecture for RLVR has not been identified. In this work, we conduct the first comprehensive evaluation of more than 12 PEFT methods across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, structural variants such as DoRA, AdaLoRA, and MiSS consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (e.g., PiSSA, MiLoRA) and attribute their failure to a fundamental misalignment between principal-component updates and RL optimization. Third, our ablations reveal that extreme parameter reduction (e.g., VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate these findings. This work provides a guide for choosing PEFT methods in RLVR and advocates further exploration of parameter-efficient RL methods.
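To make the distinctions in the abstract concrete, the sketch below contrasts three of the named ideas on a single weight matrix: standard LoRA initialization, a PiSSA-style SVD-informed initialization, and a DoRA-style magnitude/direction forward pass. It is a minimal NumPy illustration and not the paper's experimental code. The function names, the 0.02 initialization scale, and the choice of the column axis for the DoRA norm are assumptions made for illustration; real implementations differ in these details.

```python
import numpy as np

def lora_init(W, r, rng):
    """Standard LoRA: A is random, B is zero, so the adapter starts as a no-op.
    The 0.02 standard deviation is an illustrative assumption."""
    d_out, d_in = W.shape
    A = rng.normal(0.0, 0.02, size=(r, d_in))
    B = np.zeros((d_out, r))
    return W.copy(), B, A  # frozen base weight, trainable B and A

def pissa_init(W, r):
    """PiSSA-style init: the adapter holds the top-r singular components of W,
    and the frozen base becomes the residual W - B @ A."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    B = U[:, :r] * sqrt_S          # shape (d_out, r)
    A = sqrt_S[:, None] * Vt[:r]   # shape (r, d_in)
    W_res = W - B @ A
    return W_res, B, A

def dora_forward(x, W0, B, A, m):
    """DoRA-style forward: a learned magnitude m rescales the normalized
    direction of (W0 + B @ A). The norm is taken per column here (assumption)."""
    V = W0 + B @ A
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)  # shape (1, d_in)
    W_eff = m * V / col_norm
    return x @ W_eff.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(6, 4))

    W0, B, A = lora_init(W, r=2, rng=rng)
    print("LoRA starts at W:", np.allclose(W0 + B @ A, W))

    W_res, Bp, Ap = pissa_init(W, r=2)
    print("PiSSA reconstructs W:", np.allclose(W_res + Bp @ Ap, W))

    m = np.linalg.norm(W, axis=0, keepdims=True)  # magnitude initialized from W
    x = rng.normal(size=(3, 4))
    print("DoRA starts at W:", np.allclose(dora_forward(x, W0, B, A, m), x @ W.T))
```

All three variants reproduce the pretrained weight at step zero. They differ in what is trainable: in the PiSSA-style case the trainable factors carry the principal singular directions from the start, which is the kind of principal-component update the abstract reports as misaligned with RL optimization.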
Source: arXiv: 2512.23165