arXiv submission date: 2026-01-29
📄 Abstract - Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over a 1.6$\times$ speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
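The abstract outlines a two-stage loop: purify the prompt by removing interference tokens, then transfer successful rollouts from the purified prompt to supervise training on the original noisy prompt. The sketch below illustrates that loop in Python under loud assumptions: the greedy token-deletion purifier, `ToyPolicy`, `toy_verifier`, and every function name here are hypothetical stand-ins for illustration, not the paper's actual LENS implementation.

```python
import random

# Toy stand-in policy and verifier so the sketch runs end to end.
class ToyPolicy:
    def sample(self, prompt_tokens):
        # Placeholder for sampling one reasoning rollout from the LLM.
        return " ".join(prompt_tokens) + " -> answer"

    def update(self, prompt_rollout_pairs):
        # Placeholder for the policy optimization step on (prompt, rollout) pairs.
        pass

def toy_verifier(prompt_tokens, rollout):
    # Placeholder verifiable reward: noisier prompts succeed less often.
    noise = sum(tok.startswith("noise") for tok in prompt_tokens)
    return random.random() < 0.9 ** (1 + noise)

def success_rate(prompt_tokens, policy, verifier, n):
    """Fraction of n sampled rollouts that the verifier accepts."""
    wins = sum(verifier(prompt_tokens, policy.sample(prompt_tokens)) for _ in range(n))
    return wins / n

def purify_prompt(prompt_tokens, policy, verifier, n=8):
    """Stage 1 (assumed greedy deletion): drop tokens whose removal
    improves rollout success; keep tokens that carry information."""
    tokens = list(prompt_tokens)
    i = 0
    while i < len(tokens):
        candidate = tokens[:i] + tokens[i + 1:]
        if success_rate(candidate, policy, verifier, n) > success_rate(tokens, policy, verifier, n):
            tokens = candidate  # interference token: remove it
        else:
            i += 1              # informative token: keep it
    return tokens

def lens_step(noisy_prompt, policy, verifier, n=8):
    """Stage 2: collect successful rollouts on the purified prompt and pair
    them with the ORIGINAL noisy prompt as supervision, so the policy
    learns to ignore interference it will still see at inference time."""
    clean_prompt = purify_prompt(noisy_prompt, policy, verifier, n)
    transfers = [
        (noisy_prompt, rollout)
        for rollout in (policy.sample(clean_prompt) for _ in range(n))
        if verifier(clean_prompt, rollout)
    ]
    policy.update(transfers)
    return clean_prompt, transfers

if __name__ == "__main__":
    prompt = ["solve", "for", "x:", "noise_tag", "2x+1=5", "noise_footer"]
    clean, pairs = lens_step(prompt, ToyPolicy(), toy_verifier)
    print("purified prompt:", clean)
    print("transferred rollouts:", len(pairs))
```

The key design point this sketch tries to capture is the cross-prompt transfer in stage 2: supervision is gathered where exploration is easy (the purified prompt) but applied where the model must ultimately operate (the noisy prompt).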

Top-level tags: llm, reinforcement learning, model training
Detailed tags: instruction purification, reasoning, rlvr, exploration efficiency, prompt optimization

Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification


1️⃣ One-sentence summary

This work proposes a new framework, LENS, which identifies and removes interfering tokens from instructions to improve the reasoning efficiency of large language models under reinforcement learning, achieving faster training and better performance on complex tasks.

Source: arXiv 2601.21244