Not All Negative Samples Are Equal: LLMs Learn Better from Plausible Reasoning
1️⃣ One-Sentence Summary
This paper proposes a new method called Plausible Negative Samples (PNS): it deliberately generates training samples that look well-formatted and follow plausible reasoning yet arrive at an incorrect final answer, improving large language models on tasks such as mathematical reasoning more effectively than conventional approaches.
Learning from negative samples holds great promise for improving Large Language Model (LLM) reasoning capability, yet existing methods treat all incorrect responses as equally informative, overlooking the crucial role of sample quality. To address this, we propose Plausible Negative Samples (PNS), a method that synthesizes high-quality negative samples exhibiting the expected format and structural coherence while ultimately yielding incorrect answers. PNS trains a dedicated model via reverse reinforcement learning (RL) guided by a composite reward combining format compliance, accuracy inversion, reward model assessment, and chain-of-thought evaluation, generating responses nearly indistinguishable from correct solutions. We further validate PNS as a plug-and-play data source for preference optimization across three backbone models on seven mathematical reasoning benchmarks. Results demonstrate that PNS consistently outperforms other negative sample synthesis methods, achieving an average improvement of 2.03% over RL-trained models.
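To make the composite reward more concrete, here is a minimal Python sketch of how the four terms described in the abstract could be combined. The callable interfaces, the equal default weights, and the [0, 1] normalization are all illustrative assumptions; the digest does not give the paper's exact formulation.

```python
from typing import Callable

def composite_reward(
    response: str,
    gold_answer: str,
    follows_format: Callable[[str], bool],   # does the response match the template?
    extract_answer: Callable[[str], str],    # pull the final answer out of the response
    rm_score: Callable[[str], float],        # reward-model score, assumed in [0, 1]
    cot_score: Callable[[str], float],       # chain-of-thought plausibility, assumed in [0, 1]
    weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Hypothetical composite reward for training a PNS generator.

    Each term is normalized to [0, 1] so the weighted sum is comparable
    across prompts; larger is better for the negative-sample generator.
    """
    w_fmt, w_acc, w_rm, w_cot = weights

    # Format compliance: the response follows the expected answer template.
    r_fmt = 1.0 if follows_format(response) else 0.0

    # Accuracy inversion: reward responses whose final answer is WRONG,
    # since PNS wants plausible-looking but ultimately incorrect solutions.
    r_acc = 0.0 if extract_answer(response) == gold_answer else 1.0

    # Reward-model assessment: how convincing the full response looks.
    r_rm = rm_score(response)

    # Chain-of-thought evaluation: step-level plausibility of the reasoning.
    r_cot = cot_score(response)

    return w_fmt * r_fmt + w_acc * r_acc + w_rm * r_rm + w_cot * r_cot
```

The distinctive piece is the accuracy-inversion term: unlike a standard RLHF reward, correctness is penalized, so the generator is pushed toward samples that are indistinguishable from correct solutions everywhere except the final answer.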
Source: arXiv: 2602.03516