Coupled Variational Reinforcement Learning for Language Model General Reasoning
1️⃣ One-sentence summary
This paper proposes a new method called CoVRL, which combines variational inference with reinforcement learning so that language models can generate logically coherent reasoning traces more efficiently without external verification, significantly improving performance on mathematical and general reasoning tasks.
While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by using the intrinsic probability that the LLM assigns to reference answers as the reward signal. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
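To make the abstract's idea more concrete, the following is a minimal math sketch of how a verifier-free reward, a hybrid sampling distribution, and a variational-style objective could be written down. The symbols here (the posterior q_phi, the mixing weight alpha, the KL weight beta) are illustrative assumptions introduced for exposition, not the paper's exact formulation.

```latex
% Hedged sketch, not CoVRL's exact objective.
% x: question, y^*: reference answer, z: sampled reasoning trace.
% pi_theta(z | x): prior, samples traces from the question alone.
% q_phi(z | x, y^*): assumed posterior that also conditions on the reference answer.

% Verifier-free reward: the model's own likelihood of the reference answer given the trace.
\[
  r(z) \;=\; \log p_\theta\!\left(y^{*} \mid x, z\right)
\]

% Assumed hybrid/composite sampling distribution coupling prior and posterior
% (alpha is an illustrative mixing weight).
\[
  \tilde{\pi}(z \mid x, y^{*}) \;=\; \alpha\, q_\phi(z \mid x, y^{*}) \;+\; (1-\alpha)\, \pi_\theta(z \mid x)
\]

% ELBO-like objective: expected verifier-free reward under the composite distribution,
% regularized so the posterior stays close to the prior (beta is an assumed weight).
\[
  \mathcal{J}(\theta, \phi) \;=\;
  \mathbb{E}_{z \sim \tilde{\pi}}\!\left[ \log p_\theta\!\left(y^{*} \mid x, z\right) \right]
  \;-\; \beta\, \mathrm{KL}\!\left( q_\phi(z \mid x, y^{*}) \,\middle\|\, \pi_\theta(z \mid x) \right)
\]
```

Under this reading, sampling part of the traces from a distribution that sees the reference answer is what keeps exploration efficient and the traces coherent with the final answer, while the KL term keeps the answer-conditioned sampler anchored to the question-only policy.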
Source: arXiv:2512.12576