arXiv submission date: 2026-02-02
📄 Abstract - Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner

Recent reinforcement learning (RL) methods improve LLM reasoning by optimizing discrete Chain-of-Thought (CoT) generation; however, exploration in token space often suffers from diversity collapse as policy entropy decreases due to mode elicitation behavior in discrete RL. To mitigate this issue, we propose Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), a framework that conducts exploration directly in a continuous latent space, where latent variables encode semantic-level reasoning trajectories. By modeling exploration via guided diffusion, multi-step denoising distributes stochasticity and preserves multiple coexisting solution modes without mutual suppression. Furthermore, by decoupling latent-space exploration from text-space generation, we show that latent diffusion-based optimization is more effective than text-space policy optimization alone, while a complementary text policy provides additional gains when combined with latent exploration. Experiments on code generation and mathematical reasoning benchmarks demonstrate consistent improvements in both pass@1 and pass@k over discrete RL baselines, with absolute pass@1 gains of +9.4% on code generation and +5.7% on mathematical reasoning, highlighting diffusion-based latent RL as a principled alternative to discrete token-level RL for reasoning.
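The abstract's core claim is that multi-step denoising "distributes stochasticity" so that multiple solution modes coexist instead of collapsing to one. The paper's actual sampler, guidance scheme, and latent encoder are not described here, so the following is only a toy sketch under strong assumptions: a simple DDPM-style reverse process over a 2-D latent with a hypothetical bimodal score function (`toy_score`, `denoise_step`, and `sample_latent` are all illustrative names, not the paper's API). The point it illustrates is that fresh noise injected at every reverse step lets different chains settle into different modes.

```python
import numpy as np

def denoise_step(z, t, score_fn, beta=0.05, rng=None):
    """One reverse-diffusion step: deterministic drift toward high-density
    regions plus fresh Gaussian noise (no noise at the final step t=0)."""
    rng = rng or np.random.default_rng()
    drift = z + beta * score_fn(z)          # move toward a nearby mode
    noise = np.sqrt(beta) * rng.standard_normal(z.shape) if t > 0 else 0.0
    return drift + noise

def sample_latent(dim, steps, score_fn, seed=0):
    """Run the full reverse chain from pure noise to a latent vector."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(dim)            # start from N(0, I)
    for t in reversed(range(steps)):
        z = denoise_step(z, t, score_fn, rng=rng)
    return z

# Hypothetical bimodal target: attractors near +3 and -3 on each axis,
# standing in for two distinct "reasoning modes" in latent space.
def toy_score(z):
    return 3.0 * np.tanh(z) - z

# Different seeds can land in different modes -- stochasticity is spread
# across all denoising steps rather than a single sampling decision.
samples = [sample_latent(dim=2, steps=50, score_fn=toy_score, seed=s)
           for s in range(8)]
signs = {tuple(np.sign(s).astype(int)) for s in samples}
print(f"distinct mode signatures reached: {len(signs)}")
```

In a greedy or low-entropy token-level policy, all eight chains would typically commit to the same mode; here each chain's endpoint depends on noise accumulated over the whole trajectory, which is the intuition behind the diversity-preservation argument.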

Top-level tags: llm, reinforcement learning, model training
Detailed tags: latent diffusion, reasoning, diversity preservation, chain-of-thought, exploration

Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner


1️⃣ One-Sentence Summary

This paper proposes LaDi-RL, a method that optimizes the reasoning process of large language models through diffusion-guided exploration in a continuous latent space. By avoiding the collapse in chain-of-thought diversity that conventional reinforcement learning induces, it achieves better performance on code generation and mathematical reasoning tasks.

Source: arXiv 2602.01705