arXiv submission date: 2026-04-20
📄 Abstract - LEPO: Latent Reasoning Policy Optimization for Large Language Models

Recently, latent reasoning has been introduced into large language models (LLMs) to leverage the rich information available in a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference and fail to discover diverse reasoning paths. To bridge this gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with reinforcement learning (RL). Building on this, we propose Latent Reasoning Policy Optimization (LEPO), a novel framework that applies RL directly to continuous latent representations. Specifically, in the rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in the optimization stage it constructs a unified gradient estimate for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for both discrete and latent reasoning.
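The paper's exact parameterization is not given in the abstract, but the Gumbel-Softmax trick it references can be sketched minimally as follows: add Gumbel(0, 1) noise to the token logits and apply a temperature-scaled softmax, yielding a stochastic, differentiable sample on the probability simplex (the function name and setup here are illustrative, not from the paper).

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed one-hot sample: softmax((logits + Gumbel noise) / tau).

    As tau -> 0 the sample approaches a discrete one-hot token; a larger tau
    keeps the sample soft (a point in the continuous simplex), which is the
    "controllable stochasticity" the abstract describes. Because the sampling
    path is differentiable in the logits, gradients can flow through it,
    which is what makes the latent representations RL-trainable.
    """
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))        # Gumbel(0, 1) noise
    z = (logits + gumbel) / tau
    z = z - z.max()                     # shift for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

# Toy 3-token distribution; each call yields a different soft sample.
logits = np.log(np.array([0.7, 0.2, 0.1]))
sample = gumbel_softmax(logits, tau=0.5)
```

Setting `tau` per rollout is one plausible way to trade off exploration (high temperature, diffuse samples) against exploitation (low temperature, near-discrete samples); the paper's actual schedule is not specified in the abstract.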

Top-level tags: llm, reinforcement learning
Detailed tags: latent reasoning, gumbel-softmax, policy optimization, stochastic sampling, gradient estimation

Latent Reasoning Policy Optimization: Enhancing Continuous-Space Reasoning for Large Language Models / LEPO: Latent Reasoning Policy Optimization for Large Language Models


1️⃣ One-sentence summary

This paper proposes a new framework, LEPO, that injects controllable stochasticity into the latent reasoning process of large language models via the Gumbel-Softmax technique, enabling the model to explore diverse reasoning paths in a continuous thought space and to apply reinforcement learning directly to these latent representations, which significantly improves reasoning performance.

Source: arXiv 2604.17892