arXiv submission date: 2026-04-20
📄 Abstract - LEPO: Latent Reasoning Policy Optimization for Large Language Models

Recently, latent reasoning has been introduced into large language models (LLMs) to leverage the rich information available in a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference and fail to discover diverse reasoning paths. To bridge this gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with reinforcement learning (RL). Building on this, we propose Latent Reasoning Policy Optimization (LEPO), a novel framework that applies RL directly to continuous latent representations. Specifically, in the rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in the optimization stage it constructs a unified gradient estimate for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for both discrete and latent reasoning.
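The paper's exact parameterization is not given in the abstract, but the Gumbel-Softmax trick it references can be sketched minimally as follows: add Gumbel(0, 1) noise to the token logits and apply a temperature-scaled softmax, yielding a stochastic, differentiable sample on the probability simplex (the function name and setup here are illustrative, not from the paper).

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed one-hot sample: softmax((logits + Gumbel noise) / tau).

    As tau -> 0 the sample approaches a discrete one-hot token; a larger tau
    keeps the sample soft (a point in the continuous simplex), which is the
    "controllable stochasticity" the abstract describes. Because the sampling
    path is differentiable in the logits, gradients can flow through it,
    which is what makes the latent representations RL-trainable.
    """
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))        # Gumbel(0, 1) noise
    z = (logits + gumbel) / tau
    z = z - z.max()                     # shift for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

# Toy 3-token distribution; each call yields a different soft sample.
logits = np.log(np.array([0.7, 0.2, 0.1]))
sample = gumbel_softmax(logits, tau=0.5)
```

Setting `tau` per rollout is one plausible way to trade off exploration (high temperature, diffuse samples) against exploitation (low temperature, near-discrete samples); the paper's actual schedule is not specified in the abstract.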

Top-level tags: llm, reinforcement learning
Detailed tags: latent reasoning, gumbel-softmax, policy optimization, stochastic sampling, gradient estimation

Latent Reasoning Policy Optimization: Enhancing Continuous-Space Reasoning for Large Language Models / LEPO: Latent Reasoning Policy Optimization for Large Language Models


1️⃣ One-sentence summary

This paper proposes a new framework, LEPO, that injects controllable stochasticity into the latent reasoning process of large language models via the Gumbel-Softmax technique, enabling the model to explore diverse reasoning paths in a continuous thought space and to apply reinforcement learning directly to these latent representations, which significantly improves reasoning performance.

Source: arXiv 2604.17892