arXiv submission date: 2026-06-03
📄 Abstract - Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.
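The abstract says the final outcome reward is redistributed across blocks according to the tokens revealed in each block, but it does not give the exact rule. As a rough illustration only, the sketch below assumes a simple proportional split: each block receives a share of the reward equal to its share of revealed tokens. The function name `redistribute_reward` and the equal-split fallback are hypothetical and are not taken from the paper.

```python
from typing import List


def redistribute_reward(outcome_reward: float,
                        tokens_revealed_per_block: List[int]) -> List[float]:
    """Split one scalar outcome reward across blocks.

    ASSUMPTION: the split is proportional to the number of tokens each
    block revealed. The paper may weight blocks differently.
    """
    n_blocks = len(tokens_revealed_per_block)
    if n_blocks == 0:
        return []

    total_tokens = sum(tokens_revealed_per_block)
    if total_tokens == 0:
        # Hypothetical fallback: nothing was revealed, so split evenly.
        return [outcome_reward / n_blocks] * n_blocks

    return [outcome_reward * k / total_tokens
            for k in tokens_revealed_per_block]


# Example: three blocks revealing 8, 16, and 8 tokens share a reward of 1.0.
block_rewards = redistribute_reward(1.0, [8, 16, 8])
print(block_rewards)
```

In CAPR as described, per-block targets of this kind would supervise the block-level value head, which in turn produces block-level PPO weights; that training step is not shown here.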

Top-level tags: llm, reinforcement learning, model training
Detailed tags: diffusion language models, trajectory-aware, value head, block-wise supervision, compute efficiency

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models


1️⃣ One-sentence summary

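The paper proposes CAPR, a new reinforcement learning algorithm that exploits the "denoising trace" produced while a diffusion language model generates text (that is, the process by which the token at each position gradually becomes fixed). This lets it assign rewards at a fine granularity similar to tree search without running an expensive tree search, significantly improving performance on tasks such as mathematical reasoning at lower compute cost.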

Source: arXiv: 2606.04396