arXiv submission date: 2026-06-14
📄 Abstract - Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same model act as a teacher conditioned on privileged information, producing a dense per-token signal. But the common choice of a ground-truth answer is only an endpoint cue: on terse-answer tasks, the teacher falls silent at the intermediate positions where path-level guidance matters most. We propose Hindsight Self-Distillation (HSD), which conditions the teacher on a successful peer rollout drawn from the current training group. Such a peer is an exact sample from the success-conditioned policy, requiring no additional sampled rollouts. By providing a full successful continuation rather than only the final answer, the resulting credit signal concentrates at the divergence position between a failed rollout and a successful peer. Across Qwen3-8B and Qwen3-32B on math and code benchmarks, HSD obtains the best result against GRPO variants and on-policy distillation baselines, with the largest gains on terse-answer tasks such as AIME.
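The idea of a credit signal that concentrates at the divergence position can be illustrated with a toy sketch. The abstract does not specify HSD's loss or implementation, so everything here is an assumption made for illustration: the helper names `first_divergence` and `per_token_gap`, the token IDs, and the log-probability values are all hypothetical. The sketch only shows how one might locate the first token where a failed rollout departs from a successful peer, and how a per-token teacher-minus-student gap could peak there.

```python
from typing import List, Optional, Sequence


def first_divergence(failed: Sequence[int], peer: Sequence[int]) -> Optional[int]:
    """Return the first index where the two token sequences differ.

    Returns None when the sequences are identical. If one sequence is a
    strict prefix of the other, the divergence is the shorter length.
    """
    for i, (a, b) in enumerate(zip(failed, peer)):
        if a != b:
            return i
    if len(failed) != len(peer):
        return min(len(failed), len(peer))
    return None


def per_token_gap(student_logps: Sequence[float],
                  teacher_logps: Sequence[float]) -> List[float]:
    """Teacher minus student log-probability for each token of the failed rollout.

    Hypothetical stand-in for a dense per-token signal; not the paper's loss.
    """
    if len(student_logps) != len(teacher_logps):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for s, t in zip(student_logps, teacher_logps)]


# Made-up token IDs: both rollouts share the prefix [5, 8], then split.
failed_rollout = [5, 8, 2, 9, 4]
peer_rollout = [5, 8, 7, 3, 1, 6]
divergence_index = first_divergence(failed_rollout, peer_rollout)

# Made-up log-probs of the failed rollout's tokens. The "teacher" stands for
# the same model conditioned on the successful peer, so it is assumed to
# disagree most with the student at the token where the paths split.
student = [-0.1, -0.2, -0.3, -0.4, -0.5]
teacher = [-0.1, -0.2, -2.3, -0.9, -0.6]
gaps = per_token_gap(student, teacher)

# Position with the most negative gap, i.e. the strongest penalty.
peak_index = min(range(len(gaps)), key=gaps.__getitem__)
```

In this toy setup the largest penalty falls on the same index as the first diverging token, which is the behavior the abstract describes qualitatively. A real system would compute the two log-probability sequences from model forward passes rather than from fixed lists.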

Top-level tags: llm reinforcement learning model training
Detailed tags: credit assignment self-distillation reasoning chain-of-thought verifiable reward

Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning


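1️⃣ One-sentence summary

This paper proposes a new method called Hindsight Self-Distillation (HSD), which has the model refer during training to successful reasoning paths from the same batch rather than relying only on the final answer, so that it more precisely identifies and reinforces the key decision points in the reasoning chain that lead to success, significantly improving performance on math and code reasoning tasks.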

Source: arXiv: 2606.15576