Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

📄 Abstract

Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same model act as a teacher conditioned on privileged information, producing a dense per-token signal. But the common choice of a ground-truth answer is only an endpoint cue: on terse-answer tasks, the teacher falls silent at the intermediate positions where path-level guidance matters most. We propose Hindsight Self-Distillation (HSD), which conditions the teacher on a successful peer rollout drawn from the current training group. Such a peer is an exact sample from the success-conditioned policy, requiring no additional sampled rollouts. By providing a full successful continuation rather than only the final answer, the resulting credit signal concentrates at the divergence position between a failed rollout and a successful peer. Across Qwen3-8B and Qwen3-32B on math and code benchmarks, HSD obtains the best result against GRPO variants and on-policy distillation baselines, with the largest gains on terse-answer tasks such as AIME.
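To make the idea of a "divergence position" concrete, here is a minimal sketch, not the paper's implementation. It assumes rollouts are lists of token ids, that a reward above 0 marks success, and that each failed rollout is paired with the successful peer sharing the longest prefix. The function names `first_divergence` and `pair_failures_with_peers` and the longest-prefix pairing rule are hypothetical choices made for illustration. The abstract only states that the peer is drawn from the current training group.

```python
from typing import List, Sequence, Tuple


def first_divergence(failed: Sequence[int], peer: Sequence[int]) -> int:
    """Return the index of the first token at which the two rollouts differ.

    If one rollout is a prefix of the other, return the shorter length.
    """
    for i, (a, b) in enumerate(zip(failed, peer)):
        if a != b:
            return i
    return min(len(failed), len(peer))


def pair_failures_with_peers(
    rollouts: List[Sequence[int]],
    rewards: List[float],
) -> List[Tuple[int, int, int]]:
    """Pair each failed rollout with a successful peer from the same group.

    Returns (failed_index, peer_index, divergence_position) triples.
    Assumption (not from the source): the peer is chosen as the successful
    rollout that shares the longest prefix with the failed one.
    """
    successes = [i for i, r in enumerate(rewards) if r > 0]
    failures = [i for i, r in enumerate(rewards) if r <= 0]
    pairs = []
    for f in failures:
        if not successes:
            break  # no successful peer in this group, so nothing to condition on
        # max() keeps the first success on ties, so the choice is deterministic
        peer = max(successes, key=lambda s: first_divergence(rollouts[f], rollouts[s]))
        pairs.append((f, peer, first_divergence(rollouts[f], rollouts[peer])))
    return pairs


if __name__ == "__main__":
    group = [[1, 2, 3, 4], [1, 2, 5, 6], [1, 9, 9, 9]]
    group_rewards = [1.0, 0.0, 0.0]
    print(pair_failures_with_peers(group, group_rewards))
```

In a training loop, the divergence position would indicate where a peer-conditioned teacher and the student are expected to disagree most. How the per-token signal is actually computed from the teacher is not specified in the abstract and is left out of this sketch.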


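1️⃣ One-Sentence Summary

This paper proposes a new method called Hindsight Self-Distillation, in which the model consults successful reasoning paths from the same training group during training rather than relying only on the final answer, so that it can more precisely identify and reinforce the key decision points in the reasoning chain that lead to success, significantly improving performance on math and code reasoning tasks.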
