
arXiv submission date: 2026-01-20
📄 Abstract - InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing the error to the specific step that caused the failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.
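
The abstract describes the InT data-construction loop in prose; the sketch below shows one plausible reading of it in Python. The prompt wording, the step-splitting heuristic, and the `generate` callable are illustrative assumptions for this summary, not the authors' implementation.

```python
# Illustrative sketch of InT data construction (not the paper's code).
from typing import Callable, Dict, List

def split_into_steps(trace: str) -> List[str]:
    # Assumption: one reasoning step per non-empty line.
    return [line for line in trace.splitlines() if line.strip()]

def build_int_example(
    generate: Callable[[str], str],   # any LLM call: prompt -> completion
    problem: str,
    reference_solution: str,
) -> Dict[str, str]:
    """Turn one failed rollout into an SFT example: prefix up to the first error + intervention."""
    # 1. Sample an on-policy reasoning trace for the problem.
    rollout_prompt = f"Problem: {problem}\nSolve step by step."
    steps = split_into_steps(generate(rollout_prompt))

    # 2. Self-verification: with the reference solution (verifying is easier
    #    than generating), ask for the index of the first incorrect step.
    verify_prompt = (
        f"Problem: {problem}\nReference solution: {reference_solution}\n"
        "Candidate steps:\n"
        + "\n".join(f"{i}. {s}" for i, s in enumerate(steps))
        + "\nReturn only the index of the first incorrect step."
    )
    first_error = int(generate(verify_prompt).strip())

    # 3. Self-proposed intervention: one short corrective step that redirects
    #    the trajectory toward the correct solution.
    intervene_prompt = (
        f"Problem: {problem}\nReference solution: {reference_solution}\n"
        "Trace so far:\n" + "\n".join(steps[:first_error])
        + "\nWrite a single corrected next step."
    )
    intervention = generate(intervene_prompt).strip()

    # 4. SFT target = rollout prefix up to the error + the intervention,
    #    localizing credit to the step that caused the failure.
    target = "\n".join(steps[:first_error] + [intervention])
    return {"prompt": rollout_prompt, "completion": target}
```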

Top-level tags: llm model training theory
Detailed tags: credit assignment reasoning reinforcement learning supervised fine-tuning mathematical reasoning

Intervention Training: Solving the Credit Assignment Problem in LLM Reasoning / InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning


1️⃣ One-Sentence Summary

This paper proposes a new method called Intervention Training (InT): the model verifies its own reasoning against a reference solution and generates a single-step corrective instruction, precisely locating and fixing the first error in its reasoning trace. This addresses the long-standing credit-assignment problem in outcome-reward reinforcement learning and significantly improves performance on complex tasks such as mathematical reasoning.


2️⃣ Key Contributions

1. Intervention Training (InT) paradigm: the model performs fine-grained credit assignment on its own reasoning traces, rather than receiving a single outcome-level reward for the whole trace.

2. Self-proposed interventions: using the reference solution, and exploiting the fact that verifying a solution is easier than generating one, the model identifies the first error in its own rollout and proposes a short, single-step correction that steers the trajectory toward the correct solution.

3. Two-stage training pipeline: supervised fine-tuning on the rollout prefix up to the error concatenated with the intervention, followed by standard outcome-reward RL from this stronger initialization (see the sketch after this list).
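
Item 3 amounts to a two-stage recipe. The sketch below shows one way the stages could be wired together, reusing `build_int_example` from the earlier sketch; `sft_train`, `rl_finetune`, and `answer_is_correct` are hypothetical placeholders for a standard SFT trainer, an outcome-reward RL trainer, and an answer checker, and are not code from the paper.

```python
# Illustrative two-stage pipeline: InT (SFT on self-proposed interventions),
# then outcome-reward RL. All trainer/checker arguments are placeholders.
from typing import Callable, Dict, Iterable, List, Tuple

def run_int_pipeline(
    model,
    generate: Callable[[str], str],
    train_set: Iterable[Tuple[str, str, str]],      # (problem, reference_solution, final_answer)
    sft_train: Callable,                            # placeholder SFT trainer
    rl_finetune: Callable,                          # placeholder outcome-reward RL trainer
    answer_is_correct: Callable[[str, str], bool],  # placeholder answer checker
):
    # Stage 1: Intervention Training. Build SFT examples only from rollouts
    # whose final answer is wrong; those are the traces that need a correction.
    sft_data: List[Dict[str, str]] = []
    for problem, reference_solution, final_answer in train_set:
        rollout = generate(f"Problem: {problem}\nSolve step by step.")
        if answer_is_correct(rollout, final_answer):
            continue  # correct traces need no intervention
        # Note: build_int_example re-samples a rollout internally; kept simple here.
        sft_data.append(build_int_example(generate, problem, reference_solution))
    model = sft_train(model, sft_data)

    # Stage 2: standard outcome-reward RL, starting from the InT-initialized
    # model, with a binary reward on the final answer.
    model = rl_finetune(
        model, train_set,
        reward_fn=lambda pred, ans: float(answer_is_correct(pred, ans)),
    )
    return model
```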


3️⃣ Main Results and Value

Result highlights: after InT and subsequent RL fine-tuning, the 4B-parameter model improves accuracy on IMO-AnswerBench by nearly 14% over its base, outperforming larger open-source models such as gpt-oss-20b.

Practical value: the InT-trained model is a far better initialization for RL, and it localizes credit to the step that caused failure without training a separate process reward model, which the authors note is hard to optimize accurately.


4️⃣ Glossary

Credit assignment: deciding which individual reasoning steps deserve reward or blame, rather than scoring only the final answer.

Outcome-reward RL: reinforcement learning that rewards or penalizes an entire reasoning trace based solely on whether the final answer is correct.

Process reward model: a model trained to score intermediate reasoning steps; optimizing such models to reliably identify corrective steps remains challenging.

Intervention: a short, targeted, single-step correction proposed by the model itself that redirects a failing trajectory toward the correct solution.

SFT (supervised fine-tuning): here, training on the on-policy rollout up to the first error concatenated with the intervention.

Source: arXiv:2601.14209