arXiv submission date: 2026-02-03
📄 Abstract - ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process--where insights from post-training retroactively improve the pre-trained foundation--remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains gains of over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.
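
The abstract does not spell out the reweighting rule, so the following is only a minimal sketch of what RL-guided token reweighting during mid-training could look like: per-token cross-entropy is scaled by weights derived from the gap between the RL-tuned model's and the base model's log-probabilities on the ground-truth token. The weighting rule, the `remit_style_loss` name, and the Hugging Face-style causal LM interface are all illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def remit_style_loss(base_model, rl_model, input_ids, attention_mask, temperature=1.0):
    """Token-reweighted next-token loss for mid-training (illustrative sketch).

    Assumption: tokens on which the RL-tuned model is more confident than the
    base model are treated as "pivotal for reasoning" and upweighted; the
    actual ReMiT weighting scheme may differ.
    """
    labels = input_ids[:, 1:]                                   # next-token targets
    logits = base_model(input_ids, attention_mask=attention_mask).logits[:, :-1]

    with torch.no_grad():
        rl_logits = rl_model(input_ids, attention_mask=attention_mask).logits[:, :-1]
        # Log-probability of the ground-truth token under each model.
        rl_logp = torch.log_softmax(rl_logits, -1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
        base_logp = torch.log_softmax(logits, -1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
        # Hypothetical rule: softmax over the per-sequence RL-vs-base log-prob gap,
        # rescaled so the average token weight stays close to 1.
        weights = torch.softmax((rl_logp - base_logp) / temperature, dim=-1) * labels.size(1)

    token_ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1), reduction="none").view_as(labels)
    mask = attention_mask[:, 1:].float()                        # ignore padded positions
    return (weights * token_ce * mask).sum() / mask.sum()
```

Under the flywheel described in the abstract, `rl_model` would be the RL-tuned checkpoint from the previous round, and the re-annealed base model trained with this loss would then feed the next round of post-training.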

Top-level tags: llm model training reinforcement learning
Detailed tags: iterative training mid-training token reweighting reasoning pre-training

ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution


1️⃣ One-Sentence Summary

This paper proposes ReMiT, a method that uses an RL-tuned model to guide the LLM during the critical mid-training phase at the end of pre-training, dynamically reweighting training tokens so that reasoning-relevant knowledge is learned first; this forms a self-reinforcing loop that continuously improves the model's capabilities in math, code, and general reasoning.

Source: arXiv: 2602.03075