arXiv submission date: 2026-04-15
📄 Abstract - SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0%, 30%, or 100% of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass@k and semantic pass@k (the latter assessed by an LLM judge), we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0% overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100% overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points for the highest-compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.
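The 0%/30%/100% overlap conditions amount to controlling how many GRPO prompts are reused from the SFT corpus. A minimal sketch of such a split (the function name and sizes are illustrative, not from the paper):

```python
import random

def split_with_overlap(prompts, sft_size, grpo_size, overlap_frac, seed=0):
    """Partition a prompt pool into an SFT set and a GRPO set whose
    intersection covers `overlap_frac` of the GRPO prompts.

    Hypothetical helper illustrating the paper's overlap conditions;
    assumes the pool is large enough for both stages."""
    rng = random.Random(seed)
    pool = list(prompts)
    rng.shuffle(pool)
    sft = pool[:sft_size]
    n_shared = round(overlap_frac * grpo_size)
    shared = sft[:n_shared]               # GRPO prompts reused from SFT
    fresh = pool[sft_size:sft_size + grpo_size - n_shared]  # never seen in SFT
    return sft, shared + fresh
```

With `overlap_frac=0.0` the two stages see disjoint prompts; with `1.0` the GRPO set is a subset of the SFT corpus, the condition the paper finds makes the GRPO stage redundant.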

Top-level tags: llm, model training, model evaluation
Detailed tags: autoformalization, post-training, data overlap, supervised fine-tuning, policy optimization

SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization


1️⃣ One-Sentence Summary

Through controlled experiments, this paper finds that when post-training an AI model for autoformalization, using completely disjoint data for the supervised fine-tuning (SFT) and reinforcement learning (GRPO) stages significantly improves performance, whereas using identical data in both stages renders the reinforcement learning stage almost useless.
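The paper's dual metrics, compile pass@k and semantic pass@k, can each be computed with the standard unbiased pass@k estimator (applied once to compiler checks and once to LLM-judge verdicts). A minimal sketch, not necessarily the paper's exact evaluation code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations passes, given
    that c of the n generations pass."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Running it separately on "compiles" and "judged semantically correct" labels for the same generations surfaces the compile-semantic gap the abstract highlights.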

From arXiv: 2604.13515