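How Post-Training Shapes Biological Reasoning Models
1️⃣ One-Sentence Summary
Through controlled comparison experiments, this paper finds that when post-training reasoning models for biological data, each stage (continued pre-training, supervised fine-tuning, reinforcement learning) affects in-domain and out-of-domain performance differently. Reinforcement learning can partially recover the generalization lost to over-specialization during supervised fine-tuning, so the best strategy is to shorten supervised fine-tuning and invest more in reinforcement learning.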
Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.
Source: arXiv: 2606.16517