arXiv submission date: 2026-06-10
📄 Abstract - Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.

Top-level tags: llm, model training, model evaluation
Detailed tags: post-training, interpretability, preference data, learning signal, data auditing

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal


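1️⃣ One-sentence summary

This paper proposes an interpretability-based, data-centric post-training method: by analyzing the concept-level features implicit in preference data, it lets researchers explicitly identify and intervene on the behaviors a model learns (such as over-stylization or sycophancy), turning what was black-box reward optimization into an auditable, customizable process of shaping the learning signal.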

Source: arXiv: 2606.12360