Hybrid Attribution Priors for Explainable and Robust Model Training
1️⃣ One-sentence summary
This paper proposes a new attribution prior extraction framework (CAP) that helps small language models capture fine-grained class distinctions, and combines multiple attribution priors to improve both the interpretability and the robustness of the trained model.
Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, particularly classification. As interpretability and robustness gain increasing importance, explanation-guided learning has emerged as an effective framework by introducing attribution-based supervision during training; however, deriving general and reliable attribution priors remains a significant challenge. Through an analysis of representative attribution methods in classification settings, we find that although these methods can reliably highlight class-relevant tokens, they often focus on common keywords shared by semantically similar classes. Because such classes are already difficult to distinguish under standard training, these attributions provide insufficient discriminative cues, limiting their ability to improve model differentiation. To overcome this limitation, we propose Class-Aware Attribution Prior (CAP), a novel attribution prior extraction framework that guides language models toward capturing fine-grained class distinctions and producing more salient, discriminative attribution priors. Building on this idea, we further introduce CAP Hybrid, which combines priors from CAP with those from existing attribution techniques to form a more comprehensive and balanced supervisory signal. By aligning a model's self-attribution with these enriched priors, our approach encourages the learning of diverse, decision-relevant features. Extensive experiments in full-data, few-shot, and adversarial scenarios demonstrate that our method consistently enhances both interpretability and robustness.
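The abstract outlines a three-step training recipe: extract class-aware attribution priors, blend them with priors from an existing attribution method (CAP Hybrid), and penalize divergence between the model's self-attribution and the blended prior. Below is a minimal sketch of what that alignment objective could look like. The gradient-times-input self-attribution, the convex mixing weight `alpha`, the KL alignment term, and the coefficient `lam` are all illustrative assumptions for exposition, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_attribution(model, input_embeds, labels):
    """Per-token self-attribution via gradient x input.
    (An illustrative choice; the paper's exact attribution method may differ.)"""
    input_embeds = input_embeds.detach().requires_grad_(True)
    logits = model(input_embeds)
    loss = F.cross_entropy(logits, labels)
    # create_graph=True keeps the attribution differentiable w.r.t. model params
    grads, = torch.autograd.grad(loss, input_embeds, create_graph=True)
    return (grads * input_embeds).sum(dim=-1).abs()  # (batch, seq_len)

def hybrid_prior(cap_prior, base_prior, alpha=0.5):
    """CAP Hybrid as a convex combination of CAP priors and priors from an
    existing attribution technique (alpha is a hypothetical mixing weight)."""
    return alpha * cap_prior + (1.0 - alpha) * base_prior

def attribution_alignment_loss(model, input_embeds, labels, prior, lam=0.1):
    """Task loss plus a penalty aligning the model's self-attribution with the
    hybrid prior; KL between normalized distributions is one plausible term."""
    logits = model(input_embeds)
    task_loss = F.cross_entropy(logits, labels)
    attr = self_attribution(model, input_embeds, labels)
    p = F.log_softmax(attr, dim=-1)   # model's attribution distribution
    q = F.softmax(prior, dim=-1)      # target (hybrid) prior distribution
    align = F.kl_div(p, q, reduction="batchmean")
    return task_loss + lam * align

# Toy usage: a bag-of-embeddings classifier on random data.
class ToyClassifier(torch.nn.Module):
    def __init__(self, dim=16, num_classes=4):
        super().__init__()
        self.head = torch.nn.Linear(dim, num_classes)

    def forward(self, embeds):  # embeds: (batch, seq_len, dim)
        return self.head(embeds.mean(dim=1))

model = ToyClassifier()
embeds = torch.randn(8, 12, 16)
labels = torch.randint(0, 4, (8,))
prior = hybrid_prior(torch.rand(8, 12), torch.rand(8, 12))  # stand-in priors
loss = attribution_alignment_loss(model, embeds, labels, prior)
loss.backward()
```

The key design point this sketch illustrates is that the alignment term must stay differentiable with respect to the model parameters (hence `create_graph=True`), so that gradient descent can reshape which tokens the model attends to, not just its predictions.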
Source: arXiv: 2512.14719