CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models
1️⃣ One-sentence summary
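This paper proposes a method that uses influence functions to improve the interpretability of deep learning models: it identifies the training samples that most influence a prediction (both helpful and harmful) and, for the first time, locates the key concepts inside Concept Bottleneck Models, so that adjusting those samples or concepts changes model behavior and makes the model's decision process more transparent.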
In recent years, the black-box nature of deep learning models has limited their application in high-stakes domains such as medical diagnosis and finance, where interpretability is essential. To address this, we propose a novel approach using influence functions to enhance interpretability in NLP models at both the sample and concept levels. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify the most impactful training samples, both helpful and harmful, on model predictions. By adjusting the labels and weights of these samples, we demonstrate that model performance can be restored to baseline levels without retraining, confirming the value of influence functions for efficient data debugging. Furthermore, our concept-level analysis identifies key concepts within Concept Bottleneck Models (CBM) that significantly affect predictions. Modifying these concepts alters model behavior observably, providing clear insights into the decision process.
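As a rough illustration of the sample-level idea in the abstract, the sketch below scores training points with the classic influence-function estimate (minus the test-loss gradient, times the inverse Hessian, times the training-point gradient) on a small L2-regularized logistic regression. It is a toy under stated assumptions, not the paper's implementation: the model, the synthetic data, and the names `fit_logreg` and `influence_on_test_loss` are hypothetical, and the paper's NLP and CBM setup is not reproduced here.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_logreg(X, y, l2=0.1, steps=50):
    """Fit L2-regularized logistic regression with Newton's method (toy model)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n + l2 * w
        H = (X.T * (p * (1 - p))) @ X / n + l2 * np.eye(d)
        w -= np.linalg.solve(H, grad)
    return w


def influence_on_test_loss(X, y, w, x_test, y_test, l2=0.1):
    """Estimated change in test loss per unit upweighting of each training point.

    Assumption: "upweighting" adds epsilon times the point's data loss
    (without the regularizer) to the training objective.
    """
    n, d = X.shape
    p = sigmoid(X @ w)
    # Hessian of the regularized training objective at w.
    H = (X.T * (p * (1 - p))) @ X / n + l2 * np.eye(d)
    # Gradient of the log loss on the single test point.
    g_test = (sigmoid(x_test @ w) - y_test) * x_test
    # Per-sample gradients of the log loss on the training points, shape (n, d).
    g_train = (p - y)[:, None] * X
    return -g_train @ np.linalg.solve(H, g_test)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 3))  # synthetic features, illustration only
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
    w = fit_logreg(X, y)
    scores = influence_on_test_loss(X, y, w, X[0], y[0])
    print("most harmful training index:", int(np.argmax(scores)))
    print("most helpful training index:", int(np.argmin(scores)))
```

Under this sign convention, a positive score means upweighting that training point is predicted to raise the test loss (harmful), and a negative score means it is predicted to lower it (helpful).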
Source: arXiv: 2605.19848