Interpretable Discriminative Text Representations via Agreement and Label Disentanglement
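1️⃣ One-sentence summary
This paper proposes a new interpretable text-classification method that requires each feature to be both consistently identifiable by different annotators and not a direct restatement of the prediction label, yielding text representations that are clear, trustworthy, and less prone to leaking label information. Experiments show that the method significantly improves the auditability of features while maintaining classification performance.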
Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $\kappa$, and selects features by residual held-out predictive gain. A stylized analysis connects the $\kappa$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human--human and human--LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.
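The agreement screen described in the abstract can be illustrated with a minimal sketch of chance-adjusted agreement between two annotators. This is not the authors' implementation: the function names, the feature names, the toy annotations, and the 0.6 threshold are all assumptions made for illustration, and the abstract does not state which threshold LFD uses.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotation lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label everywhere; kappa is
        # undefined (0/0). Returning 0.0 is a convention assumed here.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


def screen_features(annotations, threshold=0.6):
    """Keep features whose two annotation vectors reach the kappa threshold.

    `annotations` maps a feature name to a pair of label lists, one per
    annotator (for example, two different LLMs applying the same definition).
    The default threshold is an illustrative assumption.
    """
    kept = {}
    for name, (labels_a, labels_b) in annotations.items():
        kappa = cohen_kappa(labels_a, labels_b)
        if kappa >= threshold:
            kept[name] = kappa
    return kept


# Hypothetical candidate features with toy binary annotations.
candidates = {
    "mentions_refund": ([1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
                        [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]),
    "sounds_negative": ([1, 0, 1, 0, 1, 0, 1, 0],
                        [1, 1, 0, 0, 1, 1, 0, 0]),
}
kept = screen_features(candidates)
print(sorted(kept))
```

In this toy data the first feature is applied almost identically by both annotators and survives the screen, while the second agrees only at chance level and is dropped. In the method described above, surviving candidates would then be ranked by residual held-out predictive gain, which this sketch does not cover.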
Source: arXiv: 2605.20693