arXiv submission date: 2026-05-20
📄 Abstract - Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $\kappa$, and selects features by residual held-out predictive gain. A stylized analysis connects the $\kappa$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human-human and human-LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.
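
The agreement screen mentioned in the abstract relies on Cohen's $\kappa$, which corrects raw agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The following is a minimal sketch of such a screen, not the authors' implementation. The function names, the 0.6 threshold, the example feature names, and the handling of the degenerate case are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Degenerate case: both annotators used one identical label throughout.
        # Returning 0.0 is a conservative choice assumed here, not a standard.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


def screen_features(annotations, threshold=0.6):
    """Keep feature names whose two annotation runs agree at or above threshold.

    `annotations` maps a feature name to a pair of label lists, e.g. the
    outputs of two different LLM annotators (hypothetical data layout).
    """
    return [
        name
        for name, (run_a, run_b) in annotations.items()
        if cohen_kappa(run_a, run_b) >= threshold
    ]


if __name__ == "__main__":
    annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
    annotator_2 = [1, 1, 0, 0, 1, 0, 0, 1]
    print(cohen_kappa(annotator_1, annotator_2))
    print(screen_features({
        "mentions_price": (annotator_1, annotator_1),
        "vague_tone": (annotator_1, annotator_2),
    }))
```

In this toy data the two annotators match on 6 of 8 items, so raw agreement is 0.75, but with balanced labels half of that is expected by chance, which is why a chance-adjusted statistic is a stricter reliability check than raw agreement.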

Top-level tags: natural language processing, llm, model evaluation
Detailed tags: interpretability, text representations, label disentanglement, feature discovery, agreement measurement

Interpretable Discriminative Text Representations via Agreement and Label Disentanglement


1️⃣ One-sentence summary

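This paper proposes a new approach to interpretable text classification: each feature must be one that different annotators can identify consistently and must not simply restate the prediction label. The resulting text representations are clear, trustworthy, and less prone to leaking label information, and experiments show that the method substantially improves the auditability of features while maintaining classification performance.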

Source: arXiv: 2605.20693