arXiv submission date: 2026-02-10
📄 Abstract - Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.
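
The parameter count in the abstract pins down the adapter's form: a single scalar gain plus a $d_\text{model}$-dimensional bias. Below is a minimal PyTorch sketch of such a scalar affine adapter, assuming the adapted vector is patched into the frozen LM's hidden states at a placeholder position; the names (`ScalarAffineAdapter`, `training_step`, `inject_and_score`) and the injection mechanism are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ScalarAffineAdapter(nn.Module):
    """Map an interpretability artifact v into the LM's residual space:
    adapter(v) = scale * v + bias, i.e. d_model + 1 trainable parameters."""

    def __init__(self, d_model: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))        # 1 scalar parameter
        self.bias = nn.Parameter(torch.zeros(d_model))  # d_model bias parameters

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.scale * v + self.bias


# Hypothetical training step on one (vector, label) pair. `inject_and_score`
# stands in for whatever hook patches the adapted vector into a placeholder
# position of the frozen LM and returns logits over the label tokens; the LM
# itself receives no gradient updates.
def training_step(adapter, frozen_lm, vector, label_token_ids,
                  inject_and_score, optimizer):
    optimizer.zero_grad()
    adapted = adapter(vector)                          # shape: (d_model,)
    logits = inject_and_score(frozen_lm, adapted)      # shape: (label_len, vocab)
    loss = nn.functional.cross_entropy(logits, label_token_ids)
    loss.backward()                                    # gradients reach only scale and bias
    optimizer.step()
    return loss.item()
```

Under this parameterization, the bias-only ablation mentioned in the abstract corresponds roughly to fixing `scale` at 1 and training only `bias`, which alone accounts for 85% of the improvement.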

Top-level tags: llm model evaluation natural language processing
Detailed tags: self-interpretation interpretability adapters sparse autoencoders model scaling

Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs


1️⃣ One-sentence summary

This paper proposes a new method: by attaching an extremely simple "translator" (an adapter) to a frozen large language model, the model can reliably explain its own internal workings, and the approach becomes more effective as the model grows larger.

Source: arXiv 2602.10352