Abstract - Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models
How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than $15{,}000$ behavioral responses, analyzing the $13{,}400$ responses from the $377$ participants who passed our pre-specified quality checks. Foundation models are consistently *less* interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.
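The abstract does not define the chance-anchored scoring function. As an illustration only, one common way to put tasks with different guessing rates on a common scale is to rescale raw accuracy so that chance performance maps to 0 and perfect performance maps to 1. The sketch below assumes that form; the function name `chance_anchored_score` and the example numbers are hypothetical and are not taken from the paper.

```python
def chance_anchored_score(accuracy: float, chance: float) -> float:
    """Rescale raw accuracy so that chance maps to 0 and perfect maps to 1.

    Assumed form (not confirmed by the paper): (accuracy - chance) / (1 - chance).
    Scores below 0 indicate worse-than-chance performance.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return (accuracy - chance) / (1.0 - chance)


# Hypothetical numbers: two tasks with different chance levels
# land on the same point of the common scale.
two_way = chance_anchored_score(0.75, chance=0.5)     # 2 alternatives
four_way = chance_anchored_score(0.625, chance=0.25)  # 4 alternatives
print(two_way, four_way)
```

Under this assumed normalization, a model scoring 75% on a two-alternative task and one scoring 62.5% on a four-alternative task receive the same score, which is the kind of comparability a chance-anchored scale is meant to provide.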
Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models
1️⃣ One-sentence summary
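Using two psychophysics experiments (localizability and nameability), this paper measures the interpretability of six mainstream vision models and finds that vision foundation models (such as DINOv2 and CLIP), despite their strong capabilities, have features that are less interpretable to humans than those of earlier supervised models, and that interpretability is unrelated to model capability, depending instead on the locality of feature activations and on coarse-grained semantic alignment.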