arXiv submission date: 2026-03-05
📄 Abstract - Layer by layer, module by module: Choose both for optimal OOD probing of ViT

Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.

Top-level tags: computer vision, model evaluation, machine learning
Detailed tags: vision transformer, out-of-distribution, linear probing, intermediate layers, distribution shift

Layer by layer, module by module: Choose both for optimal OOD probing of ViT


1️⃣ One-sentence summary

This paper finds that when a vision Transformer is applied to a downstream task whose data differ substantially from the pretraining data, extracting features from specific modules inside intermediate layers (such as the activation inside the feedforward network) yields better performance than probing the final layer or the full transformer block output.
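To make the module-level probe points concrete, here is a minimal numpy sketch of one pre-norm transformer block that exposes three candidate feature locations: the normalized MHSA output, the hidden activation inside the FFN, and the block output. The block structure, the exact placement of the probe points, and the closed-form ridge probe are my own illustrative assumptions, not the paper's implementation; the data and labels are synthetic.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class Block:
    """Toy pre-norm transformer block exposing three probe points (assumed layout)."""
    def __init__(self, d, d_ff, rng):
        s = 1 / np.sqrt(d)
        self.Wq, self.Wk, self.Wv, self.Wo = (rng.normal(0, s, (d, d)) for _ in range(4))
        self.W1 = rng.normal(0, s, (d, d_ff))
        self.W2 = rng.normal(0, 1 / np.sqrt(d_ff), (d_ff, d))

    def forward(self, x):
        h = layer_norm(x)
        q, k, v = h @ self.Wq, h @ self.Wk, h @ self.Wv
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
        att = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
        att /= att.sum(-1, keepdims=True)
        attn_out = (att @ v) @ self.Wo
        x = x + attn_out                                  # residual after attention
        ffn_hidden = gelu(layer_norm(x) @ self.W1)        # activation inside the FFN
        x = x + ffn_hidden @ self.W2                      # residual after FFN
        probes = {
            "mhsa_norm": layer_norm(attn_out),  # normalized MHSA output
            "ffn_act": ffn_hidden,              # FFN hidden activation
            "block_out": x,                     # standard block-output probe
        }
        return x, probes

def linear_probe_acc(feats, labels, num_classes):
    # Linear probe on the first ("CLS"-like) token via closed-form ridge regression.
    X = feats[:, 0, :]
    Y = np.eye(num_classes)[labels]
    W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ Y)
    return float((np.argmax(X @ W, axis=1) == labels).mean())

rng = np.random.default_rng(0)
d, d_ff, N, T, C = 16, 32, 64, 5, 3
tokens = rng.normal(size=(N, T, d))          # synthetic token embeddings
labels = rng.integers(0, C, size=N)          # synthetic labels (chance-level task)

_, probes = Block(d, d_ff, rng).forward(tokens)
accs = {name: linear_probe_acc(f, labels, C) for name, f in probes.items()}
print(accs)
```

On real data the same loop would be repeated per layer, comparing probe accuracy across both depth and module; here the labels are random, so the accuracies only demonstrate the mechanics, not the paper's finding.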

Source: arXiv:2603.05280