arXiv submission date: 2026-06-02
📄 Abstract - $A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. However, we still need large ViTs, because they extract richer representations from each patch. To get the best of both worlds, good localization and rich representations, we propose $A^2$, a simple method that leverages this inverse scaling finding by decoupling where to look (a small attention model) from what to extract (a large embedding model): we crop around the attention peaks of a small model and embed the crops with a larger model. $A^2$ uses entirely pretrained features, requires no group labels, and does not require per-dataset attention or backbone training. Across 5 benchmarks, $A^2$ is competitive with backbone-matched loss-level methods like DFR, and outperforms end-to-end attention training under stronger distribution shifts.
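The decoupling described in the abstract (a small model decides where to look, a large model decides what to extract) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `small_model_attention` and `large_model_embed` are hypothetical stand-ins (patch brightness and per-channel mean) for a small and a large pretrained ViT, `crop_frac` is a made-up parameter, and only a single attention peak is used, whereas the abstract speaks of peaks in the plural.

```python
import numpy as np


def small_model_attention(image, grid=14):
    """Hypothetical stand-in for a small ViT's attention map.

    Returns a (grid, grid) map holding the mean brightness of each patch.
    A real pipeline would read attention from a pretrained small model.
    """
    H, W, C = image.shape
    patches = image.reshape(grid, H // grid, grid, W // grid, C)
    return patches.mean(axis=(1, 3, 4))


def crop_around_attention_peak(image, attn, crop_frac=0.5):
    """Crop a window centered on the strongest patch of `attn`.

    `crop_frac` (an assumed parameter) is the crop side length as a
    fraction of the image side.
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    # Index of the peak patch in the attention grid.
    pi, pj = np.unravel_index(np.argmax(attn), attn.shape)
    # Pixel coordinates of the center of that patch.
    cy = int((pi + 0.5) * H / h)
    cx = int((pj + 0.5) * W / w)
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    # Clamp the window so it stays inside the image.
    top = min(max(cy - ch // 2, 0), H - ch)
    left = min(max(cx - cw // 2, 0), W - cw)
    return image[top:top + ch, left:left + cw], (top, left)


def large_model_embed(crop):
    """Hypothetical stand-in for a large embedding model: per-channel mean."""
    return crop.mean(axis=(0, 1))


def a2_sketch(image, crop_frac=0.5):
    """Where to look (small model) followed by what to extract (large model)."""
    attn = small_model_attention(image)
    crop, origin = crop_around_attention_peak(image, attn, crop_frac)
    return large_model_embed(crop), origin


if __name__ == "__main__":
    # Synthetic image: black background with one bright "object".
    img = np.zeros((224, 224, 3), dtype=np.float32)
    img[160:192, 32:64] = 1.0
    embedding, origin = a2_sketch(img)
    print("crop origin (top, left):", origin)
    print("embedding:", embedding)
```

In this toy setting the crop keeps the whole bright square while discarding most of the background, so the embedding of the crop is dominated by the object rather than by its context. With real models, the two stand-in functions would be replaced by a small self-supervised ViT's attention map and a larger ViT's feature extractor.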

Top-level tags: computer vision, model evaluation
Detailed tags: self-supervised learning, vision transformers, attention localization, foreground object detection, representation learning

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones


1️⃣ One-sentence summary

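This paper finds that, among self-supervised pretrained vision Transformers, the attention maps of smaller models localize the main objects in an image more accurately, whereas larger models extract richer features but localize worse; the authors therefore propose $A^2$, which uses a small model to localize the object and crop the image, then uses a large model to extract features from the crop, combining the strengths of both and improving classification robustness without any additional training.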

Source: arXiv: 2606.03148