📄
Abstract - A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether models are trained from scratch, initialised from pre-trained weights, or fine-tuned from large backbones. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation and fine-tuning methods often rely on CLIP's global image embedding, limiting their ability to capture the fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention over patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at this https URL.
A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
1️⃣ One-Sentence Summary
This paper proposes a lightweight adaptation method named CLIP-MHAdapter, which adds a small network module with multi-head self-attention on top of the pre-trained vision-language model CLIP, allowing it to capture fine-grained local features in street-view images more effectively and thereby achieve leading or competitive accuracy on multiple street-view attribute classification tasks at low computational cost.
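To make the described architecture more concrete, below is a minimal PyTorch sketch of the idea: a bottleneck MLP whose hidden space is processed by multi-head self-attention over CLIP patch tokens, added as a residual on top of a frozen image encoder. The module name `MHAdapterSketch`, the bottleneck width, head count, pooling, and classifier head are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MHAdapterSketch(nn.Module):
    """Illustrative bottleneck adapter with multi-head self-attention
    over patch tokens (all dimensions are assumptions, not the paper's)."""

    def __init__(self, dim: int = 768, bottleneck: int = 128, num_heads: int = 4):
        super().__init__()
        # Down-project patch tokens into a small bottleneck space.
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        # Multi-head self-attention models inter-patch dependencies.
        self.attn = nn.MultiheadAttention(bottleneck, num_heads, batch_first=True)
        # Up-project back to the CLIP embedding dimension.
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) from a frozen CLIP image encoder.
        z = self.act(self.down(patch_tokens))
        z, _ = self.attn(z, z, z)
        # Residual connection keeps the pre-trained representation intact.
        return patch_tokens + self.up(z)


# Hypothetical usage: pool adapted tokens and classify one street-view attribute.
if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)          # e.g. ViT-B/16 patch tokens
    adapter = MHAdapterSketch()
    head = nn.Linear(768, 5)                    # 5 is a placeholder class count
    logits = head(adapter(tokens).mean(dim=1))
    print(logits.shape)                         # torch.Size([2, 5])
```

In this sketch only the adapter and classifier head would be trained while the CLIP encoder stays frozen, which is what keeps the trainable-parameter budget small.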