Prototype-Based Test-Time Adaptation of Vision-Language Models
1️⃣ One-Sentence Summary
To overcome the speed and accuracy limitations of existing test-time adaptation methods (e.g., cache-based approaches), this paper proposes a new method built on class-specific knowledge prototypes: it accumulates knowledge by adaptively weighting and fusing the features of each test sample, requiring no cache population or retrieval, and thus achieves state-of-the-art performance on 15 image recognition and 4 point cloud analysis benchmarks while retaining extremely high inference speed.
Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiency in large-scale settings. Second, performance degrades when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. In particular, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
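The abstract's core mechanism — folding each test sample's visual feature into a class-specific prototype, weighted by its zero-shot confidence, instead of storing samples in a cache — can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (class/variable names, the temperature value, and the logit-fusion weight `alpha` are hypothetical), not the paper's actual implementation:

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class PrototypeTTA:
    """Hypothetical sketch of confidence-weighted class prototypes.

    Each test sample's feature is accumulated into the prototype of its
    zero-shot predicted class, weighted by the zero-shot confidence, so
    no per-sample cache (and no cache retrieval) is ever needed.
    """

    def __init__(self, text_features):
        # text_features: (C, D) zero-shot class embeddings from the text encoder
        self.text = l2norm(np.asarray(text_features, dtype=np.float64))
        num_classes, dim = self.text.shape
        self.protos = np.zeros((num_classes, dim))  # running weighted feature sums
        self.weights = np.zeros(num_classes)        # total confidence mass per class

    def update_and_predict(self, feat, alpha=1.0):
        feat = l2norm(np.asarray(feat, dtype=np.float64))
        zs_logits = self.text @ feat            # zero-shot cosine similarities
        probs = np.exp(100.0 * zs_logits)       # temperature ~100, as in CLIP
        probs /= probs.sum()
        c = int(probs.argmax())                 # zero-shot predicted class
        w = probs[c]                            # confidence weight for this sample
        self.protos[c] += w * feat              # weighted accumulation, no cache
        self.weights[c] += w
        # Prototype-based logits; classes with no samples yet contribute zero.
        protos = l2norm(self.protos)            # zero rows stay (near) zero
        proto_logits = protos @ feat
        return zs_logits + alpha * proto_logits  # fuse zero-shot and prototype scores
```

The key contrast with a cache-based design such as TDA is that prediction cost here is a single matrix-vector product over `C` prototypes, independent of how many test samples have been seen.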
Source: arXiv: 2604.21360