arXiv submission date: 2026-03-25
📄 Abstract - Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification

Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
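The two key ideas in the abstract (mixing text and image prototypes as a shrinkage-style estimator, and projecting image prototypes onto the principal directions of the text embedding space) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dimensions, the mixing coefficient `alpha`, the number of principal directions `r`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (illustrative sizes, not from the paper):
C, D, K = 5, 64, 4  # classes, embedding dim, shots per class

# Unit-norm text embeddings, one per class.
text_protos = rng.normal(size=(C, D))
text_protos /= np.linalg.norm(text_protos, axis=1, keepdims=True)

# Few-shot image features: class signal plus instance-specific noise
# (standing in for background/context information).
image_feats = text_protos[:, None, :] + 0.05 * rng.normal(size=(C, K, D))
image_protos = image_feats.mean(axis=1)
image_protos /= np.linalg.norm(image_protos, axis=1, keepdims=True)

# 1) Project image prototypes onto the top-r principal directions of the
#    text embedding space (right singular vectors of the text matrix).
r = 3  # assumed hyperparameter
_, _, Vt = np.linalg.svd(text_protos, full_matrices=False)
P = Vt[:r].T @ Vt[:r]               # orthogonal projector onto text subspace
aligned_image_protos = image_protos @ P

# 2) Mix text and text-aligned image prototypes; alpha acts like a
#    shrinkage coefficient pulling noisy image prototypes toward text.
alpha = 0.5
mixed = alpha * text_protos + (1 - alpha) * aligned_image_protos
mixed /= np.linalg.norm(mixed, axis=1, keepdims=True)

# Classify a query image feature by cosine similarity to mixed prototypes.
query = image_feats[2, 0] / np.linalg.norm(image_feats[2, 0])
pred = int(np.argmax(mixed @ query))
```

The projector `P` discards image-space directions that carry no semantic (text) information, which is the intuition behind filtering out instance-specific background noise before mixing.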

Top-level tags: computer vision, multi-modal, model evaluation
Detailed tags: few-shot classification, vision-language models, prototype mixing, cross-modal alignment, CLIP

Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification


1️⃣ One-Sentence Summary

This paper proposes a method that improves CLIP's few-shot image classification performance without any additional training: by mixing text and image information and optimizing their cross-modal alignment, it reduces interference from image background noise and achieves better results across several standard benchmarks.

Source: arXiv:2603.24528