Abstract - Rethinking Prototype-based Similarity Learning for Few-Shot Object Detection
Few-shot object detection aims to detect novel object categories from only a few labeled examples, avoiding costly large-scale annotation. Recent prototype-based similarity learning approaches enable training-free adaptation by matching query features with class prototypes. However, they suffer from two fundamental limitations: (i) class confusion arising from inter-class similarity margin collapse, and (ii) insufficient visual cues for precise localization, as similarity scores capture only class-level semantic affinity while providing limited spatial information. To address these issues, we introduce two complementary components. Text-Anchored Semantic Mask (TSMa) leverages class-level text features as semantic anchors to identify semantically aligned channels through channel-wise interaction between visual and text features. By suppressing style-induced spurious responses and emphasizing class-intrinsic signals, TSMa enlarges inter-class similarity margins and mitigates class confusion. We further propose Stage-Aligned Hierarchical Autoregressive Regression (SHARe), which reformulates localization as a hierarchical autoregressive process that progressively refines bounding boxes across multiple stages. SHARe leverages the layer-wise characteristics of ViT representations by aligning feature abstraction levels with regression stages: deeper layers guide early coarse localization, while shallower layers rich in edge and texture cues refine spatial details in later stages. Experiments on COCO demonstrate a new state of the art, outperforming the previous best by +10.1 nAP, with extensive analysis validating each component. The code is available at this https URL.
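The prototype matching and channel masking described in the abstract can be illustrated with a small toy example. This is only a sketch under stated assumptions, not the paper's implementation. The function names (`mean_prototype`, `shared_text_mask`, `similarities`), the element-wise product used as the visual-text channel interaction, and the top-k binary mask are all hypothetical choices made for illustration. The abstract does not give the actual TSMa formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_prototype(support_feats):
    """Average the support features of one class into a single prototype."""
    n = len(support_feats)
    return [sum(col) / n for col in zip(*support_feats)]


def shared_text_mask(prototypes, text_feats, keep):
    """Hypothetical mask: for each class, keep the `keep` channels where the
    visual prototype and the text feature agree most (largest element-wise
    product), then take the union over classes."""
    dim = len(next(iter(prototypes.values())))
    mask = [0.0] * dim
    for name, proto in prototypes.items():
        scores = [p * t for p, t in zip(proto, text_feats[name])]
        top = sorted(range(dim), key=lambda i: scores[i], reverse=True)[:keep]
        for i in top:
            mask[i] = 1.0
    return mask


def similarities(query, prototypes, mask=None):
    """Cosine similarity of the query to every class prototype,
    optionally after zeroing the channels where mask is 0."""
    out = {}
    for name, proto in prototypes.items():
        m = mask if mask is not None else [1.0] * len(proto)
        q = [a * b for a, b in zip(query, m)]
        p = [a * b for a, b in zip(proto, m)]
        out[name] = cosine(q, p)
    return out


# Toy data: channels 0 and 1 are class-specific, channel 2 is a shared
# "style" channel that inflates similarity between the two classes.
prototypes = {
    "cat": mean_prototype([[1.0, 0.0, 2.0], [1.0, 0.0, 2.0]]),
    "dog": mean_prototype([[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]),
}
text_feats = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
query = [0.8, 0.2, 2.0]  # a cat-like query

plain = similarities(query, prototypes)
mask = shared_text_mask(prototypes, text_feats, keep=1)
masked = similarities(query, prototypes, mask)

print("margin without mask:", plain["cat"] - plain["dog"])
print("margin with mask:   ", masked["cat"] - masked["dog"])
```

In this toy setup the shared style channel dominates the unmasked cosine similarity, so the two classes score almost the same. Zeroing that channel widens the gap between the correct and incorrect class, which is the effect the abstract attributes to enlarging inter-class similarity margins.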
Rethinking Prototype-based Similarity Learning for Few-Shot Object Detection
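1️⃣ One-Sentence Summary
This paper targets prototype-based similarity learning for few-shot object detection and proposes two new components: Text-Anchored Semantic Mask (TSMa), which uses text features to guide visual features and thereby resolves the class confusion caused by excessively high inter-class similarity, and Stage-Aligned Hierarchical Autoregressive Regression (SHARe), which improves localization accuracy by refining bounding boxes progressively across hierarchical stages. Together they achieve leading performance on the COCO dataset.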
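The stage-wise box refinement attributed to SHARe can be sketched as a loop in which each stage predicts a correction conditioned on the box from the previous stage. This is a minimal illustration under assumptions, not the paper's method. The delta parameterization is the standard center/size offset encoding used by R-CNN-style detectors. The `refine` loop, the `make_toy_regressor` helper, and the placeholder feature names are hypothetical. A real model would use learned regressors that read ViT features, ordered from deep to shallow layers.

```python
import math


def apply_delta(box, delta):
    """Apply (dx, dy, dw, dh) offsets to a (cx, cy, w, h) box using the
    standard R-CNN-style parameterization."""
    cx, cy, w, h = box
    dx, dy, dw, dh = delta
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def refine(initial_box, stage_features, regressors):
    """Autoregressive refinement: every stage sees the box produced by the
    previous stage together with its own feature level."""
    box = initial_box
    history = [box]
    for feat, regressor in zip(stage_features, regressors):
        box = apply_delta(box, regressor(feat, box))
        history.append(box)
    return history


def make_toy_regressor(target, gain):
    """Hypothetical stand-in for a learned regressor: it ignores the feature
    and moves a fraction `gain` of the way toward a known target box."""
    def regressor(_feat, box):
        cx, cy, w, h = box
        tcx, tcy, tw, th = target
        return (
            gain * (tcx - cx) / w,
            gain * (tcy - cy) / h,
            gain * math.log(tw / w),
            gain * math.log(th / h),
        )
    return regressor


target = (60.0, 40.0, 80.0, 120.0)
# Placeholder feature labels, ordered deep -> shallow as in the description.
stage_features = ["deep_layer", "middle_layer", "shallow_layer"]
regressors = [make_toy_regressor(target, 0.5) for _ in stage_features]

history = refine((50.0, 50.0, 100.0, 100.0), stage_features, regressors)
center_errors = [abs(b[0] - target[0]) + abs(b[1] - target[1]) for b in history]
print(center_errors)
```

Because each stage starts from the previous stage's output, later stages only need to predict small residual corrections. This matches the coarse-to-fine behavior described above.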