
arXiv submission date: 2026-03-03
📄 Abstract - HDINO: A Concise and Efficient Open-Vocabulary Detector

Despite growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty-Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or use of grounding data, surpassing Grounding DINO-T and T-Rex2 (trained on 5.4M and 6.5M images, respectively) by 0.8 mAP and 2.8 mAP. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at this https URL.
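The abstract describes the Difficulty-Weighted Classification Loss (DWCL) only at a high level: classification terms are re-weighted by initial detection difficulty so that hard examples contribute more. Below is a minimal, illustrative sketch of that general idea in plain Python; the function name, the focal-style weighting form `(1 - p_t)^gamma`, and all parameters are assumptions for illustration, not the paper's actual formulation.

```python
import math

def difficulty_weighted_loss(probs, labels, gamma=2.0):
    """Illustrative difficulty-weighted binary classification loss.

    probs  : predicted probabilities for the target class
    labels : 1 for positive matches, 0 for negatives
    gamma  : controls how strongly hard (low-confidence) examples are
             up-weighted; HDINO's exact weighting may differ.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, 1e-7), 1.0 - 1e-7)   # numerical safety
        pt = p if y == 1 else 1.0 - p       # probability of the true class
        weight = (1.0 - pt) ** gamma        # harder examples get larger weight
        total += -weight * math.log(pt)     # weighted cross-entropy term
    return total / len(probs)
```

With `gamma = 0` this reduces to plain cross-entropy; increasing `gamma` shifts the average loss toward confidently misclassified (hard) examples, which is the hard-example-mining effect the abstract attributes to DWCL.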

Top-level tags: computer vision, model training, multi-modal
Detailed tags: open-vocabulary detection, object detection, vision-language alignment, transformer, efficient training

HDINO: A Concise and Efficient Open-Vocabulary Detector


1️⃣ One-Sentence Summary

This paper proposes HDINO, a new object detector whose two-stage training strategy lets it efficiently recognize object categories unseen during training, without relying on manually curated fine-grained datasets or complex layer-wise cross-modal feature extraction, while achieving strong performance on multiple public benchmarks.

Source: arXiv 2603.02924