Streamlined Open-Vocabulary Human-Object Interaction Detection
1️⃣ One-Sentence Summary
This paper proposes a new framework named SL-HOI that cleverly exploits the different components of a single vision model, DINOv3, to efficiently detect both seen and unseen human-object interactions in images without fusing in an additional language model, achieving leading performance on multiple standard benchmarks.
Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output, we propose first feeding both the interaction queries and the backbone image tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. Code is available at this https URL.
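The key architectural idea in the abstract, passing both the learnable interaction queries and the backbone image tokens through the frozen vision head before cross-attention so their representations land in the same space, can be illustrated with a minimal PyTorch sketch. Everything here is illustrative: `FrozenBackbone`, `FrozenVisionHead`, the dimensions, and the query count are hypothetical stand-ins, not DINOv3's actual API or the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the frozen DINOv3 backbone (fine-grained features).
class FrozenBackbone(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):
        return self.proj(tokens)

# Hypothetical stand-in for the frozen text-aligned vision head, which maps
# vision tokens into the embedding space shared with text category embeddings.
class FrozenVisionHead(nn.Module):
    def __init__(self, dim=256, text_dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, text_dim)

    def forward(self, tokens):
        return self.proj(tokens)

class SLHOISketch(nn.Module):
    """Sketch of the streamlined design: frozen DINOv3 parts + a small
    set of learnable parameters (interaction queries, cross-attention)."""

    def __init__(self, num_queries=32, dim=256, text_dim=512):
        super().__init__()
        self.backbone = FrozenBackbone(dim)
        self.head = FrozenVisionHead(dim, text_dim)
        for p in self.backbone.parameters():
            p.requires_grad = False
        for p in self.head.parameters():
            p.requires_grad = False
        # Learnable parts only: interaction queries and one cross-attention
        # layer operating in the head's text-aligned output space.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads=8,
                                                batch_first=True)

    def forward(self, image_tokens):
        b = image_tokens.size(0)
        feats = self.backbone(image_tokens)          # backbone image tokens
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Bridge the representation gap: feed BOTH the interaction queries
        # and the backbone tokens through the same frozen vision head.
        joined = self.head(torch.cat([q, feats], dim=1))
        hq, hk = joined[:, :q.size(1)], joined[:, q.size(1):]
        # Cross-attention between head-projected queries and image tokens.
        out, _ = self.cross_attn(hq, hk, hk)
        return out                                   # text-aligned HOI embeddings
```

In this sketch, the returned query embeddings live in the same space as text category embeddings, so open-vocabulary classification would amount to cosine similarity against the embedded HOI category names; localization from the backbone features is omitted here.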
Source: arXiv:2603.27500