arXiv submission date: 2026-05-05
📄 Abstract - VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two categories, i.e., open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, enabling a shared retrieval-and-refinement mechanism that supports both open-vocabulary and open-ended detection. Zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Experiments with a stronger open-vocabulary detector (i.e., SAM3) also validate the generality of the proposed retrieval-and-refinement mechanism.
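The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how prototypes are scored or stored, so everything here is an assumption made for illustration: the `(category, embedding)` layout of the memory bank, the cosine-similarity ranking, and the names `cosine` and `retrieve_prototypes` are hypothetical and are not taken from the paper. The sketch covers only the lookup of prototypes for a candidate category, not the construction of sparse or dense priors or the prompt refinement.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_prototypes(query, memory_bank, category, k=2):
    """Return up to k stored embeddings of `category`, most similar to `query` first.

    memory_bank is a list of (category, embedding) pairs. This layout is a
    hypothetical stand-in for a non-parametric memory bank, not the paper's format.
    """
    candidates = [emb for cat, emb in memory_bank if cat == category]
    ranked = sorted(candidates, key=lambda emb: cosine(query, emb), reverse=True)
    return ranked[:k]


# Toy 3-dimensional embeddings standing in for real visual features.
memory_bank = [
    ("otter", [0.9, 0.1, 0.0]),
    ("otter", [0.0, 1.0, 0.0]),
    ("otter", [0.7, 0.3, 0.1]),
    ("kayak", [0.0, 0.0, 1.0]),
]
query = [1.0, 0.0, 0.0]

# Keeps the two "otter" prototypes closest to the query embedding.
top = retrieve_prototypes(query, memory_bank, "otter", k=2)
```

Because the bank is just stored examples, a lookup like this needs no retraining when new categories are added, which is one reason a non-parametric memory is attractive for rare categories.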

Top-level tags: computer vision, machine learning, object completion
Detailed tags: open-world detection, open-vocabulary detection, open-ended detection, visual memory, memory-guided refinement

VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection


1️⃣ One-sentence summary

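This paper proposes VL-SAM-v3, a framework that retrieves visual exemplars from an external memory, turns them into two kinds of visual priors (sparse and dense), and fuses these with the original detection prompts, so that in open-world settings (whether a category list is given or the categories are unknown) the model better recognizes objects that are rare, have ambiguous texture, or sit in cluttered backgrounds, with significant gains on the LVIS dataset.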

Source: arXiv: 2605.03456