arXiv submission date: 2026-03-03
📄 Abstract - 3D-DRES: Detailed 3D Referring Expression Segmentation

Current 3D visual grounding tasks perform detection or segmentation only at the sentence level, and thus critically fail to leverage the rich compositional contextual reasoning within natural language expressions. To address this limitation, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides phrase-to-3D-instance mappings, aiming to enhance fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer adopts a pioneering phrase-instance annotation paradigm in which each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a deliberately streamlined yet effective baseline architecture that supports dual-mode segmentation at both the sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.
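To make the phrase-instance annotation paradigm concrete, here is a minimal sketch of what one DetailRefer-style record might look like. All field names, the character-offset convention, and the example values are hypothetical illustrations based only on the abstract, not the dataset's actual schema.

```python
# Hypothetical sketch of a phrase-to-3D-instance annotation record.
# Field names and offset conventions are assumptions, not DetailRefer's schema.
from dataclasses import dataclass, field


@dataclass
class PhraseAnnotation:
    """One noun phrase in the description, mapped to 3D instances."""
    phrase: str                 # noun phrase as it appears in the text
    char_span: tuple[int, int]  # start/end character offsets in the description
    instance_ids: list[int]     # 3D instance IDs the phrase refers to


@dataclass
class DetailReferSample:
    """One description with sentence- and phrase-level targets."""
    scene_id: str
    description: str
    target_instance_id: int     # sentence-level referent, as in classic 3D-RES
    phrases: list[PhraseAnnotation] = field(default_factory=list)


# Example: every referenced noun phrase carries its own instance mapping,
# so a model can be supervised at both sentence and phrase granularity.
sample = DetailReferSample(
    scene_id="scene0000_00",
    description="the chair next to the wooden table by the window",
    target_instance_id=12,
    phrases=[
        PhraseAnnotation("the chair", (0, 9), [12]),
        PhraseAnnotation("the wooden table", (18, 34), [7]),
        PhraseAnnotation("the window", (38, 48), [3]),
    ],
)
```

A record like this would let a dual-mode baseline such as DetailBase be trained with a sentence-level loss on `target_instance_id` and a phrase-level loss over each `PhraseAnnotation`, which is how the abstract frames the two segmentation granularities.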

Top-level tags: computer vision, natural language processing, multi-modal
Detailed tags: 3d visual grounding, referring expression segmentation, vision-language understanding, dataset, instance segmentation

3D-DRES: Detailed 3D Referring Expression Segmentation


1️⃣ One-Sentence Summary

This paper proposes a new task, 3D-DRES, together with a companion dataset, DetailRefer, which achieve finer-grained 3D vision-language understanding than existing methods by precisely mapping each noun phrase in a natural-language description to its corresponding object in the 3D scene; it also shows that this approach not only improves phrase-level segmentation accuracy but, unexpectedly, also improves performance on traditional sentence-level 3D referring expression segmentation.

Source: arXiv 2603.02896