arXiv submission date: 2026-01-14
📄 Abstract - OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that correspond to different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel then builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, ours is training-free and does not rely on embeddings from a CLIP/BERT text encoder; instead, it performs text-to-text search directly with MLLMs. Extensive experiments show that our method outperforms recent studies, particularly on complex referring expression segmentation (RES) tasks. The code will be released.
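The abstract's pipeline (caption each voxel group, then answer queries by text-to-text search over the captions) can be sketched roughly as below. This is a minimal illustration under stated assumptions: the captions are placeholders for what an MLLM would produce, and the word-overlap matcher is a simple stand-in for the paper's MLLM-driven text-to-text search, not the actual OpenVoxel implementation.

```python
# Hypothetical sketch of a scene-map query flow in the spirit of OpenVoxel.
# Captions and the overlap-based matcher are illustrative stand-ins for the
# paper's MLLM components.

def build_scene_map(group_captions):
    """Map each voxel-group id to its caption (as an MLLM would generate)."""
    return dict(group_captions)

def text_to_text_search(scene_map, query):
    """Return the group id whose caption best matches the query text.

    Here 'best' is naive word overlap; in the paper this matching is
    performed by an MLLM rather than lexical comparison.
    """
    query_words = set(query.lower().split())
    best_id, best_score = None, -1
    for group_id, caption in scene_map.items():
        score = len(query_words & set(caption.lower().split()))
        if score > best_score:
            best_id, best_score = group_id, score
    return best_id

# Usage: three voxel groups with placeholder captions.
scene_map = build_scene_map([
    (0, "a wooden chair near the window"),
    (1, "a red mug on the kitchen table"),
    (2, "a tall bookshelf filled with novels"),
])
print(text_to_text_search(scene_map, "the mug on the table"))  # -> 1
```

Because retrieval operates on captions rather than CLIP/BERT embeddings, a referring expression can be resolved purely in text space, which is what makes the approach training-free.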

Top-level tags: computer vision, multi-modal, 3d scene understanding
Detailed tags: open-vocabulary, voxel grouping, vision-language models, training-free, referring expression segmentation

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding


1️⃣ One-sentence summary

This paper proposes a training-free algorithm named OpenVoxel that automatically aggregates the sparse voxels of a 3D scene into meaningful object groups and uses large language models to generate a textual description for each group, enabling open-vocabulary understanding and segmentation of complex 3D scenes without any additional training.

From arXiv: 2601.09575