arXiv submission date: 2026-02-25
📄 Abstract - Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction

Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65$\times$ faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at this https URL.

Top-level tags: computer vision, systems, model training
Detailed tags: 3D scene understanding, occupancy prediction, visual geometry priors, Gaussian primitives, monocular depth

Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction


1️⃣ One-Sentence Summary

This paper proposes a new framework called GPOcc, which converts the surface information provided by advanced visual geometry models into a probabilistic volumetric representation of the scene interior, enabling more efficient and accurate prediction of full 3D occupancy (i.e., which regions contain objects and which are free space) from a single image or a streaming video.
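The core idea of turning visible surfaces into volumetric samples can be sketched roughly as follows: backproject a depth map to surface points, then push each point further along its camera ray to sample the space "behind" the surface. This is a minimal illustrative sketch, not the authors' implementation; the function names, step size, and sample count are all assumptions.

```python
# Hypothetical sketch of ray-inward sampling: surface points from a depth
# map are extended along their camera rays to yield volumetric samples.
# All names and parameters are illustrative, not from the paper.
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) to 3D points in camera coordinates
    using the pinhole model with intrinsics K."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3)

def extend_inward(depth, K, n_samples=4, step=0.2):
    """For each pixel, sample points at depths d, d+step, ...,
    d+(n_samples-1)*step along the camera ray, i.e. inside the scene
    beyond the visible surface."""
    offsets = depth[None] + step * np.arange(n_samples)[:, None, None]
    pts = np.stack([backproject(d, K) for d in offsets])  # (n, H, W, 3)
    return pts.reshape(-1, 3)

# Toy example: a flat 2 m depth map with simple intrinsics.
K = np.array([[500.0, 0.0, 32.0],
              [0.0, 500.0, 24.0],
              [0.0,   0.0,  1.0]])
depth = np.full((48, 64), 2.0)
samples = extend_inward(depth, K)
print(samples.shape)  # (4 * 48 * 64, 3) candidate volumetric samples
```

In the paper, each such sample would then be associated with a Gaussian primitive for probabilistic occupancy inference; here the sketch stops at generating the sample positions.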

Source: arXiv: 2602.21552