arXiv submission date: 2026-02-26
📄 Abstract - SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
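The core idea in the abstract, mapping each point-cloud position into spherical coordinates and rotating feature channels by those coordinates in RoPE style, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`to_spherical`, `rope_rotate`), the split of the feature dimension into three groups (one per spherical coordinate), and the base frequency of 10000 are all assumptions.

```python
import numpy as np

def to_spherical(xyz):
    # Convert Cartesian point coordinates (N, 3) to spherical (r, theta, phi).
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    # Polar angle; clip guards against tiny floating-point overshoot of ±1.
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))
    phi = np.arctan2(y, x)  # azimuthal angle
    return np.stack([r, theta, phi], axis=-1)

def rope_rotate(x, pos, base=10000.0):
    # Standard RoPE-style rotation of features x (N, D) by scalar positions
    # pos (N,): pair channel i with channel i + D/2 and rotate each pair.
    d = x.shape[-1] // 2
    freqs = base ** (-np.arange(d) / d)     # (d,) geometric frequency ladder
    ang = pos[:, None] * freqs[None, :]     # (N, d) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :d], x[:, d:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Assumed usage: rotate three feature groups by r, theta, phi respectively,
# so both radial distance and the two directional angles enter the embedding.
pts = np.random.default_rng(0).standard_normal((5, 3))
feat = np.random.default_rng(1).standard_normal((5, 24))
sph = to_spherical(pts)
groups = [rope_rotate(feat[:, i * 8:(i + 1) * 8], sph[:, i]) for i in range(3)]
embedded = np.concatenate(groups, axis=-1)  # (5, 24)
```

Because each channel pair undergoes a pure rotation, the embedding preserves feature norms, the same property that makes vanilla RoPE attractive for attention.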

Top-level tags: multi-modal, model training, natural language processing
Detailed tags: positional embedding, 3d vision-language models, spherical coordinates, spatial perception, point-cloud

SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs


1️⃣ One-sentence summary

This paper proposes a method called SoPE that improves the positional encoding of 3D multimodal models by mapping 3D point-cloud data into a spherical coordinate system, enabling the model to better understand and represent the spatial positions and orientations of objects and thereby improving performance on 3D scene-understanding tasks.
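The abstract also mentions a multi-scale frequency mixing strategy for fusing information across frequency domains. One plausible reading, averaging RoPE-style rotations computed at several base frequencies so that both coarse and fine positional scales contribute, is sketched below. The function names, the choice of bases, and the uniform-average fusion rule are all assumptions; the paper's actual mixing rule is not specified here.

```python
import numpy as np

def rope_rotate(x, pos, base):
    # Standard RoPE-style rotation of features x (N, D) by positions pos (N,).
    d = x.shape[-1] // 2
    freqs = base ** (-np.arange(d) / d)
    ang = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :d], x[:, d:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def multiscale_mix(x, pos, bases=(100.0, 10000.0)):
    # Hypothetical fusion: average rotations taken at a coarse base (short
    # wavelengths, fine spatial detail) and a fine base (long wavelengths,
    # global structure). A learned weighting would be a natural variant.
    return np.mean([rope_rotate(x, pos, b) for b in bases], axis=0)
```

With a single base, this degenerates to plain RoPE, so the mixing is a strict generalization of the vanilla formulation.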

Source: arXiv 2602.22716