URoPE: Universal Relative Position Embedding across Geometric Spaces
1️⃣ One-sentence summary
This paper proposes URoPE, a new positional encoding method that lets Transformer models flexibly reason about relative positions of objects across 2D images, 3D space, and different camera views, significantly improving performance on computer vision tasks such as 3D object detection and depth estimation, without adding any parameters.
Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of global coordinate systems, while remaining fully compatible with existing RoPE-optimized attention kernels. We evaluate URoPE as a plug-in positional encoding for transformer architectures across a diverse set of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation, covering 2D-2D, 2D-3D, and temporal scenarios. Experiments show that URoPE consistently improves the performance of transformer-based models across all tasks, demonstrating its effectiveness and generality for geometric reasoning. Our project website is: this https URL.
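The abstract describes the core mechanism: back-project each key/value patch along its camera ray at predefined depth anchors, project those 3D points into the query image plane, and then apply standard 2D RoPE using the projected pixel coordinates. Below is a minimal NumPy sketch of that geometric pipeline, assuming pinhole intrinsics `K`, a relative pose `(R, t)` from the key camera to the query camera, and a simple axial 2D RoPE; the function names, depth anchors, and frequency schedule are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def backproject_ray_points(uv, K, depths):
    """Sample 3D points along the camera ray through pixel uv at given depth anchors."""
    u, v = uv
    # Ray direction in camera coordinates (unnormalized, z = 1)
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return np.outer(depths, ray)  # (D, 3): one point per depth anchor

def project_to_query(points_key, R, t, K_q):
    """Transform key-camera points into the query camera frame and project to pixels."""
    pts_q = points_key @ R.T + t            # (D, 3) in query camera frame
    uvw = pts_q @ K_q.T
    return uvw[:, :2] / uvw[:, 2:3]         # (D, 2) projected pixel coordinates

def rope_2d(x, uv, base=100.0):
    """Standard axial 2D RoPE: rotate feature pairs by angles derived from (u, v)."""
    d = x.shape[-1]
    assert d % 4 == 0
    half = d // 2
    # Half the rotated pairs encode u, the other half encode v (an assumed split)
    freqs = base ** (-np.arange(half // 2) / (half // 2))
    angles = np.concatenate([uv[0] * freqs, uv[1] * freqs])  # (d/2,)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Note the sanity check implied by the construction: with an identity relative pose and identical intrinsics, every depth anchor reprojects to the original pixel, so the method degenerates to ordinary 2D RoPE within a single view. Because the rotation is applied per key/value feature, the scheme stays compatible with existing RoPE-optimized attention kernels.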
Source: arXiv: 2604.18747