Diversity-aware View Partitioning for Scalable VGGT
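1️⃣ One-sentence summary
This paper proposes a training-free, plug-and-play method that partitions image views into diversity-balanced chunks according to visual dissimilarity and spatial dispersion. It addresses the performance degradation that VGGT suffers from redundant information when processing many views, improving 3D reconstruction and pose estimation accuracy while reducing compute cost.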
Geometry transformers such as VGGT achieve strong performance by jointly reasoning over multiple views with global attention. However, scaling them to large view collections remains challenging due to the quadratic cost of attention. Moreover, our empirical analysis reveals that the reconstruction quality in VGGT is sensitive to the distribution of viewpoints. Simply increasing the number of views without sufficient viewpoint diversity can even degrade performance, as redundant views introduce highly similar tokens that dilute informative geometric signals in the attention mechanism. Motivated by this observation, we propose a training-free and plug-and-play VGGT inference framework that organizes views into diversity-aware balanced chunks. The chunks are constructed through combinatorial graph partitioning over visual dissimilarity and spatial dispersion. This view organization allows the transformer to focus attention on geometrically informative views while reducing redundant attention interactions. To estimate spatial dispersion without full pose estimation, we approximate spatial relationships via a soft pose propagation strategy based on visual similarity from a small set of seed frames. Extensive experiments demonstrate improved performance in camera pose estimation, multi-view depth prediction, and 3D reconstruction while reducing memory usage and inference latency. Our framework also complements existing VGGT variants, enabling scalable multi-view reconstruction without sacrificing geometric fidelity.
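To make the chunking idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the graph partitioning algorithm or the exact propagation rule, so a simple greedy max-min heuristic and a softmax-weighted average over seed positions stand in for them. The function names (`soft_propagate_positions`, `diversity_score`, `balanced_diverse_chunks`) and the parameters `temperature` and `spatial_weight` are hypothetical, chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def soft_propagate_positions(feats, seed_idx, seed_pos, temperature=0.1):
    """Assumed stand-in for soft pose propagation: each view receives a
    softmax-weighted average of the seed positions, weighted by its visual
    similarity to each seed frame."""
    positions = []
    for f in feats:
        logits = [cosine(f, feats[s]) / temperature for s in seed_idx]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        positions.append([
            sum(w * p[d] for w, p in zip(weights, seed_pos)) / total
            for d in range(len(seed_pos[0]))
        ])
    return positions


def diversity_score(feats, positions, spatial_weight=0.5):
    """Pairwise score: visual dissimilarity plus normalized spatial distance."""
    n = len(feats)
    dist = [[math.dist(positions[i], positions[j]) for j in range(n)]
            for i in range(n)]
    scale = max(max(row) for row in dist) or 1.0  # avoid division by zero
    return [[(1.0 - cosine(feats[i], feats[j])) + spatial_weight * dist[i][j] / scale
             for j in range(n)] for i in range(n)]


def balanced_diverse_chunks(score, num_chunks):
    """Greedy heuristic (an assumption, not the paper's partitioner): place
    each view into one of the currently smallest chunks, choosing the chunk
    whose closest member is most dissimilar to the view."""
    chunks = [[] for _ in range(num_chunks)]
    # Visit the most distinctive views first (largest total dissimilarity).
    order = sorted(range(len(score)), key=lambda v: -sum(score[v]))
    for v in order:
        smallest = min(len(c) for c in chunks)
        candidates = [c for c in chunks if len(c) == smallest]
        best = max(
            candidates,
            key=lambda c: min((score[v][u] for u in c), default=math.inf),
        )
        best.append(v)
    return chunks


# Toy input: views 0-1 are near-duplicates, and so are views 2-3.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
positions = soft_propagate_positions(
    feats, seed_idx=[0, 2], seed_pos=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
)
chunks = balanced_diverse_chunks(diversity_score(feats, positions), num_chunks=2)
print(chunks)
```

On this toy input each chunk ends up pairing one view from each near-duplicate group, which is the behavior the abstract motivates: redundant views are spread across chunks instead of competing within the same attention window. Restricting assignment to the currently smallest chunks keeps chunk sizes within one view of each other.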
Source: arXiv: 2607.01885