arXiv submission date: 2026-06-16
📄 Abstract - RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.
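The abstract names three protections in Selectively Protected K/V Downsampling: a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. The sketch below is one possible reading of that description, not the authors' implementation. The function name `select_kv_indices`, the token layout (special tokens first, then row-major patch tokens), the default of five special tokens, and the rule that derives each frame's grid offset from its index are all assumptions made for illustration.

```python
def select_kv_indices(num_frames, grid_h, grid_w, stride, num_special=5, ref_frame=0):
    """Return, for each frame, the token indices kept as keys/values.

    Assumed layout per frame (hypothetical): the first `num_special` tokens are
    camera/register tokens, followed by grid_h * grid_w patch tokens in
    row-major order.
    """
    kept = []
    for f in range(num_frames):
        # Camera/register tokens are never dropped.
        special = list(range(num_special))
        if f == ref_frame:
            # Reference-frame anchor: keep every patch token.
            cells = [(r, c) for r in range(grid_h) for c in range(grid_w)]
        else:
            # Phase shift: each frame samples the strided grid at a different
            # offset, so the frames together cover more spatial positions.
            dr, dc = divmod(f % (stride * stride), stride)
            cells = [(r, c)
                     for r in range(dr, grid_h, stride)
                     for c in range(dc, grid_w, stride)]
        kept.append(special + [num_special + r * grid_w + c for r, c in cells])
    return kept


# Toy example: 5 frames, a 4x4 patch grid, stride 2.
kept = select_kv_indices(num_frames=5, grid_h=4, grid_w=4, stride=2)
print([len(k) for k in kept])  # tokens kept per frame
```

In this toy setting the reference frame keeps all of its tokens, each other frame keeps one quarter of its patch tokens, and the four non-reference frames sample four different offsets, so their kept patches jointly cover every grid position. How the paper actually chooses strides per layer (the U-shaped schedule) is not specified in the abstract and is left out here.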

Top-level tags: computer vision, model training, systems
Detailed tags: 3d scene recovery, cross-frame attention, redundancy removal, transformer acceleration, multi-view geometry

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer


1️⃣ One-sentence summary

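This paper proposes RegimeVGGT, a method that requires no additional training. It analyzes the roles of different layers in the VGGT model (shallow layers lack cross-view structure, middle layers handle alignment, and deep layers are redundant for geometry but important for pose) and compresses each layer non-uniformly according to its role, achieving a 6.7x speedup while maintaining reconstruction quality.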

Source: arXiv: 2606.18439