Abstract - M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM
Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
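The ATE RMSE figure quoted above is the standard Absolute Trajectory Error metric: the root-mean-square of per-frame camera-position error after rigidly aligning the estimated trajectory to ground truth. A minimal sketch of how it is typically computed (the `ate_rmse` helper and the Umeyama/Horn-style SE(3) alignment are illustrative conventions, not code from the paper):

```python
import numpy as np

def ate_rmse(gt, est):
    """ATE RMSE: RMS position error after a rigid SE(3) alignment.

    gt, est: (N, 3) arrays of corresponding camera positions.
    Uses the closed-form Umeyama/Horn solution (rotation + translation,
    no scale), a common convention in monocular SLAM evaluation.
    """
    gt = np.asarray(gt, dtype=float)
    est = np.asarray(est, dtype=float)
    mu_gt, mu_est = gt.mean(axis=0), est.mean(axis=0)
    # Cross-covariance between the centered trajectories
    H = (est - mu_est).T @ (gt - mu_gt)
    U, _, Vt = np.linalg.svd(H)
    # Sign correction keeps R a proper rotation (det(R) = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_gt - R @ mu_est
    aligned = est @ R.T + t
    err = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```

A trajectory that differs from ground truth only by a rigid motion scores (numerically) zero; any residual after the best rigid fit is what the metric reports.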
M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM
1️⃣ One-Sentence Summary
This work proposes a new method, M^3, which adds a dedicated matching head to a multi-view foundation model to obtain finer-grained pixel correspondences and integrates it into a robust monocular SLAM system, significantly improving both 3D scene reconstruction quality and camera pose estimation accuracy from plain monocular video alone.
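The reconstruction-quality gain cited above (2.11 dB in PSNR) uses the standard Peak Signal-to-Noise Ratio between rendered and ground-truth images, defined as 10·log10(MAX² / MSE). A minimal sketch (the `psnr` helper name is illustrative; images are assumed normalized to [0, 1]):

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between a rendered image and a reference.

    img, ref: arrays of the same shape, values in [0, max_val].
    Higher is better; identical images give infinity.
    """
    mse = np.mean((np.asarray(img, dtype=float) - np.asarray(ref, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Because PSNR is logarithmic, a 2.11 dB improvement corresponds to roughly a 1.6× reduction in mean squared rendering error.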