arXiv submission date: 2026-03-03
📄 Abstract - LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or by the limited effective memory of recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory, which anchors the global coordinate frame and prevents scale drift, with a non-parametric Sliding Window Attention (SWA) mechanism that preserves uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods (reducing ATE on KITTI by over 74%) and achieves robust, globally consistent reconstruction over unprecedented horizons.
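The abstract's hybrid-memory idea (chunked processing, a parametric TTT anchor against scale drift, plus an uncompressed sliding window for boundary alignment) can be sketched as a toy loop. Everything below is illustrative: the function name, the mean-based "alignment", and the one-step update rule are assumptions standing in for the paper's actual learned modules, not its equations.

```python
import numpy as np

def process_long_sequence(frames, chunk_size=8, window=4, lr=0.1):
    """Toy sketch of LoGeR-style chunked inference with hybrid memory.

    - `state` plays the role of the parametric TTT memory: a small vector
      updated by a gradient-like step after each chunk, anchoring a global
      reference and damping drift across chunks.
    - `window_buf` plays the role of the non-parametric sliding-window
      context: the last `window` frames kept uncompressed so the next chunk
      can align against them directly.
    """
    d = frames.shape[1]
    state = np.zeros(d)            # parametric memory (global anchor)
    window_buf = np.empty((0, d))  # non-parametric memory (recent frames)
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # Context = uncompressed recent frames + current chunk.
        context = np.vstack([window_buf, chunk])
        # Intra-chunk "reasoning" (toy): re-center the chunk onto the
        # global anchor so successive chunks share one coordinate frame.
        aligned = chunk - chunk.mean(axis=0) + state
        outputs.append(aligned)
        # TTT-style update: one step pulling the anchor toward the running
        # context statistics, so the anchor adapts but changes slowly.
        state = state - lr * (state - context.mean(axis=0))
        # Slide the uncompressed window forward.
        window_buf = chunk[-window:]
    return np.vstack(outputs)
```

Note how this decouples training length from inference length: the loop carries only a fixed-size `state` and `window_buf`, so it runs unchanged on sequences far longer than any it was tuned on, which mirrors the paper's 128-frame training versus thousands-of-frames inference.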

Top-level tags: computer vision, model training, systems
Detailed tags: 3d reconstruction, long-context video, memory architecture, geometric foundation models, video processing

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory


1️⃣ One-sentence summary

This paper presents LoGeR, a new architecture whose hybrid memory module efficiently and accurately extends short-video 3D reconstruction to extremely long sequences of up to thousands of frames, addressing the scale-drift and chunk-boundary alignment problems of long-horizon reconstruction and substantially outperforming existing methods.

Source: arXiv 2603.03269