arXiv submission date: 2026-01-07
📄 Abstract - Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction

We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
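To make the adapter-and-alignment idea in the abstract more concrete, below is a minimal PyTorch sketch of one way such a design could look: a small adapter maps frozen VGGT tokens to geometric latents, and a regularizer pulls them toward the (frozen) appearance latents of a pre-trained video diffusion model. The class `GeomAdapter`, the tensor shapes, the MSE regularizer, and all names are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeomAdapter(nn.Module):
    """Hypothetical adapter: maps tokens from a frozen reconstruction model
    (e.g. VGGT) to geometric latents with the same shape as the video
    diffusion model's appearance latents."""
    def __init__(self, token_dim: int = 1024, latent_dim: int = 16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(token_dim),
            nn.Linear(token_dim, 4 * latent_dim),
            nn.GELU(),
            nn.Linear(4 * latent_dim, latent_dim),
        )

    def forward(self, vggt_tokens: torch.Tensor) -> torch.Tensor:
        # vggt_tokens: (batch, num_tokens, token_dim), assumed frozen upstream
        return self.proj(vggt_tokens)

def alignment_loss(geom_latents: torch.Tensor,
                   appearance_latents: torch.Tensor) -> torch.Tensor:
    """One plausible alignment regularizer: match geometric latents to the
    frozen appearance latents so the two can be generated jointly."""
    return F.mse_loss(geom_latents, appearance_latents.detach())

# Toy usage with random tensors standing in for real model outputs.
adapter = GeomAdapter()
tokens = torch.randn(2, 256, 1024)     # stand-in for VGGT tokens
appearance = torch.randn(2, 256, 16)   # stand-in for appearance latents
geom = adapter(tokens)
loss = alignment_loss(geom, appearance)
loss.backward()
```

Once the two latent streams are aligned in this way, the paper's joint generation step can, in principle, denoise them together and decode RGB frames from the appearance branch and geometry (poses, depth, points) from the geometric branch.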

Top-level tags: computer vision, multi-modal, model training
Detailed tags: 3D scene generation, video diffusion, geometric latents, reconstruction models, point clouds

Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction


1️⃣ One-Sentence Summary

This paper introduces Gen3R, a new method that couples a state-of-the-art 3D reconstruction model with a video generation model. From one or more input images, it generates, in a single pass, a high-quality video of a 3D scene together with its corresponding geometry (such as depth maps and point clouds), and it achieves leading results in experiments.

Source: arXiv: 2601.04090