
arXiv submission date: 2025-12-04
📄 Abstract - Inferring Compositional 4D Scenes without Ever Seeing One

Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structure, composition, and spatio-temporal configuration in the wild, though extremely interesting, is equally hard. Existing works therefore often focus on one object at a time, while relying on category-specific parametric shape models for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single-object supervision. We achieve this through carefully designed training of spatial and temporal attention on 2D video input. Training is disentangled into learning from object compositions on the one hand and single-object dynamics throughout the video on the other, thus completely avoiding reliance on compositional 4D training data. At inference time, our proposed attention-mixing mechanism combines these independently learned attentions without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, despite being purely data-driven, COM4D achieves state-of-the-art results on the existing separate problems of 4D object reconstruction and composed 3D reconstruction.
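To make the alternation between spatial and temporal reasoning concrete, below is a minimal sketch of what an attention-mixing loop could look like. It is an illustrative reconstruction from the abstract alone, not the authors' implementation: the `AttentionMixer` module, the per-object-per-frame token layout, and the number of alternation steps are all assumptions. The key idea it demonstrates is that two independently trained attention blocks, one over objects (composition) and one over frames (dynamics), can be interleaved at inference time.

```python
# Hypothetical sketch of alternating spatial/temporal attention mixing.
# Module names, tensor shapes, and step count are illustrative assumptions;
# the paper only states that the two attentions are trained separately and
# combined at inference by alternating spatial and temporal reasoning.
import torch
import torch.nn as nn

class AttentionMixer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, steps: int = 4):
        super().__init__()
        # Two independently trained attention blocks: one attending across
        # objects (spatial composition), one across frames (object dynamics).
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.steps = steps

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (num_objects, num_frames, dim) latent per object per frame.
        for _ in range(self.steps):
            # Spatial step: within each frame, objects attend to each other.
            x = tokens.permute(1, 0, 2)           # (frames, objects, dim)
            x, _ = self.spatial(x, x, x)
            tokens = tokens + x.permute(1, 0, 2)  # residual update
            # Temporal step: each object attends across its own frames.
            y, _ = self.temporal(tokens, tokens, tokens)
            tokens = tokens + y
        return tokens

# Usage: 3 objects tracked over 16 frames with 256-dim latents.
mixer = AttentionMixer()
out = mixer(torch.randn(3, 16, 256))
print(out.shape)  # torch.Size([3, 16, 256])
```

Note the design point the sketch encodes: neither attention block ever sees a compositional 4D example during training; only the inference-time loop combines multi-object (spatial) and multi-frame (temporal) reasoning.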

Top-level tags: computer vision, multi-modal, model training
Detailed tags: 4D scene reconstruction, compositional reasoning, attention mechanisms, monocular video, dynamic objects

Inferring Compositional 4D Scenes without Ever Seeing One


1️⃣ One-Sentence Summary

This paper proposes a new method called COM4D that, from ordinary 2D video alone, automatically reconstructs complete 4D (3D space + time) scenes containing multiple static and dynamic objects with consistent spatio-temporal relationships, without relying on any existing 4D scene data for training.


Source: arXiv:2512.05272