arXiv submission date: 2025-12-23
📄 Abstract - How Much 3D Do Video Foundation Models Encode?

Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
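The "shallow read-out" protocol described in the abstract can be pictured as training a tiny prediction head on top of frozen VidFM features, so that any measured 3D accuracy reflects the features rather than the probe. The sketch below is a minimal PyTorch illustration under assumptions: the `backbone` call returning per-frame feature maps is a hypothetical API, and per-pixel depth stands in for one of the probed 3D properties; the paper's actual probe architectures and tasks may differ.

```python
# Minimal sketch of a shallow read-out probe on a frozen video backbone.
# `backbone` and its output shape are hypothetical stand-ins, not the
# paper's actual models or probe design.
import torch
import torch.nn as nn

class DepthReadout(nn.Module):
    """Shallow probe: predicts per-pixel depth from frozen VidFM features."""
    def __init__(self, feat_dim: int):
        super().__init__()
        # A single 1x1 conv keeps the probe shallow, so measured 3D
        # accuracy is attributable to the frozen features, not the head.
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) feature map for one video frame
        return self.head(feats)

def probe_step(backbone: nn.Module, probe: DepthReadout,
               video: torch.Tensor, depth_gt: torch.Tensor,
               opt: torch.optim.Optimizer) -> float:
    """One training step: only the read-out head receives gradients."""
    with torch.no_grad():            # backbone stays frozen
        feats = backbone(video)      # (B, C, H, W), hypothetical API
    pred = probe(feats)
    loss = nn.functional.l1_loss(pred, depth_gt)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the head is so small, comparing probe accuracy across different frozen backbones gives a model-agnostic ranking of how much 3D information each one encodes.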

Top tags: video model evaluation, computer vision
Detailed tags: 3d understanding, video foundation models, feature analysis, benchmarking, emergent properties

How Much 3D Do Video Foundation Models Encode?


1️⃣ One-Sentence Summary

Using a model-agnostic framework, this paper evaluates how well existing video foundation models understand the 3D world, and finds that even without any dedicated 3D training data, state-of-the-art video generation models exhibit strong 3D understanding of scenes and objects, which can even surpass that of expert models trained specifically for 3D tasks.

Source: arXiv:2512.19949