DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
1️⃣ One-Sentence Summary
This paper introduces DynamicVerse, a new framework that uses large models to automatically build, from ordinary internet videos, a large-scale 4D (3D + time) dynamic-world dataset containing accurate 3D geometry, real-world motion, object segmentation masks, and textual descriptions, helping AI understand and simulate the real physical world more accurately.
Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or rely on traditional Structure-from-Motion for up-to-scale annotation, and they offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.
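The abstract describes processing long videos in overlapping windows and then aligning the per-window results globally. The paper's actual Bundle Adjustment is not specified here, so the sketch below is only a toy illustration of that window-then-align idea: it splits frame indices into overlapping windows and chains median depth ratios on the shared frames to recover one consistent scale per window. All function names (`make_windows`, `align_window_scales`) and parameter choices are hypothetical, not from the paper.

```python
import numpy as np

def make_windows(num_frames, window=20, overlap=5):
    """Split frame indices 0..num_frames-1 into overlapping windows (sketch)."""
    step = window - overlap
    starts = range(0, max(num_frames - overlap, 1), step)
    return [np.arange(s, min(s + window, num_frames)) for s in starts]

def align_window_scales(depths_per_window, windows):
    """Toy global alignment: chain median depth ratios on overlapping
    frames so every window's depths live in one shared scale."""
    scales = [1.0]  # first window defines the reference scale
    for i in range(1, len(windows)):
        prev_w, cur_w = windows[i - 1], windows[i]
        shared = np.intersect1d(prev_w, cur_w)
        prev_vals = depths_per_window[i - 1][np.isin(prev_w, shared)]
        cur_vals = depths_per_window[i][np.isin(cur_w, shared)]
        # median ratio is robust to a few outlier depth estimates
        scales.append(scales[-1] * float(np.median(prev_vals / cur_vals)))
    return scales

# Usage: per-window depths with arbitrary unknown scales get stitched back.
windows = make_windows(50, window=20, overlap=5)
true_depth = 1.0 + 0.1 * np.arange(50)
hidden_scales = [1.0, 2.0, 0.5]
per_window = [true_depth[w] / s for w, s in zip(windows, hidden_scales)]
recovered = align_window_scales(per_window, windows)
```

A real pipeline would jointly optimize camera poses, intrinsics, and points inside each window and close the loop with a global solve; this sketch only shows why overlapping windows make long sequences tractable.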