arXiv submission date: 2026-02-10
📄 Abstract - MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourages consistency between RGB and depth and enforces geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
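As a rough illustration of the back-projection-and-fusion step mentioned in the abstract, the sketch below (not the authors' code; the camera intrinsics `K`, camera-to-world extrinsics `T`, and all function names are assumptions) shows how generated multi-view RGBD frames could be lifted into one fused colored point cloud:

```python
# Minimal sketch, assuming known pinhole intrinsics K and camera-to-world
# extrinsics T per view, of back-projecting multi-view RGBD frames into a
# single fused colored point cloud. Interfaces here are illustrative only.
import numpy as np

def backproject_rgbd(rgb, depth, K, T_cam2world):
    """rgb: (H, W, 3), depth: (H, W) in meters, K: (3, 3), T_cam2world: (4, 4)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    valid = z > 0                                    # drop pixels with missing depth
    # Pixel -> camera coordinates via the pinhole model.
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    # Camera -> world coordinates (homogeneous transform).
    pts_world = (T_cam2world @ pts_cam.T).T[:, :3]
    colors = rgb.reshape(-1, 3)[valid]
    return pts_world, colors

def fuse_views(views):
    """views: iterable of (rgb, depth, K, T) tuples, e.g. the generated viewpoints."""
    all_pts, all_colors = [], []
    for rgb, depth, K, T in views:
        pts, cols = backproject_rgbd(rgb, depth, K, T)
        all_pts.append(pts)
        all_colors.append(cols)
    # Concatenating per-view clouds yields the more complete 3D structure
    # that the abstract describes assembling across time.
    return np.concatenate(all_pts), np.concatenate(all_colors)
```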

Top-level tags: robotics, computer vision, model training
Detailed tags: 4D world model, view-consistent generation, test-time optimization, RGBD prediction, robotic manipulation

MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation


1️⃣ One-Sentence Summary

This paper proposes MVISTA-4D, a new world model for robotic manipulation that, from only a single-view RGBD image, generates geometrically consistent, multi-view future scene dynamics, and uses a novel test-time action optimization method to convert these predicted futures into precise, executable robot actions.
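To make the test-time action optimization idea concrete, here is a minimal PyTorch-style sketch under assumed interfaces (`world_model`, `residual_idm`, the latent dimensionality, and all tensor shapes are hypothetical, not the paper's API): a trajectory-level latent is optimized by backpropagating a reconstruction loss through a frozen generative model, and a residual inverse-dynamics head then maps it to executable actions.

```python
# Minimal sketch (not the authors' code) of test-time action optimization:
# a trajectory-level latent z is optimized so that a frozen generative world
# model's rollout matches the imagined future, then a hypothetical residual
# inverse-dynamics head converts z into executable actions.
import torch

def infer_actions(world_model, residual_idm, obs_rgbd, imagined_future,
                  latent_dim=32, steps=100, lr=1e-2):
    """world_model and residual_idm are assumed to be frozen, differentiable modules."""
    z = torch.zeros(1, latent_dim, requires_grad=True)   # trajectory-level latent
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # Roll the frozen world model forward, conditioned on the current
        # observation and the candidate latent.
        predicted_future = world_model(obs_rgbd, z)
        # Match the future produced during the "imagine" phase.
        loss = torch.nn.functional.mse_loss(predicted_future, imagined_future)
        loss.backward()            # gradients flow through the generator into z only
        optimizer.step()

    with torch.no_grad():
        # The residual inverse-dynamics model refines the trajectory prior
        # carried by z into concrete executable actions.
        actions = residual_idm(obs_rgbd, z.detach())
    return actions
```

Optimizing a trajectory-level latent rather than regressing actions directly is how, per the abstract, the method sidesteps the ill-posedness of plain inverse dynamics, where multiple actions can explain the same transition.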

Source: arXiv 2602.09878