Causal World Modeling for Robot Control
1️⃣ One-Sentence Summary
This paper proposes a new robot learning framework named LingBot-VA, which combines video world modeling with vision-language pre-training so that robots can understand the causal relationship between actions and visual changes, autonomously predict future frames, and efficiently execute complex long-horizon manipulation tasks.
This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture; (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback through ground-truth observations; (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalization to novel configurations. The code and model are made publicly available to benefit the community.
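The asynchronous inference pipeline in (3) can be pictured as a producer-consumer loop: one thread runs the (slow) model forward pass to produce action chunks while another streams already-predicted actions to the motors. The sketch below is a minimal illustration of that idea, not the paper's implementation; all names (`predict_action_chunk`, `send_to_motors`, `get_observation`, the chunk shape) are hypothetical stand-ins.

```python
import queue
import threading
import numpy as np

def predict_action_chunk(observation: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for the slow model forward pass.

    Returns a chunk of future actions, e.g. 8 timesteps x 7-DoF.
    """
    return np.zeros((8, 7))

def send_to_motors(action: np.ndarray) -> None:
    """Hypothetical placeholder for the robot's low-level control interface."""
    pass

# A small bounded queue decouples prediction from execution.
action_queue: "queue.Queue[np.ndarray]" = queue.Queue(maxsize=2)

def prediction_worker(get_observation) -> None:
    # Producer: keeps predicting the next action chunk from the latest
    # observation while the consumer is still executing the previous one.
    while True:
        obs = get_observation()
        action_queue.put(predict_action_chunk(obs))

def execution_worker() -> None:
    # Consumer: streams actions to the motors at control frequency,
    # overlapping motor execution with the next model inference.
    while True:
        chunk = action_queue.get()
        for action in chunk:
            send_to_motors(action)

# Example wiring (would run indefinitely on real hardware;
# `camera.read` is another hypothetical observation source):
# threading.Thread(target=prediction_worker, args=(camera.read,), daemon=True).start()
# execution_worker()
```

Because the consumer replenishes from a queue fed by fresh observations, the robot never idles waiting for inference, which is the efficiency benefit the abstract attributes to the asynchronous design.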
Source: arXiv: 2601.21998