Grounded World Model for Semantically Generalizable Planning
1️⃣ One-sentence summary
This paper proposes a new method called GWM, which aligns visual, language, and action information into a unified representation space, allowing a robot to predict and plan actions directly from natural language instructions. As a result, it can still complete tasks well when encountering objects or environment descriptions it has never seen before.
In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, obtaining the goal image in advance of task execution is challenging, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored by how close its predicted future outcome is to the task instruction, as reflected by the similarity of their embeddings. This approach transforms visuomotor MPC into a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on a test set of 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of only 22%, even though they overfit the training set with a 90% success rate.
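The scoring loop described above can be sketched in a few lines: roll each candidate action through the world model, embed the predicted outcome in the shared vision-language latent space, and keep the action whose outcome is most similar to the instruction embedding. This is a minimal illustration, not the paper's implementation; the function and argument names (`gwm_mpc_select`, `world_model`) are hypothetical, and cosine similarity stands in for whatever embedding similarity the method actually uses.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def gwm_mpc_select(current_latent, instruction_embedding, action_proposals, world_model):
    """One MPC step, sketched: score each proposed action by how close the
    world model's predicted future latent is to the language instruction's
    embedding (both assumed to live in the same vision-language-aligned
    space), then return the highest-scoring action."""
    scores = [
        cosine(world_model(current_latent, action), instruction_embedding)
        for action in action_proposals
    ]
    return action_proposals[int(np.argmax(scores))]
```

In a real system the proposals would come from a sampler or policy and the loop would rerun at every control step; the sketch only shows why no goal image is needed, since the instruction embedding itself plays the role of the goal.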
Source: arXiv: 2604.11751