arXiv submission date: 2026-04-07
📄 Abstract - Action Images: End-to-End Policy Learning via Multiview Video Generation

World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
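The core idea of grounding actions in 2D pixels rests on projecting the robot's 3D end-effector trajectory into each camera view. The paper does not specify its rendering pipeline; the sketch below only illustrates the standard pinhole projection step that any such multi-view grounding would need, with made-up intrinsics, extrinsics, and waypoints.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world points into pixel coordinates via a pinhole camera.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    """
    cam = points_3d @ R.T + t          # world frame -> camera frame
    uv = cam @ K.T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide -> (u, v) pixels

# Toy example: a straight-line end-effector trajectory seen by one camera.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 1.0])          # camera 1 m in front of the origin
traj = np.stack([np.linspace(-0.1, 0.1, 5),
                 np.zeros(5),
                 np.zeros(5)], axis=1)  # 5 waypoints along the x-axis
pixels = project_points(traj, K, R, t)
print(pixels)
```

Repeating this projection per camera and drawing the resulting pixel tracks onto each view's frames yields multi-view "action images" of the kind the abstract describes; the remaining degrees of freedom (orientation, gripper state) would need their own visual encoding.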

Top tags: robotics, multi-modal, model training
Detailed tags: policy learning, video generation, world action models, zero-shot policy, multiview video

Action Images: End-to-End Policy Learning via Multiview Video Generation


1️⃣ One-sentence summary

This paper proposes a method called "Action Images", which translates robot actions into interpretable multi-view video clips, allowing an off-the-shelf video generation model to act directly as a robot policy without any extra modules, and achieving strong zero-shot performance across a range of tasks.

Source: arXiv:2604.06168