VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
1️⃣ One-Sentence Summary
This paper proposes VideoVLA, which adapts a large video generation model into a robot manipulator: given a language instruction and the current image, it jointly predicts the future action sequence and the resulting visual outcomes, significantly improving generalization to new tasks, objects, and environments.
Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.
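To make the dual-prediction idea concrete, below is a minimal, illustrative sketch of the interface the abstract describes: a single transformer jointly attends over language, image, and learned query tokens, and emits both an action sequence and latents for imagined future frames. This is not the paper's implementation; all module names, dimensions, and the token layout are assumptions, and the actual VideoVLA model is a multi-modal Diffusion Transformer trained with generative (denoising) objectives rather than the plain feed-forward heads shown here.

```python
# Hypothetical sketch of joint action + future-frame prediction (not the paper's code).
import torch
import torch.nn as nn

class DualPredictionVLA(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4,
                 action_dim=7, horizon=16, n_future_frames=8, patch_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(512, d_model)         # language embedding -> one token
        self.image_proj = nn.Linear(patch_dim, d_model)  # current-image patches -> tokens
        # Learned queries that will be decoded into actions and future-frame latents.
        self.action_queries = nn.Parameter(torch.randn(horizon, d_model))
        self.frame_queries = nn.Parameter(torch.randn(n_future_frames, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)  # per-step action
        self.frame_head = nn.Linear(d_model, patch_dim)    # per-frame visual latent

    def forward(self, text_emb, image_patches):
        # text_emb: (B, 512); image_patches: (B, P, patch_dim)
        b = text_emb.shape[0]
        tokens = torch.cat([
            self.text_proj(text_emb).unsqueeze(1),
            self.image_proj(image_patches),
            self.action_queries.expand(b, -1, -1),
            self.frame_queries.expand(b, -1, -1),
        ], dim=1)
        h = self.backbone(tokens)
        n_act, n_frm = self.action_queries.shape[0], self.frame_queries.shape[0]
        act_h = h[:, -(n_act + n_frm):-n_frm]  # action query outputs
        frm_h = h[:, -n_frm:]                  # future-frame query outputs
        return self.action_head(act_h), self.frame_head(frm_h)

model = DualPredictionVLA()
actions, future_latents = model(torch.randn(2, 512), torch.randn(2, 196, 768))
print(actions.shape, future_latents.shape)  # (2, 16, 7) (2, 8, 768)
```

The point of the sketch is the shared backbone: because the same representation must support both the imagined future video and the action sequence, the quality of the visual imagination and the reliability of the predicted actions are coupled, which is the correlation the paper's experiments highlight.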
Source: arXiv:2512.06963