📄 Paper Summary
RynnVLA-002: A Unified Vision-Language-Action and World Model
1️⃣ One-Sentence Summary
This paper proposes a unified framework that combines a Vision-Language-Action model with a world model; by letting the two models reinforce each other, it substantially improves robot task success rates in both simulated and real-world environments.
We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation. The unified framework of RynnVLA-002 enables joint learning of environmental dynamics and action planning. Our experiments show that RynnVLA-002 surpasses individual VLA and world models, demonstrating their mutual enhancement. We evaluate RynnVLA-002 in both simulation and real-world robot tasks. RynnVLA-002 achieves a 97.4% success rate on the LIBERO simulation benchmark without pretraining, while in real-world LeRobot experiments, its integrated world model boosts the overall success rate by 50%.
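To make the "mutual enhancement" idea concrete, below is a minimal sketch of joint VLA + world-model training: a shared backbone with two heads, one mapping observations to actions (the VLA branch) and one mapping observation plus action to the next observation (the world-model branch). This is not the paper's implementation; all module names, dimensions, and the simple summed loss are illustrative assumptions.

```python
# Minimal sketch (illustrative, NOT the RynnVLA-002 implementation):
# both objectives backpropagate into a shared backbone, so learning
# dynamics and learning action planning can reinforce each other.
import torch
import torch.nn as nn

class UnifiedVLAWorldModel(nn.Module):
    def __init__(self, obs_dim=512, act_dim=7, hidden=256):
        super().__init__()
        # Shared encoder over (pre-extracted) image-observation features
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # VLA head: observation features -> action
        self.action_head = nn.Linear(hidden, act_dim)
        # World-model head: observation features + action -> next observation features
        self.dynamics_head = nn.Sequential(
            nn.Linear(hidden + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs_feat, action):
        h = self.backbone(obs_feat)
        pred_action = self.action_head(h)                                    # VLA branch
        pred_next_obs = self.dynamics_head(torch.cat([h, action], dim=-1))  # world-model branch
        return pred_action, pred_next_obs

model = UnifiedVLAWorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One joint training step on dummy data (batch of 8):
# the action-imitation loss and the next-state-prediction loss both
# update the shared backbone.
obs, next_obs = torch.randn(8, 512), torch.randn(8, 512)
expert_action = torch.randn(8, 7)
pred_action, pred_next_obs = model(obs, expert_action)
loss = (nn.functional.mse_loss(pred_action, expert_action)
        + nn.functional.mse_loss(pred_next_obs, next_obs))
opt.zero_grad(); loss.backward(); opt.step()
```

In the actual paper, both branches operate on images and language rather than fixed-size feature vectors, but the design choice sketched here is the same: a single set of shared parameters is trained on the action objective and the future-state-prediction objective at once.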