Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

📄 Abstract

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at this https URL
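
The abstract describes test-time planning only at a high level: the model proposes actions, predicts future states, and predicts values, and these are used to pick action trajectories more likely to succeed. One simple way to picture this is best-of-N selection, sketched below. This is an illustrative assumption, not the authors' published procedure. The names `plan_best_of_n`, `Candidate`, `sample_actions`, `predict_future`, and `predict_value` are hypothetical stand-ins for the fine-tuned video model's three outputs, and the toy 1-D task exists only so the sketch runs.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Candidate:
    """One sampled plan: an action chunk and its predicted value (hypothetical type)."""
    actions: List[float]
    value: float


def plan_best_of_n(
    observation: Sequence[float],
    sample_actions: Callable[[Sequence[float]], List[float]],
    predict_future: Callable[[Sequence[float], List[float]], Sequence[float]],
    predict_value: Callable[[Sequence[float]], float],
    num_candidates: int = 8,
) -> Candidate:
    """Sample N action chunks, score each by the value of its predicted
    future state, and return the highest-scoring one.

    Assumption: best-of-N is used here for illustration only; the source
    does not specify the search procedure.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")

    best: Optional[Candidate] = None
    for _ in range(num_candidates):
        actions = sample_actions(observation)          # policy output
        future = predict_future(observation, actions)  # world-model output
        value = predict_value(future)                  # value-function output
        if best is None or value > best.value:
            best = Candidate(actions=actions, value=value)
    return best


if __name__ == "__main__":
    # Toy stand-ins: a 1-D point that should reach position 1.0.
    goal = 1.0
    rng = random.Random(0)

    def toy_sample_actions(obs: Sequence[float]) -> List[float]:
        return [rng.uniform(-0.5, 0.5) for _ in range(4)]

    def toy_predict_future(obs: Sequence[float], actions: List[float]) -> List[float]:
        return [obs[0] + sum(actions)]

    def toy_predict_value(state: Sequence[float]) -> float:
        return -abs(goal - state[0])  # closer to the goal means higher value

    chosen = plan_best_of_n(
        [0.0], toy_sample_actions, toy_predict_future, toy_predict_value
    )
    print(chosen)
```

In this sketch, raising `num_candidates` trades extra model calls for a better chance of finding a high-value plan, which is one plausible reading of "planning of action trajectories with higher likelihood of success".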

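1️⃣ One-Sentence Summary

This paper proposes Cosmos Policy, a simple method that fine-tunes a large pretrained video model in a single stage, directly on robot demonstration data from the target platform and with no changes to the model architecture. The result is a high-performing robot policy that directly generates robot actions, predicts future states, and plans, achieving leading performance across multiple simulation and real-world tasks.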
