EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

Abstract

Learning new tasks efficiently and reliably has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability required for real-world deployment. Reinforcement learning (RL) finetuning offers a promising path to bridge this gap, but existing approaches either train from scratch without fully leveraging pretrained priors, or finetune VLAs without achieving the sample efficiency and success rates that practical deployment demands. We present EXPO-FT, a system for stable, sample-efficient RL finetuning of pretrained VLA policies that closes this gap. Our system solves a suite of challenging manipulation tasks, including routing string lights and inserting the plug to light them up, striking a pool ball into a pocket, and inserting a flower into a wine bottle, each requiring combinations of high precision, dynamic actions, and robustness to varied initial states. It achieves perfect task performance (30/30 successes) across all evaluated tasks within an average of 19.1 minutes of online robot data, outperforming both prior RL-from-scratch and VLA finetuning approaches. We release an open-source codebase with the aim of facilitating broader adoption of RL finetuning of VLA models in robotics.

One-Sentence Summary

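This paper presents EXPO-FT, a system that uses reinforcement learning to efficiently finetune pretrained vision-language-action models. It enables robots to learn high-precision, dynamic, complex manipulation tasks in a very short time (about 19 minutes on average) and to reach a 100% success rate, far surpassing existing methods.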
