arXiv submission date: 2026-05-25
📄 Abstract - EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability required for real-world deployment. Reinforcement learning (RL) fine-tuning offers a promising path to bridge this gap, but existing approaches either train from scratch without fully leveraging pretrained priors, or fine-tune VLAs without achieving the sample efficiency and success rates that practical deployment demands. We present EXPO-FT, a system for stable, sample-efficient RL finetuning of pretrained VLA policies that closes this gap. Our system solves a suite of challenging manipulation tasks, including routing string lights and inserting the plug to light it up, striking a pool ball into a pocket, and inserting a flower into a wine bottle, each requiring combinations of high precision, dynamic actions, and robustness to varied initial states. Our system achieves perfect task performance (30/30 successes) across all evaluated tasks within an average of 19.1 minutes of online robot data, outperforming both prior RL-from-scratch and VLA finetuning approaches. We release an open-source codebase with the aim of facilitating broader adoption of RL finetuning of VLA models in robotics.

Top-level tags: reinforcement learning, robotics, multi-modal
Detailed tags: vision-language-action models, sample efficient finetuning, manipulation tasks

EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models


1️⃣ One-Sentence Summary

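This paper presents EXPO-FT, a system that uses reinforcement learning to efficiently fine-tune pretrained vision-language-action models. It enables robots to learn high-precision, dynamic, complex manipulation tasks in a very short time (about 19 minutes of online robot data on average) and reach a 100% success rate (30/30 on every evaluated task), outperforming both prior RL-from-scratch and VLA finetuning approaches.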

Source: arXiv: 2605.25477