Reducing Oracle Feedback with Vision-Language Embeddings for Preference-Based RL
1️⃣ One-Sentence Summary
This paper proposes a hybrid framework called ROVED that combines cheap vision-language models with accurate but expensive oracle feedback: by querying the oracle only when the model is uncertain, and continually refining the model with that feedback, it reduces oracle queries by up to 80% on robotic manipulation tasks while matching or even improving learning performance.
Preference-based reinforcement learning can learn effective reward functions from comparisons, but its scalability is constrained by the high cost of oracle feedback. Lightweight vision-language embedding (VLE) models offer a cheaper alternative, but their noisy outputs limit their effectiveness as standalone reward generators. To address this challenge, we propose ROVED, a hybrid framework that combines VLE-based supervision with targeted oracle feedback. Our method uses the VLE to generate segment-level preferences and defers to an oracle only for samples identified as high-uncertainty by a filtering mechanism. In addition, we introduce a parameter-efficient fine-tuning method that adapts the VLE with the collected oracle feedback, improving the model over time. This retains the scalability of embeddings and the accuracy of oracle supervision while avoiding the inefficiencies of either alone. Across multiple robotic manipulation tasks, ROVED matches or surpasses prior preference-based methods while reducing oracle queries by up to 80%. Remarkably, the adapted VLE generalizes across tasks, yielding cumulative annotation savings of up to 90% and highlighting the practicality of combining scalable embeddings with precise oracle supervision for preference-based RL.
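The abstract's core idea — label a preference pair with the VLE, but defer to the oracle when the VLE is uncertain — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine-similarity scoring, the margin-based uncertainty filter, and all names (`vle_preference`, `margin`, the toy embeddings) are assumptions for the sake of the example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def vle_preference(emb_a, emb_b, task_emb, margin=0.05):
    """Score two trajectory-segment embeddings against a task embedding.

    Returns (label, needs_oracle):
      label        -- 1 if segment A is preferred, 0 if segment B is preferred
      needs_oracle -- True when the score margin is too small to trust the VLE,
                      i.e. the sample should be routed to the oracle instead.
    """
    score_a = cosine(emb_a, task_emb)
    score_b = cosine(emb_b, task_emb)
    needs_oracle = abs(score_a - score_b) < margin  # uncertainty filter
    label = 1 if score_a > score_b else 0
    return label, needs_oracle

# Toy example: one segment embedding close to the task embedding, one far away.
task = np.array([1.0, 0.0, 0.0])
close = np.array([0.9, 0.1, 0.0])
far = np.array([0.1, 0.9, 0.0])
label, ask_oracle = vle_preference(close, far, task)
# Large margin here, so the VLE label is kept and no oracle query is issued.
```

In the full method, pairs flagged with `needs_oracle` would be sent to the oracle, and the resulting labels reused to fine-tune the VLE, which is what drives the reported query savings.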
Source: arXiv: 2603.28053