arXiv submission date: 2026-04-23
📄 Abstract - How VLAs (Really) Work In Open-World Environments

Vision-language-action models (VLAs) have been extensively used in robotics applications, achieving great success in various manipulation problems. More recently, VLAs have been applied to long-horizon tasks and evaluated on benchmarks such as BEHAVIOR-1K (B1K) for solving complex household chores. The common metric for measuring progress on such benchmarks is success rate, or a partial score based on the satisfaction of progress-agnostic criteria, meaning only the final states of the objects are considered, regardless of the events that led to those states. In this paper, we argue that such evaluation protocols say little about the safety aspects of operation and can potentially exaggerate reported performance, obscuring core challenges for future real-world deployment. To this end, we conduct a thorough analysis of state-of-the-art models on the B1K Challenge and evaluate policies in terms of robustness (via reproducibility and consistency of performance), safety aspects of policy operation, task awareness, and the key factors leading to task incompletion. We then propose evaluation protocols that capture safety violations to better measure the true performance of policies in more complex and interactive scenarios. Finally, we discuss the limitations of existing VLAs and motivate future research.

Top-level tags: robotics, agents, model evaluation
Detailed tags: vision-language-action models, benchmark, safety, reproducibility, long-horizon tasks

How VLAs (Really) Work In Open-World Environments


1️⃣ One-Sentence Summary

This paper argues that current evaluations of vision-language-action models (VLAs) on household tasks, by focusing solely on final success rates, overlook safety hazards during operation and inflate reported performance; it proposes evaluation protocols centered on robustness, consistency, and safety violations to more faithfully reflect what these models can actually do in complex open-world scenarios.

Source: arXiv 2604.21192