arXiv submission date: 2026-02-05
📄 Abstract - Benchmarking Affordance Generalization with BusyBox

Vision-Language-Action (VLA) models have been attracting the attention of researchers and practitioners thanks to their promise of generalization. Although single-task policies still offer competitive performance, VLAs are increasingly able to handle commands and environments unseen in their training set. While generalization in vision and language space is undoubtedly important for robust versatile behaviors, a key meta-skill VLAs need to possess is affordance generalization -- the ability to manipulate new objects with familiar physical features. In this work, we present BusyBox, a physical benchmark for systematic semi-automatic evaluation of VLAs' affordance generalization. BusyBox consists of 6 modules with switches, sliders, wires, buttons, a display, and a dial. The modules can be swapped and rotated to create a multitude of BusyBox variations with different visual appearances but the same set of affordances. We empirically demonstrate that generalization across BusyBox variants is highly challenging even for strong open-weights VLAs such as $\pi_{0.5}$ and GR00T-N1.6. To encourage the research community to evaluate their own VLAs on BusyBox and to propose new affordance generalization experiments, we have designed BusyBox to be easy to build in most robotics labs. We release the full set of CAD files for 3D-printing its parts as well as a bill of materials for (optionally) assembling its electronics. We also publish a dataset of language-annotated demonstrations that we collected using the common bimanual Mobile Aloha robot on the canonical BusyBox configuration. All of the released materials are available at this https URL.
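To make the claimed "multitude of BusyBox variations" concrete, here is a minimal Python sketch of the underlying combinatorics. The abstract only states that there are 6 swappable, rotatable modules; the slot count (`NUM_SLOTS`), the four-orientation rotation assumption (`ROTATIONS_PER_MODULE`), and the helper `num_variants` are illustrative assumptions, not details from the paper.

```python
from math import factorial

# Illustrative assumptions only: the abstract states 6 modules but does
# not specify slot or rotation counts.
NUM_MODULES = 6           # stated in the abstract
NUM_SLOTS = 6             # assumption: one slot per module
ROTATIONS_PER_MODULE = 4  # assumption: 90-degree increments

def num_variants(modules: int, slots: int, rotations: int) -> int:
    """Count distinct BusyBox configurations under the assumptions above:
    every ordered placement of modules into slots, times independent
    rotation choices for each placed module."""
    placements = factorial(modules) // factorial(modules - slots)
    return placements * rotations ** slots

print(num_variants(NUM_MODULES, NUM_SLOTS, ROTATIONS_PER_MODULE))
# 6! * 4^6 = 720 * 4096 = 2,949,120 visually distinct configurations
```

Even under these modest assumptions, the design space runs into the millions of visually distinct layouts while the set of affordances stays fixed, which is what makes the benchmark a clean probe of affordance generalization.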

Top-level tags: robotics, multi-modal, benchmark
Detailed tags: affordance generalization, vision-language-action models, physical benchmark, manipulation, evaluation

Benchmarking Affordance Generalization with BusyBox


1️⃣ One-sentence summary

This paper presents BusyBox, a physical test platform for systematically evaluating whether vision-language-action models, when facing novel objects, can operate them correctly based on familiar physical features (such as switches and dials). The authors find that even state-of-the-art models still struggle with this kind of affordance generalization.

Source: arXiv:2602.05441