arXiv submission date: 2026-07-01
📄 Abstract - FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

Current work on robot furniture assembly mostly focuses on toy-scale settings or single-arm manipulation. We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs). We formalize the task, develop a scalable simulation pipeline for expert data generation and evaluation, and build a VR teleoperation system for single-operator bimanual control to collect high-quality real-world demonstrations. To address extreme long-horizon assembly with up to 7 subtasks and 1550 control steps, we propose a progress-enhanced VLA, finetuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal, enabling automatic subtask transitions and reducing compounding errors during inference. We further study perception and control design factors that critically affect precision in real-scale assembly. FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types, with an additional 21% gain from our design factor study. We validate on a real Kinova Gen3 platform with only a 16% drop on the hardest task.
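
The abstract's idea of using a predicted progress signal to trigger subtask transitions can be illustrated with a minimal sketch. This is not the authors' implementation: the `Policy` interface, the `threshold` and `patience` parameters, and the function name `run_with_progress_switching` are all hypothetical, and the 1550-step default is only borrowed from the step count quoted in the abstract. The sketch assumes the policy returns an action together with a progress estimate in [0, 1] for the current subtask instruction.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical policy interface (an assumption, not the paper's API):
# (observation, subtask instruction) -> (action, progress estimate in [0, 1]).
Policy = Callable[[object, str], Tuple[Sequence[float], float]]


@dataclass
class RolloutResult:
    completed_subtasks: int          # number of subtasks the controller moved past
    steps_used: int                  # control steps consumed
    actions: List[Sequence[float]]   # actions emitted, in order


def run_with_progress_switching(
    policy: Policy,
    get_observation: Callable[[], object],
    subtasks: Sequence[str],
    max_steps: int = 1550,    # step budget; value taken from the abstract's figure
    threshold: float = 0.95,  # hypothetical "subtask done" progress level
    patience: int = 3,        # hypothetical: consecutive steps required above threshold
) -> RolloutResult:
    """Advance to the next subtask once predicted progress stays high."""
    idx = 0      # index of the active subtask
    streak = 0   # consecutive steps with progress >= threshold
    steps = 0
    actions: List[Sequence[float]] = []

    while steps < max_steps and idx < len(subtasks):
        action, progress = policy(get_observation(), subtasks[idx])
        actions.append(action)  # a real system would send this to the robot
        steps += 1

        # Require several consecutive high readings so that a single noisy
        # progress estimate does not trigger a premature transition.
        streak = streak + 1 if progress >= threshold else 0
        if streak >= patience:
            idx += 1
            streak = 0

    return RolloutResult(completed_subtasks=idx, steps_used=steps, actions=actions)
```

Requiring the progress estimate to stay above the threshold for a few steps is one simple way to debounce the signal; the paper may use a different transition rule.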

Top-level tags: robotics, vision-language-action model, long-horizon
Detailed tags: bimanual manipulation, furniture assembly, simulation pipeline, sim-to-real transfer, progress signal

FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model


1️⃣ One-Sentence Summary

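This paper presents the first systematic study of bimanual robotic assembly of real-scale furniture and proposes FurnitureVLA, a vision-language-action model that uses a progress signal and semantically grounded subtask decomposition to substantially raise success rates on complex multi-step assembly tasks, with its effectiveness validated on a real robot platform.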

Source: arXiv: 2607.01212