Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
1️⃣ One-Sentence Summary
This paper proposes a new pretraining method that uses human videos to align 2D visual information with 3D physical space, so that a robot's AI model acquires 3D spatial understanding before it begins task-specific policy learning. This significantly improves the accuracy and adaptability of the robot's actions in real-world environments.
Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that performs explicit alignment between visual space and physical space during pretraining, enabling models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.
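The abstract describes VIPA-VLA only at a high level: a dual-encoder architecture in which a 3D visual encoder augments the semantic features of a pretrained vision-language backbone before action prediction. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the encoder sizes, concatenation-based fusion, 7-DoF action head, and all class/parameter names (`DualEncoderVLA`, `encoder_2d`, `encoder_3d`, `fuse`, `action_head`) are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DualEncoderVLA(nn.Module):
    """Hypothetical dual-encoder VLA backbone: a 2D semantic encoder is
    augmented with a 3D-aware encoder, and the fused features condition
    an action head. Shapes and fusion strategy are illustrative only."""

    def __init__(self, dim_2d=768, dim_3d=256, action_dim=7):
        super().__init__()
        # Placeholder encoders; in practice these would be a pretrained
        # vision-language backbone and a 3D (depth / point-cloud) encoder.
        self.encoder_2d = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim_2d))
        self.encoder_3d = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim_3d))
        # Simple fusion of semantic and 3D-aware features by concatenation.
        self.fuse = nn.Linear(dim_2d + dim_3d, dim_2d)
        # Action head predicting, e.g., a 7-DoF end-effector command.
        self.action_head = nn.Linear(dim_2d, action_dim)

    def forward(self, rgb, geom):
        f2d = self.encoder_2d(rgb)    # semantic visual features
        f3d = self.encoder_3d(geom)   # 3D-aware features (depth, points, ...)
        fused = torch.relu(self.fuse(torch.cat([f2d, f3d], dim=-1)))
        return self.action_head(fused)

# Usage with dummy tensors: a batch of RGB frames and a geometric input.
model = DualEncoderVLA()
actions = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224))
print(actions.shape)  # torch.Size([2, 7])
```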
Source: arXiv:2512.13080