TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
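1️⃣ One-Sentence Summary
This paper proposes TempoVLA, a policy that adjusts the magnitude of its actions according to a speed condition, so that a robot vision-language-action model can execute low-risk operations quickly and slow down during risky contact, achieving flexible bidirectional speed control for the first time.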
Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) inherit only a single fixed speed from their training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components: (1) a data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times a demonstration to any target speed by merging or splitting actions while preserving its motion semantics, and (2) a model-side conditioning mechanism that feeds the target speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts performance at the default $1\times$ speed via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
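To make the idea of re-timing by merging or splitting actions concrete, here is a minimal sketch. It is not the paper's VSTA implementation: it assumes every action is an additive per-step displacement (for example, a delta end-effector position), so that merging is summation and splitting is even division, and it handles arbitrary speeds by resampling the cumulative path. The function name `retime_delta_actions` is hypothetical, and real demonstrations would also need separate treatment of rotations, gripper commands, and the paired observations.

```python
import numpy as np


def retime_delta_actions(actions: np.ndarray, speed: float) -> np.ndarray:
    """Re-time additive delta actions to a target speed (illustrative only).

    actions: array of shape (T, D), one displacement per control step.
    speed:   > 1 merges steps (faster), < 1 splits steps (slower).
    Returns an array of shape (round(T / speed), D) covering the same path.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    actions = np.asarray(actions, dtype=float)
    num_steps, dim = actions.shape

    # Cumulative positions visited at step boundaries 0..T (assumption:
    # actions are additive, so position is the running sum of deltas).
    path = np.vstack([np.zeros((1, dim)), np.cumsum(actions, axis=0)])

    # Fewer steps for speed > 1, more steps for speed < 1.
    new_num_steps = max(1, int(round(num_steps / speed)))
    old_t = np.arange(num_steps + 1)
    new_t = np.linspace(0, num_steps, new_num_steps + 1)

    # Linearly interpolate each dimension of the path at the new boundaries.
    new_path = np.stack(
        [np.interp(new_t, old_t, path[:, d]) for d in range(dim)], axis=1
    )

    # Differences between consecutive boundaries are the re-timed actions.
    return np.diff(new_path, axis=0)
```

Under these assumptions, a speed of 2 on an even-length sequence sums consecutive pairs of actions, a speed of 0.5 halves each action into two steps, and the total displacement is unchanged in both cases. Linear interpolation is only one possible choice; it keeps the endpoints exact but can cut corners of the original path when steps are merged.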
Source: arXiv: 2606.06491