📄 Abstract - SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, prior work has explored lightweight VLMs, but this compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and reducing memory footprint by 12 times.
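
As a rough illustration of the training recipe described in the abstract, here is a minimal PyTorch-style sketch of how learnable Fusion Tokens, the future-prediction objective, and the mask-and-reconstruct strategy could fit together. All class names, feature shapes, and loss weightings are assumptions made for this sketch; they are not the authors' implementation.

```python
# Minimal PyTorch-style sketch of the training ideas in the abstract:
# learnable Fusion Tokens, a future-prediction objective, and the
# mask-and-reconstruct strategy. All names, shapes, and loss weights are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionTokenVLA(nn.Module):
    def __init__(self, dim=256, n_fusion_tokens=8, action_dim=7):
        super().__init__()
        # Learnable Fusion Tokens that aggregate 2D image and 4D geometry features.
        self.fusion_tokens = nn.Parameter(0.02 * torch.randn(1, n_fusion_tokens, dim))
        # Stand-in for the lightweight VLM backbone (one Transformer layer here).
        self.backbone = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.action_head = nn.Linear(dim, action_dim)   # action generation
        self.future_head = nn.Linear(dim, dim)          # future-prediction objective
        self.recon_head = nn.Linear(dim, dim)           # reconstruct masked 4D features

    def forward(self, img_feats, feats_4d, mask_4d=True):
        B = img_feats.size(0)
        tokens = self.fusion_tokens.expand(B, -1, -1)
        # Mask-and-reconstruct: hide the 4D branch from the VLM so it must learn
        # representations that recover 4D structure from 2D inputs alone.
        visible_4d = torch.zeros_like(feats_4d) if mask_4d else feats_4d
        x = self.backbone(torch.cat([tokens, img_feats, visible_4d], dim=1))
        fused = x[:, : tokens.size(1)]                  # unified representation
        actions = self.action_head(fused.mean(dim=1))
        future_pred = self.future_head(fused).mean(dim=1)
        recon_4d = self.recon_head(x[:, -feats_4d.size(1):])
        return actions, future_pred, recon_4d


def training_step(model, img_feats, feats_4d, future_feats, target_actions):
    actions, future_pred, recon_4d = model(img_feats, feats_4d, mask_4d=True)
    return (F.mse_loss(actions, target_actions)      # imitate expert actions
            + F.mse_loss(future_pred, future_feats)  # predict future visual features
            + F.mse_loss(recon_4d, feats_4d))        # reconstruct the masked 4D inputs
```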

Top-level tags: multi-modal, model training, robotics
Detailed tags: vision-language-action, 4D understanding, lightweight models, spatiotemporal reasoning, edge deployment

SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead


1️⃣ One-Sentence Summary

This paper proposes a new architecture called SwiftVLA. Through its Fusion Tokens and mask-and-reconstruct training strategy, it lets a lightweight vision-language-action model understand the spatiotemporal dynamics of video the way much larger models do while staying efficient, enabling high-performance, low-latency robot control on edge devices.
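
To make the deployment claim concrete: once training has taught the backbone to work without 4D inputs, inference can skip the 4D geometry branch entirely. The hypothetical snippet below continues the sketch above; the shapes and the zero placeholder are assumptions of this sketch, not the paper's implementation.

```python
# Continuing the sketch above: at inference the 4D geometry branch is dropped,
# so the edge device only computes 2D image features. The zero tensor is a
# placeholder standing in for the skipped 4D branch (an assumption of this sketch).
import torch

model = FusionTokenVLA().eval()
img_feats = torch.randn(1, 64, 256)      # features from the 2D vision encoder
dummy_4d = torch.zeros(1, 16, 256)       # the 4D extractor is never run at deployment
with torch.no_grad():
    actions, _, _ = model(img_feats, dummy_4d, mask_4d=True)
print(actions.shape)                     # torch.Size([1, 7])
```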

