arXiv submission date: 2025-12-12
📄 Abstract - An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. This field is exploding with new models and datasets, making it both exciting and challenging to keep pace with. This survey offers a clear and structured guide to the VLA landscape. We design it to follow the natural learning path of a researcher: we start with the basic Modules of any VLA model, trace the history through key Milestones, and then dive deep into the core Challenges that define the recent research frontier. Our main contribution is a detailed breakdown of the five biggest challenges: (1) Representation, (2) Execution, (3) Generalization, (4) Safety, and (5) Dataset and Evaluation. This structure mirrors the developmental roadmap of a generalist agent: establishing the fundamental perception-action loop, scaling capabilities across diverse embodiments and environments, and finally ensuring trustworthy deployment, all supported by the essential data infrastructure. For each challenge, we review existing approaches and highlight future opportunities. We position this paper as both a foundational guide for newcomers and a strategic roadmap for experienced researchers, with the dual aim of accelerating learning and inspiring new ideas in embodied intelligence. A live version of this survey, with continuous updates, is maintained on our project page (this https URL).

Top-level tags: robotics multi-modal agents
Detailed tags: vision-language-action embodied ai survey robotic control perception-action loop

An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges


1️⃣ One-Sentence Summary

This survey systematically reviews the Vision-Language-Action models driving progress in robotics: by dissecting their core modules and developmental milestones, and focusing on the five core challenges of representation, execution, generalization, safety, and dataset and evaluation, it offers researchers a clear roadmap from the basics to the research frontier.

Source: arXiv:2512.11362