arXiv submission date: 2026-01-26
📄 Abstract - A Pragmatic VLA Foundation Model

A capable Vision-Language-Action (VLA) foundation model holds great potential for robotic manipulation: it is expected to generalize faithfully across tasks and platforms while remaining cost-efficient (e.g., in the data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase that delivers a throughput of 261 samples per second per GPU on an 8-GPU training setup, a 1.5–2.8$\times$ speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. These features make our model well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
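
A quick back-of-envelope sketch of the abstract's figures, assuming every task consumes the full 130-episode post-training budget and that the reported per-GPU throughput holds across all 8 GPUs (the abstract states the individual numbers but not these combinations):

```python
# Rough arithmetic over the numbers reported in the abstract.
# The constants come from the abstract; how they are combined is an assumption.

PLATFORMS = 3             # evaluation platforms
TASKS_PER_PLATFORM = 100  # tasks completed per platform
EPISODES_PER_TASK = 130   # post-training episodes per task

THROUGHPUT_PER_GPU = 261  # samples / second / GPU
NUM_GPUS = 8              # GPUs in the reported training setup

# Total post-training episodes across the whole benchmark.
total_episodes = PLATFORMS * TASKS_PER_PLATFORM * EPISODES_PER_TASK
print(f"post-training episodes: {total_episodes:,}")       # 39,000

# Aggregate training throughput of the 8-GPU setup.
aggregate = THROUGHPUT_PER_GPU * NUM_GPUS
print(f"aggregate throughput: {aggregate:,} samples/s")    # 2,088
```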

Top tags: robotics, multi-modal, model training
Detailed tags: vision-language-action, robot manipulation, foundation model, real-world data, generalizability

A Pragmatic Vision-Language-Action (VLA) Foundation Model


1️⃣ One-Sentence Summary

This paper presents LingBot-VLA, a pragmatic robot foundation model trained on a large amount of real-world data. It performs well on diverse tasks across multiple robot platforms, trains efficiently, and is released as open source, aiming to advance the practical application and development of robot learning.

Source: arXiv:2601.18692