DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

📄 Abstract - DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: this https URL
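The idea of per-modality buffers can be illustrated with a small simulation. The sketch below is not the authors' implementation: it only shows buffers being refreshed at different rates while a 100 Hz control loop reads whatever is most recent. The class `LatentBuffer`, the function `run_episode`, the 1 kHz base clock, and the 500 Hz tactile and 10 Hz vision rates are all assumptions made for illustration; the abstract states only that the fast modality changes at hundreds of hertz and that control runs at 100 Hz.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

BASE_HZ = 1000  # simulated base clock, one tick = 1 ms (assumption)


@dataclass
class LatentBuffer:
    """Holds the latest latent of one modality (hypothetical helper)."""
    rate_hz: Optional[int]           # None: written once at episode start
    latent: Optional[str] = None
    last_tick: int = -1
    writes: int = 0

    def maybe_refresh(self, name: str, tick: int) -> None:
        # Refresh only when this modality's own sensor period has elapsed.
        if self.rate_hz is None:
            due = tick == 0
        else:
            due = tick % (BASE_HZ // self.rate_hz) == 0
        if due:
            self.latent = f"{name}@{tick}"  # stand-in for an encoder output
            self.last_tick = tick
            self.writes += 1


def run_episode(num_ticks: int, control_hz: int,
                buffers: Dict[str, LatentBuffer]) -> List[Dict[str, int]]:
    """Return, for each control step, how stale each buffer was (in ticks)."""
    control_period = BASE_HZ // control_hz
    staleness_log: List[Dict[str, int]] = []
    for tick in range(num_ticks):
        for name, buf in buffers.items():
            buf.maybe_refresh(name, tick)
        if tick % control_period == 0:
            # The action head reads every buffer, fresh or not.
            staleness_log.append(
                {name: tick - buf.last_tick for name, buf in buffers.items()}
            )
    return staleness_log


buffers = {
    "tactile": LatentBuffer(rate_hz=500),    # assumed fast modality
    "vision": LatentBuffer(rate_hz=10),      # assumed camera rate
    "language": LatentBuffer(rate_hz=None),  # constant across the episode
}
log = run_episode(num_ticks=1000, control_hz=100, buffers=buffers)
```

Over one simulated second, the control loop emits 100 steps while the language buffer is written once and the vision buffer ten times. A synchronous design would instead re-encode every modality on every step, or slow the control loop to the slowest encoder.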


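1️⃣ One-sentence summary

This paper proposes DAM-VLA, a model that lets modalities such as touch, vision, and language update independently at their own sensor rates before being fused, addressing the rate mismatch of conventional synchronous models. Across seven challenging robot manipulation tasks, it raises the average success rate from about 41% to over 95% while delivering smooth, real-time control at 100 Hz.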
