arXiv submission date: 2025-12-09
📄 Abstract - Learning Robot Manipulation from Audio World Models

World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning; for example, when filling a bottle with water, visual information alone is ambiguous or incomplete, requiring the system to reason over the temporal evolution of audio and account for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model that anticipates future audio observations, enabling the system to reason about long-term consequences when integrated into a robot policy. We demonstrate the superior capabilities of our system on two manipulation tasks that require perceiving in-the-wild audio or music signals, compared to methods without future lookahead. We further emphasize that successful robot action learning for these tasks relies not merely on multi-modal input, but critically on the accurate prediction of future audio states that embody intrinsic rhythmic patterns.
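
The abstract does not detail the model, but the core idea of flow matching over audio latents can be illustrated in a minimal sketch: sample a noisy interpolation between a noise latent and the future audio latent, and regress the straight-line velocity conditioned on past audio. Everything below is hypothetical, not the paper's implementation: the module name `AudioLatentFM`, the hidden sizes, and the use of dummy tensors as stand-ins for encoded audio are all assumptions.

```python
# A minimal sketch of conditional latent flow matching for future-audio
# prediction. Assumes audio has already been encoded into fixed-size
# latents; all names and dimensions here are illustrative, not the paper's.
import torch
import torch.nn as nn

class AudioLatentFM(nn.Module):
    """Predicts a velocity field over future audio latents,
    conditioned on a context summarizing past audio."""
    def __init__(self, latent_dim=64, ctx_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + ctx_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x_t, t, ctx):
        # x_t: interpolated future latent, t: flow time in [0, 1],
        # ctx: latent context of the observed past audio
        return self.net(torch.cat([x_t, t, ctx], dim=-1))

def fm_loss(model, z_future, ctx):
    """Standard flow-matching objective: regress the constant
    velocity from noise to data along a linear interpolation path."""
    x0 = torch.randn_like(z_future)        # noise sample
    t = torch.rand(z_future.size(0), 1)    # random flow time per example
    x_t = (1 - t) * x0 + t * z_future      # linear interpolation point
    v_target = z_future - x0               # target velocity (data - noise)
    v_pred = model(x_t, t, ctx)
    return ((v_pred - v_target) ** 2).mean()

# Usage: one training step on dummy latents (stand-ins for encoded audio).
model = AudioLatentFM()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
z_future = torch.randn(8, 64)   # latents of the future audio segment
ctx = torch.randn(8, 64)        # latents summarizing observed past audio
loss = fm_loss(model, z_future, ctx)
loss.backward()
opt.step()
```

At inference time, such a model would integrate the learned velocity field from noise to a predicted future latent (e.g., with a few Euler steps), which a policy could then consume as a lookahead signal.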

Top tags: robotics, multi-modal, model training
Detailed tags: audio world models, latent flow matching, robot manipulation, multimodal reasoning, future prediction

Learning Robot Manipulation from Audio World Models


1️⃣ One-Sentence Summary

This paper proposes a generative model that predicts future audio, helping robots better complete complex manipulation tasks that require auditory judgment by listening to and understanding the rhythm and physical properties of sound.


Source: arXiv 2512.08405