arXiv submission date: 2026-06-10
📄 Abstract - Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.
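
The alignment idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of the AGRA objective, so everything below is an assumption made for illustration: a patch-wise cosine-similarity loss between projected video-model hidden states and frozen encoder features, with hypothetical names (`alignment_loss`, `proj`) and arbitrary shapes. It is not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row vector of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def alignment_loss(diffusion_feats, encoder_feats, proj):
    """Negative mean patch-wise cosine similarity (assumed loss form).

    diffusion_feats: (num_patches, d_model) intermediate video-model features
    encoder_feats:   (num_patches, d_enc) features from a frozen visual encoder
    proj:            (d_model, d_enc) hypothetical learned linear projection
    """
    projected = l2_normalize(diffusion_feats @ proj)
    target = l2_normalize(encoder_feats)
    cosine = np.sum(projected * target, axis=-1)  # one similarity per patch
    return -float(np.mean(cosine))

# Toy data: random stand-ins for real features (shapes are arbitrary).
rng = np.random.default_rng(0)
num_patches, d_model, d_enc = 16, 32, 24
hidden = rng.normal(size=(num_patches, d_model))
proj = rng.normal(size=(d_model, d_enc))

aligned_target = hidden @ proj                      # perfectly aligned case
random_target = rng.normal(size=(num_patches, d_enc))  # unrelated features

loss_aligned = alignment_loss(hidden, aligned_target, proj)
loss_random = alignment_loss(hidden, random_target, proj)
print(loss_aligned, loss_random)
```

Under this assumed formulation, the loss approaches -1 when the projected hidden states match the encoder features and stays near 0 for unrelated features. In training, a term like this would presumably be added to the main objective with a weighting coefficient, which is also an assumption here.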

Top-level tags: robotics, computer vision, machine learning
Detailed tags: world action model, representation alignment, robot manipulation, video diffusion, action grounding

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models


1️⃣ One-sentence summary

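This paper finds that video prediction models used for robot manipulation can generate realistic future scenes yet often fail to yield accurate actions, because their hidden states are organized for visual reconstruction rather than action control; to address this, the authors propose AGRA, which aligns video diffusion features with semantic representations from a foundation visual encoder so that the action decoder attends to task-relevant interaction regions, improving object localization, affordance understanding, and robustness to perturbations in task-irrelevant regions, and making the robot policy more reliable both in and out of distribution.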

Source: arXiv: 2606.12217