arXiv submission date: 2025-12-10
📄 Abstract - HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

Vision-Language-Action (VLA) models have recently enabled robotic manipulation by grounding visual and linguistic cues into actions. However, most VLAs assume the Markov property, relying only on the current observation and thus suffering from temporal myopia that degrades long-horizon coherence. In this work, we view motion as a more compact and informative representation of temporal context and world dynamics, capturing inter-state changes while filtering static pixel-level noise. Building on this idea, we propose HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that leverages motion for bidirectional temporal reasoning. HiF-VLA encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable a "think-while-acting" paradigm for long-horizon manipulation. As a result, HiF-VLA surpasses strong baselines on the LIBERO-Long and CALVIN ABC-D benchmarks, while incurring negligible additional inference latency. Furthermore, HiF-VLA achieves substantial improvements in real-world long-horizon manipulation tasks, demonstrating its broad effectiveness in practical robotic settings.

Top-level tags: robotics · multi-modal · model training
Detailed tags: vision-language-action · temporal reasoning · motion representation · long-horizon tasks · robot manipulation

HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models


1️⃣ One-Sentence Summary

This paper proposes the HiF-VLA framework, which treats motion as a compact temporal representation and integrates hindsight, insight, and foresight for bidirectional temporal reasoning. This addresses the temporal myopia of existing vision-language-action models, which degrades long-horizon task coherence, and delivers strong performance on multiple benchmarks and in real-world tasks.


2️⃣ Key Contributions

1. Motion as a compact temporal representation

2. A bidirectional temporal reasoning framework (HiF-VLA)

3. A hindsight-modulated joint expert module

4. A joint action-and-motion prediction training objective
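The fourth contribution, a joint action-and-motion prediction objective, can be sketched in toy form. This is a minimal illustration under stated assumptions, not the paper's implementation: `motion_representation` (inter-frame change with static-pixel suppression, per the abstract's "capturing inter-state changes while filtering static pixel-level noise"), the noise cutoff `threshold`, and the loss weight `lam` are all hypothetical names and values chosen for the sketch.

```python
import numpy as np

def motion_representation(prev_frame, curr_frame, threshold=0.05):
    """Toy motion representation: inter-frame change with small
    (static, noise-level) differences suppressed. `threshold` is a
    hypothetical cutoff, not a value from the paper."""
    diff = curr_frame - prev_frame
    diff[np.abs(diff) < threshold] = 0.0  # filter static pixel-level noise
    return diff

def joint_objective(pred_actions, true_actions, pred_motion, true_motion, lam=0.5):
    """Toy joint objective: MSE on actions plus a weighted MSE on
    predicted future motion. `lam` is a hypothetical weighting term."""
    action_loss = np.mean((pred_actions - true_actions) ** 2)
    motion_loss = np.mean((pred_motion - true_motion) ** 2)
    return action_loss + lam * motion_loss

# Usage: a static pixel (delta 0.01) is filtered; moving pixels survive.
m = motion_representation(np.zeros(3), np.array([0.01, 0.5, -0.2]))
```

The idea the sketch captures is that supervising motion alongside actions gives the policy a training signal about world dynamics without feeding it full observation histories.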


3️⃣ Main Results and Value

Result highlights

- Surpasses strong baselines on the LIBERO-Long and CALVIN ABC-D benchmarks while incurring negligible additional inference latency.
- Achieves substantial improvements in real-world long-horizon manipulation tasks.

Practical value

- Negligible inference overhead and demonstrated real-world gains make the approach broadly applicable in practical robotic settings.


4️⃣ Glossary

Source: arXiv:2512.09928