IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance
1️⃣ One-Sentence Summary
This paper proposes IVRA, a lightweight method that requires no additional training: it leverages the spatial affinity information already present in the model's own vision encoder to strengthen the robot's geometric understanding of the visual scene, yielding consistent accuracy gains for the action policy across a range of robot manipulation tasks.
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: this http URL
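The abstract describes the mechanism only at a high level: affinity hints from the built-in vision encoder are injected into one language-model layer at inference time, with all parameters frozen. The sketch below illustrates that idea in Python under stated assumptions; the function names, the cosine-similarity affinity, the blending weight `alpha`, and the choice of intervention point are all hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of the core idea (hypothetical names and shapes, not the authors' code):
# take patch-level affinity hints from the model's own vision encoder and blend them
# into the attention over visual tokens at one chosen language-model layer, purely at
# inference time, with no weights updated.
import torch
import torch.nn.functional as F


def affinity_from_patches(patch_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity affinity over the encoder's patch features.

    patch_feats: (num_patches, dim) features from the built-in vision encoder.
    Returns a row-normalized (num_patches, num_patches) affinity matrix.
    """
    normed = F.normalize(patch_feats, dim=-1)
    affinity = normed @ normed.T                  # pairwise cosine similarity
    return torch.softmax(affinity / 0.1, dim=-1)  # sharpen and row-normalize


def inject_affinity(attn_probs: torch.Tensor,
                    affinity: torch.Tensor,
                    visual_idx: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Blend the affinity hint into the visual-to-visual block of one layer's
    attention map, leaving text tokens and all model parameters untouched.

    attn_probs: (heads, seq, seq) post-softmax attention at the chosen layer.
    visual_idx: indices of the visual tokens within the token sequence.
    alpha:      hint strength (hypothetical hyperparameter).
    """
    out = attn_probs.clone()
    vv = out[:, visual_idx][:, :, visual_idx]        # visual-to-visual block, (heads, V, V)
    blended = (1 - alpha) * vv + alpha * affinity    # affinity broadcasts over heads
    out[:, visual_idx.unsqueeze(-1), visual_idx] = blended
    # Re-normalize rows so attention still sums to 1 over the full sequence.
    return out / out.sum(dim=-1, keepdim=True)
```

In this reading, the intervention is applied only at the layer where instance-level features reside, which is why the rest of the forward pass and all parameters stay unchanged.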
Source: arXiv: 2601.16207