📄 Abstract - BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms: goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
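The abstract's dual-branch PMI objective can be made concrete with a small sketch. Below is a minimal PyTorch-style illustration, assuming continuous actions modeled as Gaussians: shared learnable latent action queries attend to vision-only context for the prior $p(a \mid v)$ and to vision-plus-language context for the posterior $\pi(a \mid v, \ell)$, and the training signal adds a bonus for actions whose log posterior exceeds the log prior (the conditional PMI). All module names, dimensions, and the loss weighting here are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of a dual-branch PMI objective (not the paper's code).
import torch
import torch.nn as nn

class DualBranchPolicy(nn.Module):
    def __init__(self, vis_dim=256, lang_dim=256, act_dim=7,
                 n_queries=8, hidden=256, pmi_weight=1.0):
        super().__init__()
        # Learnable latent action queries shared by both branches.
        self.action_queries = nn.Parameter(torch.randn(n_queries, hidden))
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.lang_proj = nn.Linear(lang_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Prior head models p(a | v); posterior head models pi(a | v, l).
        self.prior_head = nn.Linear(hidden, act_dim * 2)      # mean and log-std
        self.posterior_head = nn.Linear(hidden, act_dim * 2)  # mean and log-std
        self.pmi_weight = pmi_weight

    def _gaussian_logprob(self, params, action):
        mean, log_std = params.chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        return dist.log_prob(action).sum(-1)

    def forward(self, vis_tokens, lang_tokens, action):
        B = vis_tokens.shape[0]
        queries = self.action_queries.unsqueeze(0).expand(B, -1, -1)
        v = self.vis_proj(vis_tokens)
        l = self.lang_proj(lang_tokens)
        # Prior branch attends to vision only; posterior branch to vision + language.
        prior_ctx, _ = self.attn(queries, v, v)
        vl = torch.cat([v, l], dim=1)
        post_ctx, _ = self.attn(queries, vl, vl)
        log_prior = self._gaussian_logprob(self.prior_head(prior_ctx.mean(1)), action)
        log_post = self._gaussian_logprob(self.posterior_head(post_ctx.mean(1)), action)
        # Conditional PMI: log pi(a|v,l) - log p(a|v).
        # Maximizing it penalizes the vision shortcut and rewards language-grounded actions.
        pmi = log_post - log_prior
        loss = -(log_post + self.pmi_weight * pmi).mean()  # imitation term plus PMI bonus (weighting assumed)
        return loss, pmi

if __name__ == "__main__":
    policy = DualBranchPolicy()
    vis = torch.randn(2, 16, 256)    # dummy visual tokens
    lang = torch.randn(2, 8, 256)    # dummy language tokens
    act = torch.randn(2, 7)          # dummy expert action
    loss, pmi = policy(vis, lang, act)
    print(loss.item(), pmi.shape)
```

As a rough intuition for the objective, maximizing $\log \pi(a \mid v, \ell) - \log p(a \mid v)$ gives no credit for actions that the vision-only prior already predicts, so the policy only gains by producing actions that the language instruction specifically explains.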
BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
1️⃣ One-Sentence Summary
This paper proposes a new method called BayesianVLA, which introduces Bayesian decomposition and latent action queries to resolve the "information collapse" problem in existing robot vision-language-action models, where the policy ignores language and relies on vision alone when faced with new instructions or multi-task settings, thereby substantially improving instruction following and generalization to new scenarios.