Recursive Belief Vision Language Model (RB-VLA)
1️⃣ One-sentence summary
This paper proposes a new model called RB-VLA, which introduces a continuously updated internal "belief" state that remembers task history and object interactions. This substantially improves a robot's success rate and efficiency on multi-stage manipulation tasks under partial observability, while greatly reducing computational latency.
Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. Semantic reasoning alone is not the primary bottleneck in long-horizon manipulation. Instead, VLAs lack persistent, action-conditioned state representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once for high-level intent, the VLM provides task specification, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks, respectively, compared to π0. It also reduces inference latency by up to 5x relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show that the belief module is the primary driver of performance, increasing success rates from 32.5% to 77.5%. These results demonstrate the effectiveness of belief-based state representations for long-horizon VLA policies.
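The core idea, a fixed-size belief state updated recursively from observations and past actions, with the VLM queried only once for task intent, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the linear maps standing in for learned networks, and the `update_belief`/`policy` functions are all hypothetical, and the real model uses a diffusion policy rather than a closed-form action head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
OBS_DIM, ACT_DIM, BELIEF_DIM = 8, 4, 16

# Random linear maps stand in for the learned belief-update network.
W_b = rng.normal(size=(BELIEF_DIM, BELIEF_DIM)) * 0.1
W_o = rng.normal(size=(BELIEF_DIM, OBS_DIM)) * 0.1
W_a = rng.normal(size=(BELIEF_DIM, ACT_DIM)) * 0.1


def update_belief(belief, obs, prev_action):
    """Recurrent, action-conditioned update: b_t = f(b_{t-1}, o_t, a_{t-1}).

    The output has a fixed size, so memory does not grow with episode
    length and no raw observations are stored.
    """
    return np.tanh(W_b @ belief + W_o @ obs + W_a @ prev_action)


def policy(belief, intent):
    """Toy stand-in for the diffusion policy, conditioned on belief + intent."""
    return np.tanh(belief[:ACT_DIM] + intent[:ACT_DIM])


# One up-front "VLM query" yields a fixed task-intent embedding; the VLM
# is never called again inside the control loop.
intent = rng.normal(size=BELIEF_DIM)

belief = np.zeros(BELIEF_DIM)
action = np.zeros(ACT_DIM)
for _ in range(100):  # long horizon: state stays BELIEF_DIM-sized throughout
    obs = rng.normal(size=OBS_DIM)
    belief = update_belief(belief, obs, action)
    action = policy(belief, intent)

print(belief.shape)  # constant-size belief regardless of episode length
```

The key property the sketch demonstrates is the paper's latency/memory claim: because the belief compresses history into a constant-size vector and the expensive VLM call happens once, per-step cost and memory stay flat across timesteps, unlike approaches that re-query the VLM or keep a growing observation window.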
Source: arXiv: 2602.20659