From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges
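1️⃣ One-Sentence Summary
This paper proposes ResVLA, a new architecture that decomposes robot control signals into a low-frequency global intent and high-frequency local dynamics and generates only the local residual, addressing the inefficiency and weak alignment that conventional methods show when bridging high-level semantic understanding and low-level physical control.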
Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
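The spectral split described in the abstract can be illustrated with a minimal sketch. The abstract does not say which transform or cutoff ResVLA uses, so the real FFT, the function name `split_action_chunk`, and the `num_low_freq` parameter below are assumptions made purely for illustration. The sketch only shows how a target action chunk can be separated into a smooth anchor and a residual; in the method as described, the anchor is predicted by the model and the residual is produced by a diffusion bridge, neither of which is reproduced here.

```python
import numpy as np


def split_action_chunk(actions, num_low_freq=3):
    """Split a (T, D) action chunk into a low-frequency anchor and a residual.

    Hypothetical helper: the transform (real FFT) and the cutoff
    (num_low_freq) are illustrative assumptions, not taken from the paper.
    """
    actions = np.asarray(actions, dtype=float)
    horizon = actions.shape[0]

    # Frequency-domain view of each action dimension along the time axis.
    spectrum = np.fft.rfft(actions, axis=0)

    # Keep only the lowest frequency bins (including the DC term).
    low = np.zeros_like(spectrum)
    low[:num_low_freq] = spectrum[:num_low_freq]

    # Low-frequency anchor ("global intent") back in the time domain.
    anchor = np.fft.irfft(low, n=horizon, axis=0)

    # Everything the anchor does not explain ("local dynamics").
    residual = actions - anchor
    return anchor, residual


# Toy trajectory: one slow motion component plus a small fast oscillation.
t = np.linspace(0.0, 1.0, 32, endpoint=False)
slow = np.sin(2 * np.pi * t)
fast = 0.1 * np.sin(2 * np.pi * 10 * t)
anchor, residual = split_action_chunk((slow + fast)[:, None], num_low_freq=3)
```

In this toy case the slow component lands in the anchor and the fast oscillation in the residual, and the two always sum back to the original chunk by construction.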
Source: arXiv: 2604.21391