DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
1️⃣ One-Sentence Summary
This paper proposes a new method called DualVLA, which uses careful data pruning and a dual-teacher distillation strategy to solve the problem of degraded action performance that arises when generalist vision-language-action models are enhanced with reasoning ability, achieving more precise action execution while preserving strong reasoning capability.
To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: this https URL.
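The dual-teacher adaptive distillation described above assigns different supervision signals to different data domains. The abstract does not give implementation details, but the core idea of per-domain teacher selection can be illustrated with a minimal sketch; the function names, the boolean domain flag, and the use of a KL-divergence objective here are all illustrative assumptions, not the paper's actual implementation:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dual_teacher_loss(student_logits, action_teacher_logits,
                      reasoning_teacher_logits, is_robot_domain):
    """Hypothetical per-sample teacher selection: robot-action samples
    are supervised by the action teacher, multimodal samples by the
    reasoning teacher. Returns the mean distillation loss."""
    total = 0.0
    for s, a, r, robot in zip(student_logits, action_teacher_logits,
                              reasoning_teacher_logits, is_robot_domain):
        teacher = a if robot else r  # choose supervision by data domain
        total += kl_div(softmax(teacher), softmax(s))
    return total / len(student_logits)
```

Under this reading, each training sample is routed to exactly one teacher, so action data never receives reasoning-style supervision and vice versa, which matches the abstract's goal of strengthening action generation without eroding reasoning ability.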