Task Adaptation of a Vision-Language-Action Model: 1st Place Solution for the 2025 BEHAVIOR Challenge
1️⃣ One-sentence summary
This paper presents the agent policy that won a challenge on complex simulated household tasks: it introduces correlated noise to generate smooth actions, uses a learnable attention mechanism to resolve task ambiguity, and optimizes both training and inference, achieving strong results across 50 diverse tasks.
We present a vision-language-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation that require bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both the public and private leaderboards.
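The idea of correlated noise for flow matching can be illustrated with a minimal sketch: instead of drawing independent Gaussian noise per timestep of an action chunk, the noise is correlated across time, so the interpolated samples (and hence generated action sequences) vary smoothly. The AR(1) correlation structure, the function names, and the `rho` parameter below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def correlated_noise(horizon, dim, rho=0.9, rng=None):
    """Temporally correlated Gaussian noise over an action chunk.

    Assumption: an AR(1) process along the time axis, which keeps each
    step marginally N(0, 1) but correlates adjacent steps with factor rho.
    """
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal((horizon, dim))
    noise = np.empty_like(eps)
    noise[0] = eps[0]
    for t in range(1, horizon):
        noise[t] = rho * noise[t - 1] + np.sqrt(1.0 - rho**2) * eps[t]
    return noise

def flow_matching_pair(actions, t, rho=0.9, rng=None):
    """Build one flow-matching training pair for an action chunk.

    Linear (rectified-flow-style) interpolation between correlated noise
    and the target actions; the regression target is the constant
    velocity (actions - noise).
    """
    noise = correlated_noise(*actions.shape, rho=rho, rng=rng)
    x_t = (1.0 - t) * noise + t * actions
    velocity_target = actions - noise
    return x_t, velocity_target
```

Because adjacent noise steps are correlated, consecutive actions in a sampled chunk start from similar noise values, which is also what makes correlation-aware inpainting of overlapping chunks plausible: the shared structure lets a new chunk be conditioned on the tail of the previous one without jumps.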
Source: arXiv:2512.06951