Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models
1️⃣ One-Sentence Summary
This paper proposes a new method called DILLO, which trains a fast language model to predict the semantic outcomes of an agent's actions, bypassing time-consuming visual simulation. It speeds up decision-making by 14x while preserving safety and significantly improves task success rates.
Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate-then-act" to "describe-then-act." DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.
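The "describe-then-act" pipeline in the abstract can be sketched as a toy loop. All names below are illustrative inventions, not the paper's actual code: the privileged VLM teacher is reduced to a fixed annotation table and the LLM student to a lookup, but the data flow matches the description — teacher text is distilled into the student, which then steers the policy through a text-only inference path with no visual simulation.

```python
# Toy sketch of DILLO-style "describe-then-act" steering (hypothetical names).

# Offline trajectories annotated by the privileged VLM teacher (stand-in data).
TEACHER_ANNOTATIONS = {
    ("z1", "push_left"):  "gripper drops the block",   # failure outcome
    ("z1", "push_right"): "block slides onto target",  # success outcome
}

class StudentWorldModel:
    """Stand-in for the latent-conditioned LLM student."""
    def __init__(self):
        self.table = {}

    def distill(self, latent, action, description):
        # Cross-modal distillation step: absorb the teacher's semantic label.
        self.table[(latent, action)] = description

    def describe(self, latent, action):
        # Text-only inference path: predict the next-state description.
        return self.table.get((latent, action), "unknown outcome")

def steer(student, latent, candidates, unsafe=("drops", "collides")):
    """Return the first candidate action whose predicted outcome reads safe."""
    for action in candidates:
        desc = student.describe(latent, action)
        if not any(word in desc for word in unsafe):
            return action, desc
    return None, "all candidates predicted unsafe"

# Distill the teacher annotations, then steer at inference time.
student = StudentWorldModel()
for (latent, action), text in TEACHER_ANNOTATIONS.items():
    student.distill(latent, action, text)

action, outcome = steer(student, "z1", ["push_left", "push_right"])
print(action, "->", outcome)  # push_right -> block slides onto target
```

A real student would condition a language model on the policy's latent state and planned action chunk rather than matching exact keys, but the control flow — predict a semantic description, veto actions whose description signals failure — is the part this sketch illustrates.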
Source: arXiv:2603.23149