Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving
1️⃣ One-sentence summary
This paper proposes a new method called Causal Scene Narration, which reorganizes the textual instructions given to autonomous-driving models so that driving intent and environmental constraints are clearly distinguished. Combined with runtime safety supervision, it significantly improves the overall performance and safety of the autonomous driving system.
Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception noise ablation confirms that CSN's benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.
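The abstract contrasts semantic safety supervision with reactive Time-To-Collision (TTC) monitoring, which the paper reports as degrading performance because it fires regardless of driving intent. A minimal sketch of such a reactive monitor is below; all function names and the threshold value are illustrative assumptions, not details from the paper:

```python
def time_to_collision(gap_m: float, closing_speed_mps: float) -> float:
    """TTC = gap / closing speed; infinite if the gap is not closing.

    gap_m: distance to the lead object in meters.
    closing_speed_mps: rate at which the gap shrinks (m/s); <= 0 means
    the object is stationary relative to us or pulling away.
    """
    if closing_speed_mps <= 0.0:
        return float("inf")
    return gap_m / closing_speed_mps


def reactive_brake(gap_m: float, closing_speed_mps: float,
                   ttc_threshold_s: float = 2.0) -> bool:
    """Trigger an emergency intervention whenever TTC drops below a fixed
    threshold, with no awareness of the planner's intent (e.g. a deliberate
    close-following overtake) -- the failure mode the paper attributes to
    purely reactive monitoring."""
    return time_to_collision(gap_m, closing_speed_mps) < ttc_threshold_s
```

An intent-aware monitor, by contrast, would condition the intervention on the current maneuver rather than on a fixed TTC cutoff alone.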
Source: arXiv: 2604.01723