arXiv submission date: 2026-03-11
📄 Abstract - DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
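As a rough illustration of the inference order the abstract describes, the sketch below shows a Dynamics-CoT-style generation step: a small set of dynamics tokens is emitted first, decoupled into ego-centric and environment-centric groups, and action generation is then conditioned on those tokens rather than on dense image predictions. All names, token counts, and the callback structure here are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of the Dynamics-CoT inference order: compact dynamics
# tokens are generated before the action. Token counts are illustrative.

NUM_EGO_TOKENS = 4   # assumed size of the ego-centric dynamics summary
NUM_ENV_TOKENS = 8   # assumed size of the environment-centric summary

def generate_with_dynamics_cot(emit_token, emit_action):
    """Run one Dynamics-CoT step: dynamics tokens first, then the action.

    emit_token(kind, i) -> token   stands in for autoregressive decoding.
    emit_action(dynamics) -> action conditions the action on the tokens.
    """
    # Decoupled dynamics: ego-centric tokens, then environment-centric tokens.
    ego = [emit_token("ego", i) for i in range(NUM_EGO_TOKENS)]
    env = [emit_token("env", i) for i in range(NUM_ENV_TOKENS)]
    # The action head sees the compact forecast dynamics, which is what keeps
    # inference latency low relative to dense Visual CoT image prediction.
    return emit_action(ego + env)

# Toy stand-ins for the decoder and action head.
trace = []
def toy_token(kind, i):
    tok = f"<{kind}_{i}>"
    trace.append(tok)
    return tok

action = generate_with_dynamics_cot(toy_token, lambda dyn: {"tokens_seen": len(dyn)})
print(action)  # {'tokens_seen': 12}
```

The point of the sketch is only the ordering constraint: the action is a function of the already-generated dynamics tokens, so the "reasoning" happens in a compact token space rather than in text or pixels.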

Top-level tags: agents computer vision multi-modal
Detailed tags: autonomous driving world dynamics chain-of-thought action reasoning vision-language-action

DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving


1️⃣ One-Sentence Summary

This paper proposes DynVLA, an autonomous-driving model that forecasts a compact representation of future world dynamics to inform its decisions, making it more efficient and accurate than methods that rely on textual reasoning alone or on dense image prediction.

Source: arXiv 2603.11041