arXiv submission date: 2026-01-14
📄 Abstract - Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
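To make the abstract's idea more concrete, below is a minimal conceptual sketch (not the paper's actual implementation) of what "latent CoT distillation plus preference-guided alignment for policy learning" could look like. All module names, dimensions, loss weights, and the use of PyTorch are illustrative assumptions; the sketch only shows the general shape of the training objective: compress reasoning into a few latent tokens, align them with a teacher's plan representation, add a preference margin over trajectories, and supervise the action head.

```python
# Hypothetical sketch of latent-plan distillation + preference alignment (PyTorch).
# Names, dimensions, and weights are assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPlanner(nn.Module):
    """Student that compresses an observation/instruction embedding into a few
    latent planning tokens instead of a long textual chain of thought."""
    def __init__(self, obs_dim=512, latent_dim=256, num_latents=8, action_dim=7):
        super().__init__()
        self.encode = nn.Linear(obs_dim, latent_dim * num_latents)
        self.num_latents, self.latent_dim = num_latents, latent_dim
        # Policy head maps pooled latents to a continuous action (e.g., end-effector deltas).
        self.policy = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                    nn.Linear(256, action_dim))

    def forward(self, obs_emb):
        latents = self.encode(obs_emb).view(-1, self.num_latents, self.latent_dim)
        action = self.policy(latents.mean(dim=1))
        return latents, action

def training_step(student, obs_emb, teacher_latents, pref_score, rej_score,
                  action_target, w_distill=1.0, w_pref=0.1, w_act=1.0):
    """One hypothetical objective combining (i) distillation of compact latent
    plans from a teacher, (ii) a preference margin favoring preferred over
    rejected trajectories, and (iii) behavior cloning on demonstrated actions."""
    latents, action = student(obs_emb)
    # (i) align student latent tokens with teacher-provided plan representations.
    loss_distill = F.mse_loss(latents, teacher_latents)
    # (ii) hinge-style preference term: preferred trajectory should score higher.
    loss_pref = F.relu(1.0 - (pref_score - rej_score)).mean()
    # (iii) imitation of demonstrated actions.
    loss_act = F.mse_loss(action, action_target)
    return w_distill * loss_distill + w_pref * loss_pref + w_act * loss_act

# Tiny smoke test with random tensors.
if __name__ == "__main__":
    student = LatentPlanner()
    obs = torch.randn(4, 512)
    teacher = torch.randn(4, 8, 256)
    loss = training_step(student, obs, teacher,
                         pref_score=torch.randn(4), rej_score=torch.randn(4),
                         action_target=torch.randn(4, 7))
    loss.backward()
    print(float(loss))
```

The key efficiency lever this sketch tries to illustrate is that only a small, fixed number of latent tokens are produced at inference time, instead of decoding a long explicit reasoning trace, which is how a reduction in latency would be achieved under these assumptions.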

Top-level tags: agents multi-modal model training
Detailed tags: vision-language-action latent planning chain-of-thought embodied ai efficient inference

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning


1️⃣ One-sentence summary

This paper proposes a new method called Fast-ThinkAct, which teaches the model an efficient, internal way of "thinking": agents such as robots keep their ability to plan complex tasks while the time needed to reach a decision drops substantially, making them faster to respond and more practical to deploy.

Source: arXiv 2601.09708