
arXiv submission date: 2026-04-27
📄 Abstract - CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $\pi_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4\%, and achieves the best average real-robot success rate of 83.0\%, outperforming MIP by 19.5 points and $\pi_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at this https URL.
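The two-stage sampling pattern described in the abstract (a coarse step that turns Gaussian noise into an action-aware initialization, followed by one fixed-time refinement step, for NFE=2 in total) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the learned networks are replaced by placeholder arithmetic, and all function and variable names (`coarse_init`, `fine_refine`, `obs`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 7  # hypothetical action dimensionality

def coarse_init(noise, obs):
    """Stand-in for the coarse stage: estimate an 'endpoint velocity' from the
    conditioning and take one step from Gaussian noise toward a structured,
    action-aware starting point. The real model learns this mapping."""
    v_hat = 0.5 * obs - noise      # placeholder endpoint-velocity estimate
    return noise + v_hat           # one step toward the predicted endpoint

def fine_refine(x_coarse, obs):
    """Stand-in for the fine stage: a single fixed-time refinement that
    corrects residual errors in the coarse initialization."""
    residual = 0.1 * (obs - x_coarse)  # placeholder learned correction
    return x_coarse + residual

obs = rng.standard_normal(ACTION_DIM)    # stand-in conditioning features
noise = rng.standard_normal(ACTION_DIM)  # uninformative Gaussian start
action = fine_refine(coarse_init(noise, obs), obs)  # two function evaluations
print(action.shape)
```

The point of the structure is that only two network evaluations are needed end to end, versus the multi-step denoising trajectory of a standard flow-based policy.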

Top-level tags: multi-modal, robotics, machine learning
Detailed tags: vision-language-action, action generation, coarse-to-fine, efficient inference, flow-based policy

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies


1️⃣ One-Sentence Summary

This paper proposes CF-VLA, a two-stage action-generation framework that first quickly produces a coarse, action-aware initialization and then applies a single-step fine refinement. This substantially improves both the efficiency and the quality of robot action generation: across several benchmarks it reduces action sampling latency by more than 75% relative to existing methods while achieving higher success rates.

Source: arXiv 2604.24622