FASTER: Value-Guided Sampling for Fast RL
1️⃣ One-sentence summary
This paper proposes FASTER, a method that models the process of sampling multiple actions from a diffusion policy and selecting the best one as a Markov Decision Process, and learns to predict and filter out low-value candidate actions early in the denoising process, substantially reducing training and inference compute without sacrificing performance.
Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at this https URL.
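The core idea — scoring partially denoised candidates with a learned value function and pruning low-value ones before denoising finishes, rather than fully denoising all candidates and picking the best at the end — can be sketched as follows. This is a minimal illustration only: `denoise_step`, `value_fn`, the halving schedule, and all dimensions are placeholder assumptions, not the paper's actual networks or filtering policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(candidates, t):
    # Placeholder for one reverse-diffusion step; a real diffusion
    # policy would apply its learned denoiser here (assumption).
    return candidates + 0.1 * rng.standard_normal(candidates.shape)

def value_fn(candidates, t):
    # Placeholder for the learned value function that scores
    # partially denoised action candidates (assumption).
    return -np.linalg.norm(candidates, axis=-1)

def value_guided_sampling(num_candidates=16, action_dim=4,
                          num_steps=8, keep_frac=0.5):
    """Progressively filter low-value candidates during denoising,
    so most candidates are dropped before denoising completes."""
    candidates = rng.standard_normal((num_candidates, action_dim))
    for t in range(num_steps):
        candidates = denoise_step(candidates, t)
        if candidates.shape[0] > 1:
            values = value_fn(candidates, t)
            keep = max(1, int(np.ceil(candidates.shape[0] * keep_frac)))
            # Keep only the top-valued fraction of candidates.
            candidates = candidates[np.argsort(values)[-keep:]]
    # Return the highest-value surviving action.
    return candidates[np.argmax(value_fn(candidates, num_steps))]

action = value_guided_sampling()
print(action.shape)  # (4,)
```

With a halving schedule like this, the total denoiser compute approaches that of a single sample plus a small overhead, compared with fully denoising all 16 candidates in plain best-of-N sampling.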
Source: arXiv:2604.19730