arXiv submission date: 2026-02-19
📄 Abstract - SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score-Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
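
The first-order condition the abstract refers to pairs the policy's score, ∇_a log π(a|s), with the action-gradient of the Q-function, ∇_a Q(s,a); for a Boltzmann policy π(a|s) ∝ exp(Q(s,a)/α) these agree up to the temperature α. Below is a minimal PyTorch-style sketch of such a regularizer, assuming a `q_net` callable on (states, actions) and a `policy` exposing a differentiable `log_prob`; the interfaces, the squared-error penalty form, and the temperature `alpha` are assumptions for illustration, not details taken from the paper.

```python
import torch

def score_matching_penalty(q_net, policy, states, actions, alpha=0.2):
    """Penalize mismatch between the action-gradient of Q and the policy's score.

    A minimal sketch of the first-order regularizer described in the abstract.
    The interfaces (`q_net(states, actions)`, `policy.log_prob(states, actions)`),
    the squared-error form, and the temperature `alpha` are assumptions.
    """
    actions = actions.clone().requires_grad_(True)

    # Action-gradient of the Q-function: d/da Q(s, a)
    q_values = q_net(states, actions)
    dq_da = torch.autograd.grad(q_values.sum(), actions, create_graph=True)[0]

    # Score of the policy: d/da log pi(a | s)
    log_prob = policy.log_prob(states, actions)
    score = torch.autograd.grad(log_prob.sum(), actions, create_graph=True)[0]

    # For a Boltzmann policy pi(a|s) ∝ exp(Q(s,a) / alpha), the two gradients
    # satisfy grad_a log pi = grad_a Q / alpha; penalize deviations from that.
    return ((dq_da / alpha - score) ** 2).sum(dim=-1).mean()
```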

Top-level tags: reinforcement learning, model training, theory
Detailed tags: offline rl, online fine-tuning, actor-critic, robust transfer, gradient regularization

SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer


1️⃣ One-Sentence Summary

This paper proposes a new offline reinforcement learning method called SMAC, which imposes a particular constraint on the Q-function during training so that the agent's performance does not suddenly drop when it switches from learning on offline data to learning online, enabling smooth and efficient policy transfer.

Source: arXiv: 2602.17632