Conditional Sequence Modeling for Safe Reinforcement Learning
1️⃣ One-Sentence Summary
This paper proposes a new method called RCDT that, trained only on a fixed offline dataset, learns a single policy that flexibly adapts to different safety cost limits, achieving better performance while maintaining safety.
Offline safe reinforcement learning (RL) aims to learn policies from a fixed dataset while maximizing performance under cumulative cost constraints. In practice, deployment requirements often vary across scenarios, necessitating a single policy that can adapt zero-shot to different cost thresholds. However, most existing offline safe RL methods are trained under a pre-specified threshold, yielding policies with limited generalization and deployment flexibility across cost thresholds. Motivated by recent progress in conditional sequence modeling (CSM), which enables flexible goal-conditioned control by specifying target returns, we propose RCDT, a CSM-based method that supports zero-shot deployment across multiple cost thresholds within a single trained policy. RCDT is the first CSM-based offline safe RL algorithm that integrates a Lagrangian-style cost penalty with an auto-adaptive penalty coefficient. To avoid overly conservative behavior and achieve a more favorable return--cost trade-off, a reward--cost-aware trajectory reweighting mechanism and Q-value regularization are further incorporated. Extensive experiments on the DSRL benchmark demonstrate that RCDT consistently improves return--cost trade-offs over representative baselines, advancing the state-of-the-art in offline safe RL.
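The abstract mentions a Lagrangian-style cost penalty with an auto-adaptive penalty coefficient. The snippet below is a minimal sketch (not the authors' code) of how such an adaptive dual variable is typically maintained in safe RL: the coefficient grows when the expected cost exceeds the threshold and shrinks otherwise. All names (`log_lambda`, `cost_threshold`, `penalized_loss`) are illustrative assumptions, and the exact integration with RCDT's sequence-modeling loss may differ.

```python
import torch

# Illustrative sketch of a Lagrangian-style cost penalty with an
# auto-adaptive coefficient, a common construction in safe RL.
cost_threshold = 10.0  # assumed cumulative-cost budget for the deployment scenario

# Log-parameterize the multiplier so that lambda = exp(log_lambda) stays positive.
log_lambda = torch.zeros(1, requires_grad=True)
lambda_opt = torch.optim.Adam([log_lambda], lr=1e-3)

def penalized_loss(task_loss: torch.Tensor, expected_cost: torch.Tensor) -> torch.Tensor:
    """Policy objective plus lambda * (cost - threshold); lambda is held fixed here."""
    lam = log_lambda.exp().detach()
    return task_loss + lam * (expected_cost - cost_threshold)

def update_lambda(expected_cost: torch.Tensor) -> None:
    """Dual ascent: increase lambda when the cost constraint is violated."""
    lam = log_lambda.exp()
    dual_loss = -lam * (expected_cost.detach() - cost_threshold)
    lambda_opt.zero_grad()
    dual_loss.backward()
    lambda_opt.step()
```

In a typical training loop, `penalized_loss` drives the policy update and `update_lambda` is called afterwards on the same cost estimate; the log-parameterization avoids projecting the multiplier back to the nonnegative orthant.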
Source: arXiv: 2602.08584