Robust Probabilistic Shielding for Safe Offline Reinforcement Learning
1️⃣ One-Sentence Summary
This paper proposes a technique called "probabilistic shielding" that combines safe policy improvement with action-space restriction, so that offline reinforcement learning, using only a fixed dataset, can guarantee the safety and performance of the learned policy with high probability. The benefit is most pronounced when data is scarce.
In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
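To make the shielding idea concrete, here is a minimal tabular sketch. It assumes the dataset is a collection of (state, action, next_state) transitions and that safe/unsafe state labels are known, as the abstract states. The function name `build_probabilistic_shield`, the parameters `delta` and `conf`, and the Hoeffding-style confidence bound are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np
from collections import defaultdict

def build_probabilistic_shield(dataset, unsafe_states, delta=0.1, conf=0.05):
    """For each (state, action) pair seen in the dataset, compute an upper
    confidence bound on the probability of transitioning into an unsafe
    state, and permit only actions whose bound stays below `delta`.

    dataset: iterable of (state, action, next_state) transitions.
    unsafe_states: set of states labeled unsafe.
    delta: maximum tolerated (upper-bounded) unsafe-transition probability.
    conf: confidence parameter for the Hoeffding-style bound.
    """
    counts = defaultdict(int)         # visits of (s, a)
    unsafe_counts = defaultdict(int)  # transitions from (s, a) into unsafe states

    for s, a, s_next in dataset:
        counts[(s, a)] += 1
        if s_next in unsafe_states:
            unsafe_counts[(s, a)] += 1

    allowed = defaultdict(set)  # state -> actions the shield permits
    for (s, a), n in counts.items():
        p_hat = unsafe_counts[(s, a)] / n
        # One-sided Hoeffding bound: the true unsafe-transition probability
        # exceeds `bound` with probability at most `conf`.
        bound = p_hat + np.sqrt(np.log(1.0 / conf) / (2.0 * n))
        if bound <= delta:
            allowed[s].add(a)
    return allowed
```

In the safe policy improvement step, the improved policy would then only place probability mass on `allowed[s]` at each state `s`, falling back to the baseline policy wherever the shield permits no action. This is how restricting the action space during improvement yields, with high probability, a policy that is both safe and at least as good as the baseline.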
Source: arXiv: 2605.10293