Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins
1️⃣ One-Sentence Summary
This paper proposes a new algorithm, FlexDOME, which is the first in online safe reinforcement learning to simultaneously achieve near-constant strong constraint violation, sublinear strong reward regret, and last-iterate convergence, resolving the inherent tension in existing methods between constraint violation and convergence stability.
We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety margin is rigorously scheduled to asymptotically majorize the functional decay rates of the optimization and statistical errors, thereby clamping cumulative violations to a near-constant level. Furthermore, we establish non-asymptotic last-iterate convergence guarantees via a policy-dual Lyapunov argument. Experiments corroborate our theoretical findings.
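The summary does not spell out FlexDOME's update rules, so the following is a minimal sketch, assuming a standard regularized primal-dual loop with a decaying safety margin, run on a toy one-state CMDP (a constrained bandit). All names, step sizes, and the margin schedule `eps_t` are illustrative assumptions rather than the paper's exact choices.

```python
# Minimal sketch (not the paper's exact algorithm): a regularized primal-dual
# update with a decaying safety margin on a toy one-state CMDP ("constrained
# bandit"). Step sizes and the schedule eps_t = 1/sqrt(t+1) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
r = rng.uniform(0.0, 1.0, n_actions)       # unknown true rewards
c = rng.uniform(0.0, 1.0, n_actions)       # unknown true costs
budget = 0.5                                # constraint: E[cost] <= budget

T = 5000
eta_pi, eta_lam, tau = 0.1, 0.05, 0.01      # primal/dual step sizes, dual regularization
pi = np.full(n_actions, 1.0 / n_actions)    # policy (distribution over actions)
lam = 0.0                                   # dual variable

r_hat = np.zeros(n_actions)                 # empirical reward/cost estimates
c_hat = np.zeros(n_actions)
counts = np.zeros(n_actions)

for t in range(T):
    # Decaying safety margin: be conservative early, relax over time.
    eps_t = 1.0 / np.sqrt(t + 1)

    # Play one action, observe noisy reward/cost, update empirical means.
    a = rng.choice(n_actions, p=pi)
    counts[a] += 1
    r_hat[a] += (r[a] + 0.1 * rng.standard_normal() - r_hat[a]) / counts[a]
    c_hat[a] += (c[a] + 0.1 * rng.standard_normal() - c_hat[a]) / counts[a]

    # Primal step: exponentiated-gradient ascent on the Lagrangian r_hat - lam * c_hat.
    pi = pi * np.exp(eta_pi * (r_hat - lam * c_hat))
    pi /= pi.sum()

    # Dual step: ascend on the *tightened* constraint (budget - eps_t),
    # with a regularization term -tau * lam to damp primal-dual oscillations.
    grad_lam = pi @ c_hat - (budget - eps_t) - tau * lam
    lam = max(0.0, lam + eta_lam * grad_lam)

print("final policy:", np.round(pi, 3))
print("expected cost:", round(float(pi @ c), 3), "budget:", budget)
```

The point of the decaying margin in this sketch is that the dual update treats the effective budget as `budget - eps_t`: early rounds are deliberately conservative, and the margin shrinks slowly enough to dominate the optimization and statistical errors (keeping cumulative violation near-constant) yet fast enough that the extra conservatism costs only lower-order reward regret.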
Source: arXiv: 2602.10917