Policy Gradient for Continuous-Time Robust Markov Decision Processes
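1️⃣ One-sentence summary
This paper is the first to extend policy gradient algorithms to continuous-time robust Markov decision processes: it designs double-loop optimisers and mean-field optimisers that achieve linear convergence in the oracle-based setting and high sample efficiency in the sample-based setting, and it validates the methods on continuous-time problems with neural-network dynamics.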
The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transition dynamics. Traditional RMDPs consider discrete-time dynamics, and sample-efficient policy gradient algorithms have recently been studied in this context. This paper investigates policy gradient algorithms within a continuous-time RMDP framework. Policy gradients and adversarial gradients are derived using pathwise and adjoint-based formulas for stochastic and ordinary differential equations. We propose double-loop optimisers that obtain linear convergence in the oracle-based setting and an $\tilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ sample complexity in the sample-based setting; the analysis also derives novel tools for the framework of undiscounted total cost MDPs. Additionally, we propose mean-field optimisers as distributional optimisers with an $\tilde{\mathcal{O}}(\frac{1}{K})$ oracle-based convergence rate and an $\tilde{\mathcal{O}}(\frac{N^2}{\epsilon})$ sample complexity under $N$-particle approximation. The effectiveness of continuous-time policy gradient algorithms is confirmed for both optimisers on continuous-time RMDPs with neural ordinary differential equation dynamics.
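To make the double-loop idea concrete, here is a minimal sketch on a toy scalar problem; it is not the paper's algorithm. The assumptions are all illustrative: dynamics dx/dt = (a - theta) x with a linear feedback gain `theta` as the policy, an adversary choosing `a` in an interval `[a_lo, a_hi]`, a running cost (1 + r theta^2) x^2, Euler integration, and forward sensitivities standing in for the pathwise gradient. The names `cost_and_grads` and `double_loop`, the step sizes, and the horizon are hypothetical.

```python
def cost_and_grads(theta, a, r=0.1, x0=1.0, T=5.0, dt=0.01):
    """Euler-integrate dx/dt = (a - theta) * x together with its sensitivities.

    Returns the discretised total cost J = sum (1 + r*theta^2) * x^2 * dt and
    its exact derivatives dJ/dtheta and dJ/da (pathwise, via forward
    sensitivities s_theta = dx/dtheta and s_a = dx/da).
    """
    x, s_theta, s_a = x0, 0.0, 0.0
    J = dJ_dtheta = dJ_da = 0.0
    w = 1.0 + r * theta**2  # weight of the running cost
    for _ in range(int(round(T / dt))):
        # Accumulate the cost and its derivatives at the current state.
        J += w * x * x * dt
        dJ_dtheta += (2.0 * r * theta * x * x + 2.0 * w * x * s_theta) * dt
        dJ_da += 2.0 * w * x * s_a * dt
        # Euler step for the state and both sensitivities (old values on the right).
        lam = a - theta
        x, s_theta, s_a = (
            x + lam * x * dt,
            s_theta + (lam * s_theta - x) * dt,
            s_a + (lam * s_a + x) * dt,
        )
    return J, dJ_dtheta, dJ_da


def double_loop(theta=2.0, a_lo=0.5, a_hi=1.0, outer_steps=300,
                inner_steps=3, lr_policy=1.0, lr_adv=0.5):
    """Inner loop: projected gradient ascent for the adversary.
    Outer loop: gradient descent for the policy against that adversary."""
    a = a_lo
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            _, _, g_a = cost_and_grads(theta, a)
            a = min(a_hi, max(a_lo, a + lr_adv * g_a))  # project onto [a_lo, a_hi]
        _, g_theta, _ = cost_and_grads(theta, a)
        theta -= lr_policy * g_theta
    return theta, a


if __name__ == "__main__":
    theta, a = double_loop()
    print(f"robust gain theta = {theta:.3f}, worst-case a = {a:.3f}")
```

In this toy problem the cost increases with `a`, so the inner loop pushes the adversary to the upper end of its interval and the outer loop then tunes the gain against that worst case. An adjoint-based gradient, as mentioned in the abstract, would replace the forward sensitivities when the parameter dimension is large.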
Source: arXiv: 2606.04335