Policy Gradient Methods for Non-Markovian Reinforcement Learning
1️⃣ One-Sentence Summary
This paper proposes a new algorithm, ASMPG, that tackles reinforcement learning in non-Markovian environments, where observations and rewards depend entirely on the interaction history, by jointly optimizing the agent's internal state representation and its control policy. The algorithm's convergence is established theoretically, and experiments show it outperforms traditional baselines based on predictive state representations.
We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn them via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, each comprising agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.
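To make the ASM construction concrete, below is a minimal PyTorch sketch of an agent-state policy trained by a reward-only gradient. Everything in it is an assumption for illustration: the names (`ASMPolicy`, `asmpg_update`), the GRU-cell agent state dynamics, the discrete action space, and the plain REINFORCE-style estimator. It is not the paper's actual ASMPG algorithm, which this summary does not reproduce.

```python
# Minimal sketch of an ASM policy: recursive agent state dynamics plus a
# control policy head, trained jointly on a reward-only objective.
# All names and architectural choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASMPolicy(nn.Module):
    """Agent state dynamics plus a control policy over the agent state.

    The agent state z_t is updated recursively from the latest observation
    and the previous action; the policy maps z_t to action logits. Both
    parts share the same reward-driven training objective.
    """
    def __init__(self, obs_dim: int, n_actions: int, state_dim: int = 32):
        super().__init__()
        self.dynamics = nn.GRUCell(obs_dim + n_actions, state_dim)  # agent state update
        self.policy = nn.Linear(state_dim, n_actions)               # control policy head
        self.n_actions = n_actions
        self.state_dim = state_dim

    def initial_state(self) -> torch.Tensor:
        return torch.zeros(1, self.state_dim)

    def step(self, z, obs, prev_action):
        a_onehot = F.one_hot(prev_action, self.n_actions).float()
        z_next = self.dynamics(torch.cat([obs, a_onehot], dim=-1), z)
        return z_next, torch.distributions.Categorical(logits=self.policy(z_next))

def asmpg_update(agent, episodes, optimizer, gamma=0.99):
    """One REINFORCE-style step. Backpropagating through the recursive agent
    state gives the dynamics parameters a reward-driven gradient, which is
    the joint optimization the abstract describes (here without the paper's
    variance reduction or convergence machinery)."""
    loss = torch.zeros(())
    for obs_seq, act_seq, rew_seq in episodes:        # one rollout per tuple
        # Discounted returns-to-go: G_t = r_t + gamma * G_{t+1}.
        returns, g = [], 0.0
        for r in reversed(rew_seq):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        z = agent.initial_state()
        prev_a = torch.zeros(1, dtype=torch.long)     # dummy "previous action" at t=0
        for obs, a, g in zip(obs_seq, act_seq, returns):
            z, dist = agent.step(z, obs.unsqueeze(0), prev_a)
            loss = loss - dist.log_prob(a).squeeze() * g   # policy gradient term
            prev_a = a.view(1)
    optimizer.zero_grad()
    (loss / len(episodes)).backward()                 # gradient also flows through dynamics
    optimizer.step()
```

Under these assumptions, usage would look like `agent = ASMPolicy(obs_dim=4, n_actions=2)` with `optimizer = torch.optim.Adam(agent.parameters(), lr=1e-3)`, calling `asmpg_update` on batches of collected rollouts. The key point the sketch illustrates is that the loss depends on the agent state only through the policy's log-probabilities, so the state representation is shaped by reward rather than by a predictive objective.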
Source: arXiv:2605.10816