arXiv submission date: 2026-02-18
📄 Abstract - Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes

The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in learning rates tied to state visit counts, which practitioners do not use and does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
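For concreteness, below is a minimal sketch of tabular on-policy differential TD(0), the n = 1 case of the algorithm family the abstract refers to, using a single global diminishing step size rather than a per-state "local clock". The toy two-state MDP, the uniform policy, the step-size exponent 0.75, and the reward-rate scale `eta` are illustrative assumptions, not details taken from the paper.

```python
# Sketch: tabular on-policy differential TD(0) with a global diminishing
# step size (no per-state visit-count "local clock").
# The MDP, policy, and hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP: P[s, a, s'] transition probabilities, R[s, a] rewards
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
policy = np.array([[0.5, 0.5],   # pi(a | s): uniform over both actions
                   [0.5, 0.5]])

V = np.zeros(2)   # differential value-function estimate
r_bar = 0.0       # average-reward estimate
eta = 0.1         # relative step size for the reward-rate update (assumed)

s = 0
for t in range(100_000):
    alpha = 1.0 / (t + 1) ** 0.75          # global diminishing step size
    a = rng.choice(2, p=policy[s])
    s_next = rng.choice(2, p=P[s, a])
    r = R[s, a]

    # Differential TD error: reward is centered by the current average-reward estimate
    delta = r - r_bar + V[s_next] - V[s]

    V[s] += alpha * delta
    r_bar += eta * alpha * delta
    s = s_next

print("estimated average reward:", r_bar)
print("estimated differential values:", V - V.mean())  # defined only up to a constant
```

The point of the sketch is the step-size schedule: `alpha` depends only on the global iteration counter `t`, which is the practical setting whose almost sure convergence the paper analyzes, in contrast to earlier guarantees that index the step size by how many times each individual state has been visited.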

Top-level tags: reinforcement learning, theory, model training
Detailed tags: temporal difference learning, average reward, convergence analysis, Markov decision processes, off-policy learning

Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes


1️⃣ One-sentence summary

This paper provides more practical theoretical guarantees for average-reward algorithms in reinforcement learning, which evaluate an agent's long-run performance, proving that differential temporal difference learning converges stably under conditions much closer to how the algorithm is used in practice.

Source: arXiv:2602.16629