Towards Parameter-Free Temporal Difference Learning
1️⃣ One-Sentence Summary
This paper proposes a new method based on an exponential step-size schedule that lets temporal difference learning, a core algorithm in reinforcement learning, converge efficiently and stably in both theory and practice, without relying on problem-specific parameters that are hard to estimate.
Temporal difference (TD) learning is a fundamental algorithm for estimating value functions in reinforcement learning. Recent finite-time analyses of TD with linear function approximation quantify its theoretical convergence rate. However, they often require setting the algorithm parameters using problem-dependent quantities that are difficult to estimate in practice -- such as the minimum eigenvalue of the feature covariance (\(\omega\)) or the mixing time of the underlying Markov chain (\(\tau_{\text{mix}}\)). In addition, some analyses rely on nonstandard and impractical modifications, exacerbating the gap between theory and practice. To address these limitations, we use an exponential step-size schedule with the standard TD(0) algorithm. We analyze the resulting method under two sampling regimes: independent and identically distributed (i.i.d.) sampling from the stationary distribution, and the more practical Markovian sampling along a single trajectory. In the i.i.d.\ setting, the proposed algorithm does not require knowledge of problem-dependent quantities such as \(\omega\), and attains the optimal bias-variance trade-off for the last iterate. In the Markovian setting, we propose a regularized TD(0) algorithm with an exponential step-size schedule. The resulting algorithm achieves a comparable convergence rate to prior works, without requiring projections, iterate averaging, or knowledge of \(\tau_{\text{mix}}\) or \(\omega\).
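To make the idea concrete, here is a minimal sketch of standard TD(0) with linear function approximation under i.i.d. sampling, using an exponentially decaying step size. Everything below is an assumption for illustration: the toy Markov reward process, the feature map, and the schedule constants (`eta0`, the final step size, the horizon `T`) are not taken from the paper, and the exact form of the paper's exponential schedule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state Markov reward process (all numbers below are illustrative
# assumptions, not from the paper).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])   # transition matrix
r = np.array([1.0, 0.0, -1.0])   # expected rewards per state
gamma = 0.9                      # discount factor
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])     # linear features, d = 2

# Stationary distribution mu (left Perron eigenvector of P),
# used for i.i.d. sampling of states.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

# Exponential step-size schedule: eta_t = eta0 * decay**t,
# decaying geometrically from eta0 to eta_final over T steps.
T = 20000
eta0, eta_final = 0.5, 0.01
decay = (eta_final / eta0) ** (1.0 / T)

theta = np.zeros(2)
for t in range(T):
    s = rng.choice(3, p=mu)            # i.i.d. state from the stationary dist.
    s_next = rng.choice(3, p=P[s])     # one sampled transition for the target
    td_err = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    eta = eta0 * decay ** t            # exponential step-size schedule
    theta += eta * td_err * Phi[s]     # standard TD(0) update

# Sanity check: the TD fixed point theta* solves A theta* = b with
# A = Phi^T D (I - gamma P) Phi and b = Phi^T D r, where D = diag(mu).
D = np.diag(mu)
A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
b = Phi.T @ D @ r
theta_star = np.linalg.solve(A, b)
print("TD iterate:", theta, " fixed point:", theta_star)
```

Note that the update itself is unmodified TD(0): no projection step, no iterate averaging, and the schedule uses no problem-dependent quantities such as \(\omega\) or \(\tau_{\text{mix}}\), which is the practical point the paper emphasizes.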
Source: arXiv:2603.02577