Variance-Adaptive Optimal Algorithm for Reinforcement Learning with Multinomial Logit Function Approximation
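1️⃣ One-Sentence Summary
This paper proposes a new reinforcement learning algorithm that adapts to how much the learner's interaction with the environment varies during learning, achieves a theoretically optimal regret bound under multinomial logit function approximation, and shows experimentally that it learns optimal policies more efficiently than conventional methods.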
Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance depends on the variability of the interaction between the learner and the environment. In this paper, we develop a new theoretical analysis for MNL-based Markov decision processes that yields explicit variance-adaptive regret bounds. Our algorithm is computationally efficient and achieves the instance-wise optimal rate of regret, narrowing the gap between upper and lower bounds. Our numerical experiments validate that our method learns optimal policies more efficiently than conventional approaches.
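For readers unfamiliar with the setting, the sketch below illustrates what MNL function approximation of a transition model usually means: the probability of each reachable next state is a softmax over linear scores of a feature vector. It is a minimal illustration and not the paper's algorithm. The function name `mnl_transition_probs`, the feature matrix, and the parameter vector `theta` are hypothetical names chosen for this example.

```python
import numpy as np


def mnl_transition_probs(features: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Return MNL (softmax) probabilities over candidate next states.

    features: array of shape (num_next_states, d), one feature vector per
              candidate next state for a fixed (state, action) pair.
              (Hypothetical input; the abstract does not specify features.)
    theta:    array of shape (d,), the parameter the learner would estimate.
    """
    logits = features @ theta
    # Subtract the maximum logit for numerical stability; the softmax
    # output is unchanged by this shift.
    logits = logits - logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()


# Illustrative usage with made-up numbers (assumptions, not data from the paper).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))   # 4 candidate next states, 3-dim features
theta = np.array([0.5, -1.0, 0.25])
probs = mnl_transition_probs(features, theta)

# With theta = 0 every next state gets the same score, so the model is uniform.
uniform_probs = mnl_transition_probs(features, np.zeros(3))
```

Because the output is a normalized softmax, it is always a valid probability distribution over the candidate next states.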
Source: arXiv: 2605.28364