arXiv submission date: 2026-05-07
📄 Abstract - Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^\pi$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
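The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical sketch of what per-step moment matching against a linear discriminator class could look like: weights at each horizon step are solved so that reweighted features match target moments, which then propagate to the next step. The synthetic data, the least-squares solve, and the recursion structure are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

# Illustrative sketch of recursive per-step moment matching with a linear
# discriminator class. All names and shapes here are hypothetical; the
# paper's objective, recursion, and any regularization may differ.

rng = np.random.default_rng(0)
n, H, d = 500, 5, 8  # data points per step, horizon, feature dimension

# Synthetic logged data: features of (s_h, a_h), rewards, and expected
# next-step features under the target policy pi (assumed computable).
phi = rng.normal(size=(H, n, d))          # phi(s_h, a_h) for each data point
rewards = rng.normal(size=(H, n))
phi_next_pi = rng.normal(size=(H, n, d))  # E_{a' ~ pi}[phi(s_{h+1}, a')]
phi_init_pi = rng.normal(size=d)          # E_{s_0, a ~ pi}[phi(s_0, a)]

# Top-down recursion: at step h, pick weights w_h so that the reweighted
# empirical features match the target moments carried over from step h-1.
value_estimate = 0.0
target = phi_init_pi                      # moments to match at step 0
for h in range(H):
    # Solve (1/n) * phi[h].T @ w = target in least squares.
    w, *_ = np.linalg.lstsq(phi[h].T / n, target, rcond=None)
    value_estimate += w @ rewards[h] / n  # reweighted-reward contribution
    # Propagate: next target moments are the reweighted expected
    # next-state features under pi.
    target = (phi_next_pi[h].T @ w) / n

print(f"estimated J(pi) ≈ {value_estimate:.3f}")
```

With a richer (non-linear) discriminator class, the least-squares solve would be replaced by a min-max moment-matching objective over that class; the linear case is just the setting where the matching conditions reduce to a finite system of equations.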

Top-level tags: reinforcement learning, machine learning theory
Detailed tags: off-policy evaluation, moment matching, importance sampling, finite-horizon MDPs, coverage

Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching


1️⃣ One-Sentence Summary

This paper proposes a new method, Q-MMR, which assigns a weight to each data point and recursively matches moments against value functions, so that the return of a target policy can be accurately estimated from offline data under only the realizability of that policy's Q-function (its state-action values). Moreover, the estimation error does not grow with the complexity of the function class, which substantially relaxes the coverage requirements on the historical data. An illustrative formalization follows below.
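One plausible way to write down the recursive conditions and the resulting estimator, with notation assumed here rather than taken from the paper ($w_{h,i}$ for the per-point weights, $\mathcal{Q}$ for the discriminator class, $H$ for the horizon):

```latex
% Illustrative formalization only; symbols are assumptions, not the
% paper's notation.
% Step-0 weights match the target policy's initial-state moments:
\frac{1}{n}\sum_{i=1}^{n} w_{0,i}\, q(s_{0,i}, a_{0,i})
  = \mathbb{E}_{s_0 \sim d_0,\; a \sim \pi(\cdot \mid s_0)}\big[q(s_0, a)\big]
  \quad \forall q \in \mathcal{Q}.
% Top-down recursion: step-(h+1) weights match the reweighted
% next-state moments under \pi:
\frac{1}{n}\sum_{i=1}^{n} w_{h+1,i}\, q(s_{h+1,i}, a_{h+1,i})
  = \frac{1}{n}\sum_{i=1}^{n} w_{h,i}\,
    \mathbb{E}_{a' \sim \pi(\cdot \mid s_{h+1,i})}\big[q(s_{h+1,i}, a')\big]
  \quad \forall q \in \mathcal{Q}.
% The OPE estimate is then the sum of reweighted observed rewards:
\hat{J}(\pi) = \sum_{h=0}^{H-1} \frac{1}{n}\sum_{i=1}^{n} w_{h,i}\, r_{h,i}.
```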

Source: arXiv:2605.06474