arXiv submission date: 2026-06-08
📄 Abstract - Asymptotic Optimality of Thompson Sampling for Risk-Averse Bandits with Sub-Gaussian Rewards

We prove that $\rho\text{-}\mathrm{NPTS}_{\mathrm{SG}}$, an anchor-free nonparametric Thompson Sampling algorithm for risk-averse bandits, achieves regret matching the instance-dependent lower bound to leading order in $\log n$, establishing it as asymptotically optimal for any continuous risk functional $\rho$ (CVaR, mean-variance, Sharpe ratio, distortion risk measures, and more) on the class of distributions with bounded density and sub-Gaussian tails, including Gaussian arms. Both this result and its bounded-support counterpart require only continuity of $\rho$: strictly weaker than the dominance condition of prior parametric Thompson Sampling results, and strictly weaker than the Lipschitz condition of UCB-type algorithms, yielding the first instance-optimal guarantees for non-Lipschitz functionals such as the Sharpe ratio without parametric reward assumptions. The bounded-support case is developed first as a stepping stone sharing the same proof structure. The key technical contributions are a discretisation lemma (bounded support) and a truncated discretisation lemma (sub-Gaussian tails), each projecting the growing-alphabet Dirichlet posterior onto a fixed grid via the Dirichlet aggregation property, holding all polynomial prefactors at fixed degree independent of sample size and breaking the super-exponential barrier that blocked prior proofs.
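The abstract's two ingredients, a Dirichlet reweighting of observed rewards and the projection of those weights onto a fixed grid, can be illustrated with a minimal sketch. This is a generic nonparametric Thompson Sampling index under stated assumptions, not the paper's $\rho\text{-}\mathrm{NPTS}_{\mathrm{SG}}$: the function names (`sample_dirichlet`, `weighted_cvar`, `npts_index`, `choose_arm`, `aggregate_to_grid`), the uniform Dirichlet(1, ..., 1) weights, the lower-tail CVaR convention, and the clipping of out-of-range rewards to the end cells are all hypothetical choices made for illustration.

```python
import bisect
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) vector by normalising independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def weighted_cvar(values, weights, level):
    """Lower-tail CVaR of a discrete distribution: the mean of the worst
    `level` fraction of probability mass (assumed convention)."""
    remaining = level
    acc = 0.0
    for value, weight in sorted(zip(values, weights)):
        take = min(weight, remaining)
        acc += take * value
        remaining -= take
        if remaining <= 0:
            break
    return acc / level


def npts_index(rewards, rho, rng):
    """Risk functional of a randomly reweighted empirical distribution.
    No extra anchor point is appended to the observed rewards (assumption)."""
    weights = sample_dirichlet([1.0] * len(rewards), rng)
    return rho(rewards, weights)


def choose_arm(history, rho, rng):
    """Play each unobserved arm once, then pick the arm with the largest index."""
    for arm, rewards in enumerate(history):
        if not rewards:
            return arm
    indices = [npts_index(rewards, rho, rng) for rewards in history]
    return max(range(len(history)), key=lambda arm: indices[arm])


def aggregate_to_grid(rewards, weights, edges):
    """Sum the weights of rewards that fall in the same grid cell.
    By the Dirichlet aggregation property, the weights of the non-empty cells
    are again Dirichlet, with parameters equal to the per-cell counts.
    Rewards outside the grid are clipped to the first or last cell."""
    n_cells = len(edges) - 1
    cell_weights = [0.0] * n_cells
    for reward, weight in zip(rewards, weights):
        cell = bisect.bisect_right(edges, reward) - 1
        cell = min(max(cell, 0), n_cells - 1)
        cell_weights[cell] += weight
    return cell_weights
```

However many rewards an arm has accumulated, `aggregate_to_grid` returns a vector whose length is fixed by `edges`. That is the sense in which a growing-alphabet posterior is projected onto a fixed grid. The bounds that the paper derives from this projection are not reproduced here.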

Top-level tags: reinforcement learning theory
Detailed tags: thompson sampling, risk-averse bandits, asymptotic optimality, sub-gaussian

Asymptotic Optimality of Thompson Sampling for Risk-Averse Bandits with Sub-Gaussian Rewards


1️⃣ One-sentence summary

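This paper proves that a nonparametric Thompson Sampling algorithm, which assumes no parametric form for the reward distribution, attains regret matching the instance-dependent lower bound to leading order in risk-averse multi-armed bandits with sub-Gaussian rewards. The algorithm requires only that the risk functional be continuous, so it covers more cases than existing methods (for example, non-Lipschitz functionals such as the Sharpe ratio), and its discretisation lemmas break the super-exponential barrier that blocked earlier proofs.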

Source: arXiv: 2606.09191