利用多臂老虎机中的相似性 / Leveraging Similarities in Multi-Armed Bandits
1️⃣ 一句话总结
本文针对具有相似性的动作集合(例如共享潜在特征或层级结构)的在线学习问题,证明了传统单点反馈无法利用这些相似性,随后提出了一套能在更丰富的反馈模式下(如半赌博反馈或两点反馈)自动利用相似性、降低遗憾上界的通用算法,并在二维Lipschitz老虎机中实现了√T级别的遗憾。
In many online learning and bandit problems, the actions we consider possess inherent similarities--for instance because they share latent traits, tags, or hierarchical structure. We study online learning with a similarity-structured action set, encoded by a rooted tree whose leaves are the actions and whose levels quantify how closely two actions are related. The loss sequence is assumed tree-compatible: losses of similar actions are constrained to be close. We establish an impossibility result showing that usual one-point bandit feedback cannot, in general, leverage range or tree-induced similarity, even under very strong similarity constraints. We then provide a unified set of algorithms which adapt to a wide range of richer feedback models, from semi-bandit feedback down to multi-point bandit protocols, including the minimal two-point feedback setting. We show these algorithms exhibit best-of-both-worlds guarantees and provably exploit action similarities by replacing the number of actions $K$ by a similarity-aware effective number of actions $K_{\mathrm{eff}}$ in the regret bounds. As an application, we show that under two-point feedback, it is possible to achieve $\sqrt{T}$ regret in Lipschitz bandits when $d \leq 2$.
利用多臂老虎机中的相似性 / Leveraging Similarities in Multi-Armed Bandits
本文针对具有相似性的动作集合(例如共享潜在特征或层级结构)的在线学习问题,证明了传统单点反馈无法利用这些相似性,随后提出了一套能在更丰富的反馈模式下(如半赌博反馈或两点反馈)自动利用相似性、降低遗憾上界的通用算法,并在二维Lipschitz老虎机中实现了√T级别的遗憾。
源自 arXiv: 2606.23414