多臂老虎机中自由探索对遗憾最小化的益处 / On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits
1️⃣ 一句话总结
本文提出了一种新算法UFE-KLUCB-H,在初始的“免费探索”阶段后,能显著减少后续决策中的累积遗憾,并通过理论证明和实验验证了该算法相比传统方法的优势,尤其适合需要在有限预算下快速学习环境的情况。
We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting not captured by the classic regret minimization or pure exploration paradigms. The goal is to design an adaptive policy that strategically explores the bandit instance in the initial free exploration phase and minimizes the cumulative regret in the subsequent phase. We formalize this regret minimization with free exploration problem and identify an interesting regime where the free exploration budget scales logarithmically with the time horizon. To quantify the amount of regret saved with high probability as a result of the availability of the free exploration phase, we introduce a novel set of policies known as $(\alpha,\beta)$-probably saving policies. We propose a two-phase, probably saving algorithm, UFE-KLUCB-H, which consists of a principled free exploration policy, UFE, and a history-aware regret minimization policy KLUCB-H. Instance-dependent upper bounds on UFE-KLUCB-H are derived, showing that UFE-KLUCB-H accumulates strictly less regret than policies that do not have access to a free exploration phase. Complementarily, we derive instance-dependent lower bounds based on novel multi-instance perturbation arguments tailored to the free-exploration setting, demonstrating the near-optimality of UFE-KLUCB-H for two-valued bandits. Our upper and lower bounds reveal sharp phase transitions in the accumulated regret depending on the amount of available free exploration. Simulations are conducted to demonstrate that forced exploration and adaptivity in the algorithm lead to greater regret savings.
多臂老虎机中自由探索对遗憾最小化的益处 / On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits
本文提出了一种新算法UFE-KLUCB-H,在初始的“免费探索”阶段后,能显著减少后续决策中的累积遗憾,并通过理论证明和实验验证了该算法相比传统方法的优势,尤其适合需要在有限预算下快速学习环境的情况。
源自 arXiv: 2605.25789