先探测后提交的多目标老虎机:有限多臂反馈的理论优势 / Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback
1️⃣ 一句话总结
这篇论文提出了一种名为‘先探测后提交’的在线决策算法,用于解决需要在多个目标(如速度、延迟、能耗)之间权衡的资源选择问题,它允许决策者在最终选择前先探测少量选项,从而在信息有限的情况下显著提升决策效率,并证明了其性能优于传统方法。
We study an online resource-selection problem motivated by multi-radio access selection and mobile edge computing offloading. In each round, an agent chooses among $K$ candidate links/servers (arms) whose performance is a stochastic $d$-dimensional vector (e.g., throughput, latency, energy, reliability). The key interaction is \emph{probe-then-commit (PtC)}: the agent may probe up to $q>1$ candidates via control-plane measurements to observe their vector outcomes, but must execute exactly one candidate in the data plane. This limited multi-arm feedback regime strictly interpolates between classical bandits ($q=1$) and full-information experts ($q=K$), yet existing multi-objective learning theory largely focuses on these extremes. We develop \textsc{PtC-P-UCB}, an optimistic probe-then-commit algorithm whose technical core is frontier-aware probing under uncertainty in a Pareto mode, e.g., it selects the $q$ probes by approximately maximizing a hypervolume-inspired frontier-coverage potential and commits by marginal hypervolume gain to directly expand the attained Pareto region. We prove a dominated-hypervolume frontier error of $\tilde{O} (K_P d/\sqrt{qT})$, where $K_P$ is the Pareto-frontier size and $T$ is the horizon, and scalarized regret $\tilde{O} (L_\phi d\sqrt{(K/q)T})$, where $\phi$ is the scalarizer. These quantify a transparent $1/\sqrt{q}$ acceleration from limited probing. We further extend to \emph{multi-modal probing}: each probe returns $M$ modalities (e.g., CSI, queue, compute telemetry), and uncertainty fusion yields variance-adaptive versions of the above bounds via an effective noise scale.
先探测后提交的多目标老虎机:有限多臂反馈的理论优势 / Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback
这篇论文提出了一种名为‘先探测后提交’的在线决策算法,用于解决需要在多个目标(如速度、延迟、能耗)之间权衡的资源选择问题,它允许决策者在最终选择前先探测少量选项,从而在信息有限的情况下显著提升决策效率,并证明了其性能优于传统方法。
源自 arXiv: 2602.03175