Review for NeurIPS paper: On Regret with Multiple Best Arms
–Neural Information Processing Systems
Summary and Contributions: The authors consider a stochastic bandit setting with n arms, where n is taken to be very large. Unlike other large-armed bandit work, there is no structure assumed between the payoff of the various arms. Their methods rely heavily on MOSS [Audibert et al.], which attains an optimal regret rate of \sqrt{nT} in the usual setting where n T. Letting m n denote the number of arms that are optimal (or near optimal), the authors begin by defining alpha satisfying n/m T {\alpha} as a key quantity characterizing the hardness. They observe that if alpha were known, one could sample a subset of the arms of cardinality T alpha log T, which contains an optimal arm whp. Applying MOSS attains a regret bound of O(T (1 alpha)/2).
Neural Information Processing Systems
Jan-25-2025, 05:01:23 GMT
- Technology: