Goto

Collaborating Authors

 ucb


Regime-Conditioned Evaluation in Multi-Context Bayesian Optimization

arXiv.org Machine Learning

Published transfer-BO comparisons often estimate an average treatment effect of acquisition choice over hidden regime variables, while practitioners need the conditional effect for their specific prior quality, budget ratio, and metric. An audit of 40 transfer-BO papers from NeurIPS, ICML, ICLR, AISTATS, UAI, TMLR, JMLR, and AutoML-Conf (2022-2025) finds that 98% never vary B/|A| as a controlled axis. On the same GDSC2 benchmark, changing only the budget reverses the ranking: at B=50, Greedy outperforms UCB by 0.050 Hit@1, while at B=100, UCB outperforms Greedy by 0.035. We capture this transition with the Portable Regime Score PRS=(B/|A|)(1-rho), where rho is the prior rank correlation and can be estimated from pilot contexts before the main comparison. Across 79 conditions spanning chemistry, drug-response biology, and HPO, a hierarchical model gives beta=0.50 (p=1.1e-9), and 19% of conditions fall in an equivalence zone where |advantage|<0.01 Hit@1. In five published reversal cases, PRS predicts the winner from pre-comparison observables. A No-Free-Leaderboard proposition explains why unconditional rankings are unstable: when CATE changes sign across regimes, the reported ATE becomes a function of benchmark mixture. RegimePlanner, which estimates rho online and switches acquisition accordingly, wins all 16 HPO-B search spaces at B=100 and exceeds the matched {Greedy,UCB} per-context oracle on GDSC2 by 18%. Pre-registered predictions achieve 27/40=67.5% overall accuracy and above 90% within EMA prior families. The practical protocol is simple: report B/|A|, rho, K, and metric alongside any claimed acquisition advantage.


ACloser Look at the Worst-case Behavior of Multi-armed Bandit Algorithms

Neural Information Processing Systems

One of the key drivers of complexity in the classical (stochastic) multi-armed bandit (MAB) problem is the difference between mean rewards in the top two arms, also known as the instance gap. The celebrated Upper Confidence Bound (UCB) policy is among the simplest optimism-based MAB algorithms that naturally adapts to this gap: for a horizon of play n, it achieves optimal O(log n) regret in instances with "large" gaps, and a near-optimal O nlog n minimax regret when the gap can be arbitrarily "small." This paper provides new results on the arm-sampling behavior of UCB, leading to several important insights. Among these, it is shown that arm-sampling rates under UCB are asymptotically deterministic, regardless of the problem complexity.


ACloser Look at the Worst-case Behavior of Multi-armed Bandit Algorithms

Neural Information Processing Systems

One of the key drivers of complexity in the classical (stochastic) multi-armed bandit (MAB) problem is the difference between mean rewards in the top two arms, also known as the instance gap. The celebrated Upper Confidence Bound (UCB) policy is among the simplest optimism-based MAB algorithms that naturally adapts to this gap: for a horizon of play n, it achieves optimal O(logn) regret in instances with "large" gaps, and a near-optimal O p nlogn minimax regret when the gap can be arbitrarily "small." This paper provides new results on the arm-sampling behavior of UCB, leading to several important insights. Among these, it is shown that arm-sampling rates under UCB are asymptotically deterministic, regardless of the problem complexity.





Online Multi-Armed Bandits with Adaptive Inference

Neural Information Processing Systems

During online decision making in multi-armed bandits, one needs to conduct inference on the true mean reward of each arm based on data collected so far at each step. However, since the arms are adaptively selected-thereby yielding non-i.i.d.



ALMAB-DC: Active Learning, Multi-Armed Bandits, and Distributed Computing for Sequential Experimental Design and Black-Box Optimization

arXiv.org Machine Learning

Sequential experimental design under expensive, gradient-free objectives is a central challenge in computational statistics: evaluation budgets are tightly constrained and information must be extracted efficiently from each observation. We propose \textbf{ALMAB-DC}, a GP-based sequential design framework combining active learning, multi-armed bandits (MAB), and distributed asynchronous computing for expensive black-box experimentation. A Gaussian process surrogate with uncertainty-aware acquisition identifies informative query points; a UCB or Thompson-sampling bandit controller allocates evaluations across parallel workers; and an asynchronous scheduler handles heterogeneous runtimes. We present cumulative regret bounds for the bandit components and characterize parallel scalability via Amdahl's Law. We validate ALMAB-DC on five benchmarks. On the two statistical experimental-design tasks, ALMAB-DC achieves lower simple regret than Equal Spacing, Random, and D-optimal designs in dose--response optimization, and in adaptive spatial field estimation matches the Greedy Max-Variance benchmark while outperforming Latin Hypercube Sampling; at $K=4$ the distributed setting reaches target performance in one-quarter of sequential wall-clock rounds. On three ML/engineering tasks (CIFAR-10 HPO, CFD drag minimization, MuJoCo RL), ALMAB-DC achieves 93.4\% CIFAR-10 accuracy (outperforming BOHB by 1.7\,pp and Optuna by 1.1\,pp), reduces airfoil drag to $C_D = 0.059$ (36.9\% below Grid Search), and improves RL return by 50\% over Grid Search. All advantages over non-ALMAB baselines are statistically significant under Bonferroni-corrected Mann--Whitney $U$ tests. Distributed execution achieves $7.5\times$ speedup at $K = 16$ agents, consistent with Amdahl's Law.


Minimizing UCB: a Better Local Search Strategy in Local Bayesian Optimization

Neural Information Processing Systems

Local Bayesian optimization is a promising practical approach to solve the high dimensional black-box function optimization problem. Among them is the approximated gradient class of methods, which implements a strategy similar to gradient descent. These methods have achieved good experimental results and theoretical guarantees. However, given the distributional properties of the Gaussian processes applied on these methods, there may be potential to further exploit the information of the Gaussian processes to facilitate the BO search. In this work, we develop the relationship between the steps of the gradient descent method and one that minimizes the Upper Confidence Bound (UCB), and show that the latter can be a better strategy than direct gradient descent when a Gaussian process is applied as a surrogate.