 argmax




Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?

arXiv.org Machine Learning

Multi-objective bandits have attracted increasing attention because of their broad applicability and mathematical elegance, where the reward of each arm is a multi-dimensional vector rather than a scalar. This naturally introduces Pareto order relations and Pareto regret. A long-standing question in this area is whether performance is fundamentally harder to optimize because of this added complexity. A recent surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; however, in the stochastic setting, where the regret notion is different, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the dimensionality. This controversial yet subtle phenomenon motivates our central question: \emph{are multi-objective bandits actually harder than single-objective ones?} We answer this question in full by showing that, in the stochastic setting, Pareto regret is in fact governed by the maximum sub-optimality gap \(g^\dagger\), and hence by the minimum marginal regret of order \(\Omega(\frac{K\log T}{g^\dagger})\). We further develop a new algorithm that achieves Pareto regret of order \(O(\frac{K\log T}{g^\dagger})\), and is therefore optimal. The algorithm leverages a nested two-layer uncertainty quantification over both arms and objectives through upper and lower confidence bound estimators. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection. Together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments to validate the proposed algorithm, showing the desired regret guarantee and significant gains over benchmark methods.
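The Pareto order mentioned in this abstract can be illustrated with a small sketch. It only shows how vector-valued mean rewards induce a set of non-dominated arms; it is not the paper's algorithm, and the names `dominates` and `pareto_front` as well as the example means are hypothetical.

```python
def dominates(u, v):
    """True if u is at least as good as v in every objective and strictly better in one."""
    pairs = list(zip(u, v))
    return all(a >= b for a, b in pairs) and any(a > b for a, b in pairs)


def pareto_front(means):
    """Indices of arms whose mean reward vector is not dominated by any other arm."""
    front = []
    for i, u in enumerate(means):
        dominated = any(dominates(v, u) for k, v in enumerate(means) if k != i)
        if not dominated:
            front.append(i)
    return front


# Hypothetical mean rewards for four arms with two objectives each.
example_means = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5), (0.4, 0.4)]
example_front = pareto_front(example_means)
```

In this example the last arm is dominated by the third, so only the first three arms are Pareto optimal. With a single objective the front reduces to the arms attaining the maximum mean.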


Building a stable classifier with the inflated argmax

Neural Information Processing Systems

We propose a new framework for algorithmic stability in the context of multiclass classification. In practice, classification algorithms often operate by first assigning a continuous score (for instance, an estimated probability) to each possible label, then taking the maximizer---i.e., selecting the class that has the highest score. A drawback of this type of approach is that it is inherently unstable, meaning that it is very sensitive to slight perturbations of the training data, since taking the maximizer is discontinuous. Motivated by this challenge, we propose a pipeline for constructing stable classifiers from data, using bagging (i.e., resampling and averaging) to produce stable continuous scores, and then using a stable relaxation of argmax, which we call the inflated argmax, to convert these scores to a set of candidate labels. The resulting stability guarantee places no distributional assumptions on the data, does not depend on the number of classes or dimensionality of the covariates, and holds for any base classifier. Using a common benchmark data set, we demonstrate that the inflated argmax provides necessary protection against unstable classifiers, without loss of accuracy.
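The two-stage pipeline described in this abstract (averaged scores, then a set-valued relaxation of argmax) can be sketched as follows. The abstract does not give the definition of the inflated argmax, so the second stage here is a simpler margin-based stand-in that returns every label whose score is within `eps` of the top score. The names `bagged_scores`, `margin_argmax`, and `eps` are assumptions made for illustration.

```python
def bagged_scores(score_vectors):
    """Average per-label scores produced by base classifiers fit on resampled data."""
    m = len(score_vectors)
    n_labels = len(score_vectors[0])
    return [sum(v[j] for v in score_vectors) / m for j in range(n_labels)]


def margin_argmax(scores, eps):
    """Stand-in set-valued argmax: all labels scoring within eps of the maximum.

    This is NOT the paper's inflated argmax; it only illustrates the idea of
    returning a set of candidate labels instead of a single maximizer.
    """
    top = max(scores)
    return {j for j, s in enumerate(scores) if s > top - eps}


# Hypothetical scores: two labels nearly tied, one clearly behind.
candidates = margin_argmax([0.50, 0.48, 0.02], eps=0.05)
```

When two labels are nearly tied, a small perturbation of the scores can flip a plain argmax, while the set-valued output keeps both labels as candidates.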


e464656edca5e58850f8cec98cbb979b-Supplemental.pdf

Neural Information Processing Systems

To be consistent with the accuracy definition, we denote the correctness of \(s_t^j\) for instance \(t\) as \(\mathrm{sim}(s_t^j, r_t) = (\sqrt{2} - \mathrm{distance}(s_t^j, r_t))/\sqrt{2}\), where \(\mathrm{sim}(s_t^j, r_t)\) is in the range \([0, 1]\), \(\mathrm{distance}(s_t^j, r_t)\) is in the range \([0, \sqrt{2}]\), and \(\sqrt{2}\) is the largest Euclidean distance in the probability simplex. Given a test dataset \(I\), the correctness of a learner \(SL_j\) on \(I\) can be denoted as \(\mathrm{corr}_{SL_j} = \frac{1}{n}\sum_{t=1}^{n} \mathrm{sim}(s_t^j, r_t)\). In this section, we define multiple metrics for consistency, accuracy, and correct-consistency in detail. Figure 1 shows the metrics computation in our experiments. We have created a git repository for this work, which will be posted upon the acceptance and publication of this work.
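The similarity and correctness metrics in this excerpt can be sketched in a few lines, assuming predictions and references are probability vectors of equal length and that the distance is Euclidean, scaled by the simplex diameter of the square root of 2. The function names `sim` and `corr` mirror the excerpt; the example data is hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)  # largest Euclidean distance between two points of the simplex


def sim(s, r):
    """Similarity in [0, 1] between a predicted and a reference probability vector."""
    distance = math.dist(s, r)  # Euclidean distance, at most sqrt(2) on the simplex
    return (SQRT2 - distance) / SQRT2


def corr(predictions, references):
    """Mean similarity of one learner's predictions over a test set of n instances."""
    n = len(predictions)
    return sum(sim(s, r) for s, r in zip(predictions, references)) / n


# Hypothetical two-instance test set: one exact match, one maximally wrong prediction.
example_corr = corr([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

An exact match yields a similarity of 1 and two distinct one-hot vectors yield 0, so the example learner scores about 0.5.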


Enhancing Knowledge Transfer for Task Incremental Learning with Data-free Subnetwork Qiang Gao

Neural Information Processing Systems

DSN primarily seeks to transfer knowledge from the learned tasks to the incoming task by selecting the affiliated weights of a small set of neurons to be activated, including neurons reused from prior tasks via neuron-wise masks. It also transfers possibly valuable knowledge to the earlier tasks via data-free replay.


Model-Free Active Exploration in Reinforcement Learning

Neural Information Processing Systems

We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. We adopt an information-theoretical viewpoint and start from the instance-specific lower bound of the number of samples that have to be collected to identify a nearly-optimal policy.