Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?

Guan, Changkun, Xu, Mengfan

arXiv.org Machine Learning

Multi-objective bandits have attracted increasing attention because of their broad applicability and mathematical elegance, where the reward of each arm is a multi-dimensional vector rather than a scalar. This naturally introduces Pareto order relations and Pareto regret. A long-standing question in this area is whether performance is fundamentally harder to optimize because of this added complexity. A recent surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; however, in the stochastic setting, where the regret notion is different, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the dimensionality. This controversial yet subtle phenomenon motivates our central question: \emph{are multi-objective bandits actually harder than single-objective ones?} We answer this question in full by showing that, in the stochastic setting, Pareto regret is in fact governed by the maximum sub-optimality gap \(g^\dagger\), and hence by the minimum marginal regret of order \(\Omega(\frac{K\log T}{g^\dagger})\). We further develop a new algorithm that achieves Pareto regret of order \(O(\frac{K\log T}{g^\dagger})\), and is therefore optimal. The algorithm leverages a nested two-layer uncertainty quantification over both arms and objectives through upper and lower confidence bound estimators. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection. Together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments to validate the proposed algorithm, showing the desired regret guarantee and significant gains over benchmark methods.
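The two-layer design described in the abstract can be sketched in code. This is a minimal illustrative sketch, not the paper's exact algorithm: the class name, the confidence radius \(\sqrt{2\log t / n}\), and the specific leader/challenger scoring rules are assumptions chosen only to show how top-two racing over arms can nest with an uncertainty-greedy rule over objective dimensions.

```python
import math

class ParetoUCB:
    """Illustrative sketch (not the paper's algorithm): nested UCB/LCB
    uncertainty over (arm, dimension) pairs, with top-two racing for
    arm selection and an uncertainty-greedy rule for dimension selection."""

    def __init__(self, n_arms, n_dims):
        self.n = [[0] * n_dims for _ in range(n_arms)]      # pulls per (arm, dim)
        self.mean = [[0.0] * n_dims for _ in range(n_arms)]  # empirical means
        self.t = 0

    def _radius(self, a, d):
        # Confidence radius; unexplored pairs are maximally uncertain.
        if self.n[a][d] == 0:
            return float("inf")
        return math.sqrt(2 * math.log(max(self.t, 2)) / self.n[a][d])

    def select(self):
        self.t += 1
        dims = range(len(self.mean[0]))
        # Lower/upper confidence scores per arm, taken over its best dimension.
        lcb = [max(self.mean[a][d] - self._radius(a, d) for d in dims)
               for a in range(len(self.mean))]
        ucb = [max(self.mean[a][d] + self._radius(a, d) for d in dims)
               for a in range(len(self.mean))]
        leader = max(range(len(lcb)), key=lambda a: lcb[a])
        challenger = max((a for a in range(len(ucb)) if a != leader),
                         key=lambda a: ucb[a])
        # Top-two racing: pull whichever of the pair is less explored.
        arm = (leader if sum(self.n[leader]) <= sum(self.n[challenger])
               else challenger)
        # Uncertainty-greedy dimension: observe the widest confidence interval.
        dim = max(dims, key=lambda d: self._radius(arm, d))
        return arm, dim

    def update(self, arm, dim, reward):
        self.n[arm][dim] += 1
        self.mean[arm][dim] += (reward - self.mean[arm][dim]) / self.n[arm][dim]
```

In this sketch the racing step balances exploration between the two candidate arms, while the dimension rule spends each observation where the confidence interval is widest, mirroring the two-layer exploration/exploitation trade-off described above.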


e464656edca5e58850f8cec98cbb979b-Supplemental.pdf

Neural Information Processing Systems

To be consistent with the accuracy definition, we denote the correctness of \(s_t^j\) for instance \(t\) as \(\mathrm{sim}(s_t^j, r_t) = (\sqrt{2} - \mathrm{distance}(s_t^j, r_t))/\sqrt{2}\), where \(\mathrm{sim}(s_t^j, r_t)\) is in the range \([0, 1]\) and \(\mathrm{distance}(s_t^j, r_t)\) is in the range \([0, \sqrt{2}]\), \(\sqrt{2}\) being the largest Euclidean distance in the probability simplex. Given a test dataset \(I\) of size \(n\), the correctness of a learner \(SL_j\) on \(I\) can be denoted as \(\mathrm{corr}_{SL_j} = \frac{1}{n}\sum_{t=1}^{n} \mathrm{sim}(s_t^j, r_t)\). In this section, we define multiple metrics for consistency, accuracy, and correct-consistency in detail. Figure 1 shows the metrics computation in our experiments. We have created a git repository for this work, which will be posted upon the acceptance and publication of this work.
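The two formulas above translate directly into code. A minimal sketch, assuming `preds` and `refs` are lists of probability vectors over the same classes (function names are illustrative, not from the paper's repository):

```python
import math

def sim(s, r):
    """Correctness of prediction s against reference r on the probability
    simplex: (sqrt(2) - ||s - r||_2) / sqrt(2), which lies in [0, 1]."""
    dist = math.sqrt(sum((si - ri) ** 2 for si, ri in zip(s, r)))
    return (math.sqrt(2) - dist) / math.sqrt(2)

def correctness(preds, refs):
    """Dataset-level corr metric: average sim over all n test instances."""
    return sum(sim(s, r) for s, r in zip(preds, refs)) / len(preds)
```

Identical distributions score 1, and maximally distant one-hot vectors (Euclidean distance \(\sqrt{2}\)) score 0.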


Enhancing Knowledge Transfer for Task Incremental Learning with Data-free Subnetwork Qiang Gao

Neural Information Processing Systems

DSN primarily seeks to transfer knowledge from the previously learned tasks to each newly arriving task by selecting, via neuron-wise masks, the affiliated weights of a small set of neurons to activate, including neurons reused from prior tasks. It also transfers potentially valuable knowledge back to the earlier tasks via data-free replay.
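A neuron-wise mask of the kind described can be sketched as follows. This is a hypothetical illustration, not DSN's actual implementation: the allocation rule (reuse prior-task neurons, then activate a few fresh ones up to a budget) and all names are assumptions made for the example.

```python
import random

def make_neuron_mask(n_neurons, frac_active, reused=frozenset(), seed=0):
    """Hypothetical neuron-wise mask for a new task: keep neurons reused
    from prior tasks active, then activate fresh neurons up to the budget."""
    rng = random.Random(seed)
    free = [i for i in range(n_neurons) if i not in reused]
    budget = max(0, int(frac_active * n_neurons) - len(reused))
    fresh = set(rng.sample(free, min(budget, len(free))))
    return [1.0 if (i in reused or i in fresh) else 0.0
            for i in range(n_neurons)]

def masked_forward(weights, x, mask):
    """Apply the affiliated weights of activated neurons only:
    masked-out neurons contribute zero to the layer output."""
    return [m * sum(w * xi for w, xi in zip(row, x))
            for row, m in zip(weights, mask)]
```

The mask confines each task to a small subnetwork while letting overlapping (reused) neurons carry forward knowledge from earlier tasks.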


Model-Free Active Exploration in Reinforcement Learning

Neural Information Processing Systems

We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. We adopt an information-theoretical viewpoint and start from the instance-specific lower bound of the number of samples that have to be collected to identify a nearly-optimal policy.


Latent Template Induction with Gumbel-CRFs Appendix

Neural Information Processing Systems

Papandreou and Yuille [4] proposed the Perturb-and-MAP Random Field, an efficient sampling method for general Markov Random Fields. We compare the detailed structure of the gradients of each estimator. All gradients are formed as a summation over the steps. The Gumbel-CRF and PM-MRF estimators can be decomposed with a pathwise term, where we take the gradient of \(f\) w.r.t. Since the official test set is not publicly available, we use the same training/validation/test split as Fu et al. [1].
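The pathwise term mentioned above rests on the Perturb-and-MAP idea: perturb the potentials with Gumbel noise, then take a (relaxed) argmax so the sample is differentiable in the parameters. A minimal single-variable sketch, assuming categorical logits (the CRF case applies the same relaxation factor by factor); the function name and temperature default are illustrative:

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed Perturb-and-MAP sample: add Gumbel(0,1) noise to each
    logit, then soften the argmax with a temperature-tau softmax.
    The output is a point on the simplex, differentiable in the logits."""
    # Gumbel(0,1) noise via inverse transform: -log(-log(U)), U ~ Uniform(0,1)
    g = [-math.log(-math.log(rng.random())) for _ in logits]
    z = [(l + gi) / tau for l, gi in zip(logits, g)]
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]
```

As the temperature `tau` approaches zero the output approaches a one-hot argmax (the exact Perturb-and-MAP sample), at the cost of higher-variance pathwise gradients.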