Goto

Collaborating Authors

 processing system


Joint Model and Data Sparsification via the Marginal Likelihood

arXiv.org Machine Learning

Sparse recovery in linear systems underpins applications from signal processing to high-dimensional regression. Sparse Bayesian Learning, grounded in the principle of automatic relevance determination (ARD), offers a practical Bayesian mechanism for feature sparsity via marginal likelihood optimization. Yet, its reliance on a homoscedastic noise model renders it sensitive to data contaminations such as outliers or misspecified noise, harming model fit and predictions. Instead, we propose jointly learning individual feature and sample relevancies, enabling simultaneous model and data sparsification via a single Bayesian objective. This symmetric pruning of model and data offers a natural extension that preserves conjugacy, admits closed-form updates for standard optimization procedures, and aligns with perspectives from robust regression and influence functions. Empirical results across diverse regression tasks affirm that a joint ARD approach consistently yields both sparse and robust prediction models.


Constrained Bayesian Experimental Design via Online Planning

arXiv.org Machine Learning

Bayesian experimental design (BED) is a principled framework for data-efficient design of sequential experiments. However, existing BED methods are unable to adapt to dynamic constraints inherent in real-world tasks due to budget limitations, varying costs, or physical constraints that restrict how designs evolve over time. In this paper, we introduce a novel approach to BED that enables constrained optimization of experimental designs by combining offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. We empirically demonstrate that our method yields substantially more informative design sequences than existing methods across a range of constrained BED tasks, while incurring only a modest additional computational overhead.


Optimal Regret for Single Index Bandits

arXiv.org Machine Learning

We study the $\textit{single-index bandit}$ problem, where rewards depend on an unknown one-dimensional projection of high-dimensional contexts through an unknown reward function. This model extends linear and generalized linear bandits to a nonparametric setting, and is particularly relevant when the reward function is not known in advance. While optimal regret guarantees are known for monotone reward functions, the general non-monotone case remains poorly understood, with the best known bound being $\tilde{\mathcal{O}}(T^{3/4})$ (under standard boundedness and Lipschitz assumptions on the reward function [Kang et al., 2025]). We close this gap by establishing the optimal regret for general single-index bandits. We propose a simple two-phase algorithm, namely, Zoomed Single Index Bandit with Upper Confidence Bound ($\texttt{ZoomSIB-UCB}$), that first estimates the projection direction via a normalized Stein estimator, and then reduces the problem to a one-dimensional bandit using discretization and finally use UCB. This approach achieves a regret of $\tilde{\mathcal{O}}(T^{2/3})$, and improves significantly upon prior work without any additional assumptions. We also prove a matching minimax lower bound of $\tildeΩ(T^{2/3})$, showing that the upper bound is essentially tight. Our upper and lower bounds together provide a sharp characterization of the regret in single-index bandits. Moreover, the empirical results further demonstrate the effectiveness and robustness of our approach.


Optimized Deferral for Imbalanced Settings

arXiv.org Machine Learning

Learning algorithms can be significantly improved by routing complex or uncertain inputs to specialized experts, balancing accuracy with computational cost. This approach, known as learning to defer, is essential in domains like natural language generation, medical diagnosis, and computer vision, where an effective deferral can reduce errors at low extra resource consumption. However, the two-stage learning to defer setting, which leverages existing predictors such as a collection of LLMs or other classifiers, often faces challenges due to an expert imbalance problem. This imbalance can lead to suboptimal performance, with deferral algorithms favoring the majority expert. We present a comprehensive study of two-stage learning to defer in expert imbalance settings. We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings. Extensive experiments demonstrate the effectiveness of our approach, showing clear improvements over existing baselines on both image classification and real-world Large Language Model (LLM) routing tasks.


Submodular Cover Problem Bicriteria Approximation Algorithms for the

Neural Information Processing Systems

Another example is when expected advertising revenue if we set τ = max{f(X): X U}, SCP asks to find the set of minimum size in U that achieves measure how effectively a subset X summarizes the entire dataset U [Tschiatschek et al., 2014].


Occam's Razor is Only as Sharp as Your ELBO

arXiv.org Machine Learning

The marginal likelihood, also known as the evidence, is regarded as a mathematical embodiment of Occam's razor, enabling model selection that avoids overfitting. The evidence lower bound (ELBO) objective from variational inference has also been used for similar purposes. Prior work has shown that restricting the approximate posterior family via a mean-field approximation can lead the ELBO to underfit. In this paper, we show how ELBO-based hyperparameter learning in a simple over-parameterized regression model can also produce overfitting, depending on the assumed rank of the covariance matrix in a Gaussian approximate posterior. Surprisingly, among only the underfit and overfit options, Bayesian model selection via the evidence itself sometimes prefers the overfit version, while the ELBO does not. Bayesian practitioners hoping to scale to large models should be cautious about how reduced-rank assumptions needed for tractability may impact the potential for model selection.


Laplace Approximation for Bayesian Tensor Network Kernel Machines

arXiv.org Machine Learning

Uncertainty estimation is essential for robust decision-making in the presence of ambiguous or out-of-distribution inputs. Gaussian Processes (GPs) are classical kernel-based models that offer principled uncertainty quantification and perform well on small- to medium-scale datasets. Alternatively, formulating the weight space learning problem under tensor network assumptions yields scalable tensor network kernel machines. However, these assumptions break Gaussianity, complicating standard probabilistic inference. This raises a fundamental question: how can tensor network kernel machines provide principled uncertainty estimates? We propose a novel Bayesian Tensor Network Kernel Machine (LA-TNKM) that employs a (linearized) Laplace approximation for Bayesian inference. A comprehensive set of numerical experiments shows that the proposed method consistently matches or surpasses Gaussian Processes and Bayesian Neural Networks (BNNs) across diverse UCI regression benchmarks, highlighting both its effectiveness and practical relevance.


Parallelizing Thompson Sampling

Neural Information Processing Systems

How can we make use of information parallelism in online decision making problems while efficiently balancing the exploration-exploitation trade-off? In this paper, we introduce a batch Thompson Sampling framework for two canonical online decision making problems, namely, stochastic multi-arm bandit and linear contextual bandit with finitely many arms. Over a time horizon T, our batch Thompson Sampling policy achieves the same (asymptotic) regret bound of a fully sequential one while carrying out only O(log T) batch queries. To achieve this exponential reduction, i.e., reducing the number of interactions from T to O(log T), our batch policy dynamically determines the duration of each batch in order to balance the exploration-exploitation trade-off. We also demonstrate experimentally that dynamic batch allocation dramatically outperforms natural baselines such as static batch allocations.


On the Universality of Graph Neural Networks on Large Random Graphs

Neural Information Processing Systems

We study the approximation power of Graph Neural Networks (GNNs) on latent position random graphs. In the large graph limit, GNNs are known to converge to certain "continuous" models known as c-GNNs, which directly enables a study of their approximation power on random graph models. In the absence of input node features however, just as GNNs are limited by the Weisfeiler-Lehman isomorphism test, c-GNNs will be severely limited on simple random graph models. For instance, they will fail to distinguish the communities of a well-separated Stochastic Block Model (SBM) with constant degree function. Thus, we consider recently proposed architectures that augment GNNs with unique node identifiers, referred to as Structural GNNs here (SGNNs). We study the convergence of SGNNs to their continuous counterpart (c-SGNNs) in the large random graph limit, under new conditions on the node identifiers. We then show that c-SGNNs are strictly more powerful than c-GNNs in the continuous limit, and prove their universality on several random graph models of interest, including most SBMs and a large class of random geometric graphs. Our results cover both permutation-invariant and permutation-equivariant architectures.