Goto

Collaborating Authors

 Bayesian Inference


Stability-based Generalization Bounds for Variational Inference

arXiv.org Artificial Intelligence

Variational inference (VI) is widely used for approximate inference in Bayesian machine learning. In addition to this practical success, generalization bounds for variational inference and related algorithms have been developed, mostly through the connection to PAC-Bayes analysis. A second line of work has provided algorithm-specific generalization bounds through stability arguments or using mutual information bounds, and has shown that the bounds are tight in practice, but unfortunately these bounds do not directly apply to approximate Bayesian algorithms. This paper fills this gap by developing algorithm-specific stability based generalization bounds for a class of approximate Bayesian algorithms that includes VI, specifically when using stochastic gradient descent to optimize their objective. As in the non-Bayesian case, the generalization error is bounded by by expected parameter differences on a perturbed dataset. The new approach complements PAC-Bayes analysis and can provide tighter bounds in some cases. An experimental illustration shows that the new approach yields non-vacuous bounds on modern neural network architectures and datasets and that it can shed light on performance differences between variant approximate Bayesian algorithms.


IMLE Policy: Fast and Sample Efficient Visuomotor Policy Learning via Implicit Maximum Likelihood Estimation

arXiv.org Artificial Intelligence

Recent advances in imitation learning, particularly using generative modelling techniques like diffusion, have enabled policies to capture complex multi-modal action distributions. However, these methods often require large datasets and multiple inference steps for action generation, posing challenges in robotics where the cost for data collection is high and computation resources are limited. To address this, we introduce IMLE Policy, a novel behaviour cloning approach based on Implicit Maximum Likelihood Estimation (IMLE). IMLE Policy excels in low-data regimes, effectively learning from minimal demonstrations and requiring 38\% less data on average to match the performance of baseline methods in learning complex multi-modal behaviours. Its simple generator-based architecture enables single-step action generation, improving inference speed by 97.3\% compared to Diffusion Policy, while outperforming single-step Flow Matching. We validate our approach across diverse manipulation tasks in simulated and real-world environments, showcasing its ability to capture complex behaviours under data constraints. Videos and code are provided on our project page: https://imle-policy.github.io/.


False Discovery Rate Control via Frequentist-assisted Horseshoe

arXiv.org Machine Learning

The horseshoe prior, a widely used handy alternative to the spike-and-slab prior, has proven to be an exceptional default global-local shrinkage prior in Bayesian inference and machine learning. However, designing tests with frequentist false discovery rate (FDR) control using the horseshoe prior or the general class of global-local shrinkage priors remains an open problem. In this paper, we propose a frequentist-assisted horseshoe procedure that not only resolves this long-standing FDR control issue for the high dimensional normal means testing problem but also exhibits satisfactory finite-sample FDR control under any desired nominal level for both large-scale multiple independent and correlated tests. We carry out the frequentist-assisted horseshoe procedure in an easy and intuitive way by using the minimax estimator of the global parameter of the horseshoe prior while maintaining the remaining full Bayes vanilla horseshoe structure. The results of both intensive simulations under different sparsity levels, and real-world data demonstrate that the frequentist-assisted horseshoe procedure consistently achieves robust finite-sample FDR control. Existing frequentist or Bayesian FDR control procedures can lose finite-sample FDR control in a variety of common sparse cases. Based on the intimate relationship between the minimax estimation and the level of FDR control discovered in this work, we point out potential generalizations to achieve FDR control for both more complicated models and the general global-local shrinkage prior family.


Improved Online Confidence Bounds for Multinomial Logistic Bandits

arXiv.org Machine Learning

In this paper, we propose an improved online confidence bound for multinomial logistic (MNL) models and apply this result to MNL bandits, achieving variance-dependent optimal regret. Recently, Lee & Oh (2024) established an online confidence bound for MNL models and achieved nearly minimax-optimal regret in MNL bandits. However, their results still depend on the norm-boundedness of the unknown parameter $B$ and the maximum size of possible outcomes $K$. To address this, we first derive an online confidence bound of $O\left(\sqrt{d \log t} + B \right)$, which is a significant improvement over the previous bound of $O (B \sqrt{d} \log t \log K )$ (Lee & Oh, 2024). This is mainly achieved by establishing tighter self-concordant properties of the MNL loss and introducing a novel intermediary term to bound the estimation error. Using this new online confidence bound, we propose a constant-time algorithm, OFU-MNL++, which achieves a variance-dependent regret bound of $O \Big( d \log T \sqrt{ \smash[b]{\sum_{t=1}^T} \sigma_t^2 } \Big) $ for sufficiently large $T$, where $\sigma_t^2$ denotes the variance of the rewards at round $t$, $d$ is the dimension of the contexts, and $T$ is the total number of rounds. Furthermore, we introduce a Maximum Likelihood Estimation (MLE)-based algorithm, OFU-MN$^2$L, which achieves an anytime poly(B)-free regret of $O \Big( d \log (BT) \sqrt{ \smash[b]{\sum_{t=1}^T} \sigma_t^2 } \Big) $.


In-Context Parametric Inference: Point or Distribution Estimators?

arXiv.org Machine Learning

Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators like maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational complexity and the approximation gap of posterior estimation methods. However, a good understanding of trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remain competitive in some low-dimensional problems, and we further discuss why this might be the case.


Beyond Any-Shot Adaptation: Predicting Optimization Outcome for Robustness Gains without Extra Pay

arXiv.org Artificial Intelligence

The foundation model enables general-purpose problem-solving and enjoys desirable rapid adaptation due to its adopted cross-task generalization paradigms, e.g., pretraining, meta-training, and finetuning. Recent advances in these paradigms show the crucial role of challenging tasks' prioritized sampling in enhancing adaptation robustness. However, ranking task difficulties exhausts massive task queries to evaluate, thus computation and annotation intensive, which is typically unaffordable in practice. This work underscores the criticality of both adaptation robustness and learning efficiency, especially in scenarios where tasks are risky or costly to evaluate, e.g., policy evaluations in Markov decision processes (MDPs) or inference with large models. To this end, we present Model Predictive Task Sampling (MPTS) to establish connections between the task space and adaptation risk landscape to form a theoretical guideline in robust active task sampling. MPTS characterizes the task episodic information with a generative model and directly predicts task-specific adaptation risk values from posterior inference. The developed risk learner can amortize expensive evaluation and provably approximately rank task difficulties in the pursuit of task robust adaptation. MPTS can be seamlessly integrated into zero-shot, few-shot, and many-shot learning paradigms. Extensive experimental results are conducted to exhibit the superiority of the proposed framework, remarkably increasing task adaptation robustness and retaining learning efficiency in contrast to existing state-of-the-art (SOTA) methods. The code is available at the project site https://github.com/thu-rllab/MPTS.


A statistical theory of overfitting for imbalanced classification

arXiv.org Machine Learning

Classification with imbalanced data is a common challenge in data analysis, where certain classes (minority classes) account for a small fraction of the training data compared with other classes (majority classes). Classical statistical theory based on large-sample asymptotics and finite-sample corrections is often ineffective for high-dimensional data, leaving many overfitting phenomena in empirical machine learning unexplained. In this paper, we develop a statistical theory for high-dimensional imbalanced classification by investigating support vector machines and logistic regression. We find that dimensionality induces truncation or skewing effects on the logit distribution, which we characterize via a variational problem under high-dimensional asymptotics. In particular, for linearly separable data generated from a two-component Gaussian mixture model, the logits from each class follow a normal distribution $\mathsf{N}(0,1)$ on the testing set, but asymptotically follow a rectified normal distribution $\max\{\kappa, \mathsf{N}(0,1)\}$ on the training set -- which is a pervasive phenomenon we verified on tabular data, image data, and text data. This phenomenon explains why the minority class is more severely affected by overfitting. Further, we show that margin rebalancing, which incorporates class sizes into the loss function, is crucial for mitigating the accuracy drop for the minority class. Our theory also provides insights into the effects of overfitting on calibration and other uncertain quantification measures.


Multifidelity Simulation-based Inference for Computationally Expensive Simulators

arXiv.org Machine Learning

Across many domains of science, stochastic models are an essential tool to understand the mechanisms underlying empirically observed data. Models can be of different levels of detail and accuracy, with models of high-fidelity (i.e., high accuracy) to the phenomena under study being often preferable. However, inferring parameters of high-fidelity models via simulation-based inference is challenging, especially when the simulator is computationally expensive. We introduce MF-NPE, a multifidelity approach to neural posterior estimation that leverages inexpensive low-fidelity simulations to infer parameters of high-fidelity simulators within a limited simulation budget. MF-NPE performs neural posterior estimation with limited high-fidelity resources by virtue of transfer learning, with the ability to prioritize individual observations using active learning. On one statistical task with analytical ground-truth and two real-world tasks, MF-NPE shows comparable performance to current approaches while requiring up to two orders of magnitude fewer high-fidelity simulations. Overall, MF-NPE opens new opportunities to perform efficient Bayesian inference on computationally expensive simulators.


Estimation of the Learning Coefficient Using Empirical Loss

arXiv.org Machine Learning

The learning coefficient plays a crucial role in analyzing the performance of information criteria, such as the Widely Applicable Information Criterion (WAIC) and the Widely Applicable Bayesian Information Criterion (WBIC), which Sumio Watanabe developed to assess model generalization ability. In regular statistical models, the learning coefficient is given by d/2, where d is the dimension of the parameter space. More generally, it is defined as the absolute value of the pole order of a zeta function derived from the Kullback-Leibler divergence and the prior distribution. However, except for specific cases such as reduced-rank regression, the learning coefficient cannot be derived in a closed form. Watanabe proposed a numerical method to estimate the learning coefficient, which Imai further refined to enhance its convergence properties. These methods utilize the asymptotic behavior of WBIC and have been shown to be statistically consistent as the sample size grows. In this paper, we propose a novel numerical estimation method that fundamentally differs from previous approaches and leverages a new quantity, "Empirical Loss," which was introduced by Watanabe. Through numerical experiments, we demonstrate that our proposed method exhibits both lower bias and lower variance compared to those of Watanabe and Imai. Additionally, we provide a theoretical analysis that elucidates why our method outperforms existing techniques and present empirical evidence that supports our findings.


Expert-Agnostic Learning to Defer

arXiv.org Artificial Intelligence

Recent advancements in this field have including the development of consistent surrogate losses for introduced features enabling flexibility to unseen training these systems (Mozannar & Sontag, 2021; Verma experts at test-time, but we find these approaches & Nalisnick, 2022), and extensions that allow for deferral have significant limitations. To address these, we to multiple experts (Verma et al., 2023). Recent work by introduce EA-L2D: Expert-Agnostic Learning to Tailor et al. (2024) proposed a meta-learning solution for Defer, a novel L2D framework that leverages a L2D systems that can adapt to experts not seen during the Bayesian approach to model expert behaviour in training regime through meta-learning representations of an expert-agnostic manner, facilitating optimal expert behaviours, enabling the system to quickly adapt to deferral decisions. EA-L2D offers several critical new experts using a small set of their example predictions, improvements over prior methods, including denoted context predictions. However, this approach exhibits the ability to incorporate prior knowledge about a key weakness in limited generalisation to experts experts, a reduced reliance on expert-annotated with expertise unseen during training. Additionally, their data, and robust performance when deferring to solution poses problems seen more widely in L2D literature, experts with expertise not seen during training.