Goto

Collaborating Authors

 Bayesian Learning


Bayesian Learning via Q-Exponential Process

Neural Information Processing Systems

Regularization is one of the most fundamental topics in optimization, statistics and machine learning. To get sparsity in estimating a parameter u\in\mathbb{R} d, an \ell_q penalty term, \Vert u\Vert_q, is usually added to the objective function. What is the probabilistic distribution corresponding to such \ell_q penalty? What is the \emph{correct} stochastic process corresponding to \Vert u\Vert_q when we model functions u\in L q? This is important for statistically modeling high-dimensional objects such as images, with penalty to preserve certainty properties, e.g.


PandaSkill - Player Performance and Skill Rating in Esports: Application to League of Legends

arXiv.org Artificial Intelligence

To take the esports scene to the next level, we introduce PandaSkill, a framework for assessing player performance and skill rating. Traditional rating systems like Elo and TrueSkill often overlook individual contributions and face challenges in professional esports due to limited game data and fragmented competitive scenes. PandaSkill leverages machine learning to estimate in-game player performance from individual player statistics. Each in-game role is modeled independently, ensuring a fair comparison between them. Then, using these performance scores, PandaSkill updates the player skill ratings using the Bayesian framework OpenSkill in a free-for-all setting. In this setting, skill ratings are updated solely based on performance scores rather than game outcomes, hightlighting individual contributions. To address the challenge of isolated rating pools that hinder cross-regional comparisons, PandaSkill introduces a dual-rating system that combines players' regional ratings with a meta-rating representing each region's overall skill level. Applying PandaSkill to five years of professional League of Legends matches worldwide, we show that our method produces skill ratings that better predict game outcomes and align more closely with expert opinions compared to existing methods.


The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities

arXiv.org Machine Learning

While internet-scale data often comes in pairs (e.g., audio/image, image/text), we often want to perform inferences over modalities unseen together in the training data (e.g., audio/text). Empirically, this can often be addressed by learning multiple contrastive embedding spaces between existing modality pairs, implicitly hoping that unseen modality pairs will end up being aligned. This theoretical paper proves that this hope is well founded, under certain assumptions. Starting with the proper Bayesian approach of integrating out intermediate modalities, we show that directly comparing the representations of data from unpaired modalities can recover the same likelihood ratio. Our analysis builds on prior work on the geometry and probabilistic interpretation of contrastive representations, showing how these representations can answer many of the same inferences as probabilistic graphical models. Our analysis suggests two new ways of using contrastive representations: in settings with pre-trained contrastive models, and for handling language ambiguity in reinforcement learning. Our numerical experiments study the importance of our assumptions and demonstrate these new applications. Much of the appeal of contrastive learning is that it gives a "plug-n-play" approach for swapping one modality for another. Because representations from different modalities are trained to be aligned when representing the same object, the hope is that (say) a language representation and image representation of the same scene can be used as substitutes. This property is practically appealing for at least two reasons. First, it allows us to make use of pre-trained models. If you have a model that wants to make use of (say) language input and you have access to a pre-trained image-language contrastive model, you might simply train your model on the pre-trained image representations and hope that it will continue to work when you swap in the language representations.


Can Bayesian Neural Networks Make Confident Predictions?

arXiv.org Machine Learning

Bayesian inference promises a framework for principled uncertainty quantification of neural network predictions. Barriers to adoption include the difficulty of fully characterizing posterior distributions on network parameters and the interpretability of posterior predictive distributions. We demonstrate that under a discretized prior for the inner layer weights, we can exactly characterize the posterior predictive distribution as a Gaussian mixture. This setting allows us to define equivalence classes of network parameter values which produce the same likelihood (training error) and to relate the elements of these classes to the network's scaling regime -- defined via ratios of the training sample size, the size of each layer, and the number of final layer parameters. Of particular interest are distinct parameter realizations that map to low training error and yet correspond to distinct modes in the posterior predictive distribution. We identify settings that exhibit such predictive multimodality, and thus provide insight into the accuracy of unimodal posterior approximations. We also characterize the capacity of a model to "learn from data" by evaluating contraction of the posterior predictive in different scaling regimes.


When Demonstrations meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning

Neural Information Processing Systems

Offline inverse reinforcement learning (Offline IRL) aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task has applications in safety-sensitive applications such as clinical decision making and autonomous driving. However, the structure of an expert's preferences implicit in observed actions is closely linked to the expert's model of the environment dynamics (i.e. the world''). Thus, inaccurate models of the world obtained from finite data with limited coverage could compound inaccuracy in estimated rewards. To address this issue, we propose a bi-level optimization formulation of the estimation task wherein the upper level is likelihood maximization based upon a conservative model of the expert's policy (lower level). The policy model is conservative in that it maximizes reward subject to a penalty that is increasing in the uncertainty of the estimated model of the world.


A Bayesian Approach To Analysing Training Data Attribution In Deep Learning

Neural Information Processing Systems

Training data attribution (TDA) techniques find influential training data for the model's prediction on the test data of interest. While conceptually useful, they are hardly applicable to deep models in practice, particularly because of their sensitivity to different model initialisation. In this paper, we introduce a Bayesian perspective on the TDA task, where the learned model is treated as a Bayesian posterior and the TDA estimates as random variables. From this novel viewpoint, we observe that the influence of an individual training sample is often overshadowed by the noise stemming from model initialisation and SGD batch composition. Based on this observation, we argue that TDA can only be reliably used for explaining deep model predictions that are consistently influenced by certain training data, independent of other noise factors.


Distributionally Robust Skeleton Learning of Discrete Bayesian Networks

Neural Information Processing Systems

We consider the problem of learning the exact skeleton of general discrete Bayesian networks from potentially corrupted data. Building on distributionally robust optimization and a regression approach, we propose to optimize the most adverse risk over a family of distributions within bounded Wasserstein distance or KL divergence to the empirical distribution. The proposed approach applies for general categorical random variables without assuming faithfulness, an ordinal relationship or a specific form of conditional distribution. We present efficient algorithms and show the proposed methods are closely related to the standard regularized regression approach. Under mild assumptions, we derive non-asymptotic guarantees for successful structure learning with logarithmic sample complexities for bounded-degree graphs.


Learning via Wasserstein-Based High Probability Generalisation Bounds

Neural Information Processing Systems

Minimising upper bounds on the population risk or the generalisation gap has been widely used in structural risk minimisation (SRM) -- this is in particular at the core of PAC-Bayesian learning. Despite its successes and unfailing surge of interest in recent years, a limitation of the PAC-Bayesian framework is that most bounds involve a Kullback-Leibler (KL) divergence term (or its variations), which might exhibit erratic behavior and fail to capture the underlying geometric structure of the learning problem -- hence restricting its use in practical applications.As a remedy, recent studies have attempted to replace the KL divergence in the PAC-Bayesian bounds with the Wasserstein distance. Even though these bounds alleviated the aforementioned issues to a certain extent, they either hold in expectation, are for bounded losses, or are nontrivial to minimize in an SRM framework. In this work, we contribute to this line of research and prove novel Wasserstein distance-based PAC-Bayesian generalisation bounds for both batch learning with independent and identically distributed (i.i.d.) data, and online learning with potentially non-i.i.d. Contrary to previous art, our bounds are stronger in the sense that (i) they hold with high probability, (ii) they apply to unbounded (potentially heavy-tailed) losses, and (iii) they lead to optimizable training objectives that can be used in SRM.


Causal-BALD: Deep Bayesian Active Learning of Outcomes to Infer Treatment-Effects from Observational Data

Neural Information Processing Systems

Estimating personalized treatment effects from high-dimensional observational data is essential in situations where experimental designs are infeasible, unethical, or expensive. Existing approaches rely on fitting deep models on outcomes observed for treated and control populations. However, when measuring individual outcomes is costly, as is the case of a tumor biopsy, a sample-efficient strategy for acquiring each result is required. Deep Bayesian active learning provides a framework for efficient data acquisition by selecting points with high uncertainty. However, existing methods bias training data acquisition towards regions of non-overlapping support between the treated and control populations.


Adjusting for Autocorrelated Errors in Neural Networks for Time Series

Neural Information Processing Systems

An increasing body of research focuses on using neural networks to model time series. A common assumption in training neural networks via maximum likelihood estimation on time series is that the errors across time steps are uncorrelated. However, errors are actually autocorrelated in many cases due to the temporality of the data, which makes such maximum likelihood estimations inaccurate. In this paper, in order to adjust for autocorrelated errors, we propose to learn the autocorrelation coefficient jointly with the model parameters. In our experiments, we verify the effectiveness of our approach on time series forecasting.