Goto

Collaborating Authors

 Bayesian Learning


Bayesian sparsity and class sparsity priors for dictionary learning and coding

arXiv.org Machine Learning

Dictionary learning methods continue to gain popularity for the solution of challenging inverse problems. In the dictionary learning approach, the computational forward model is replaced by a large dictionary of possible outcomes, and the problem is to identify the dictionary entries that best match the data, akin to traditional query matching in search engines. Sparse coding techniques are used to guarantee that the dictionary matching identifies only few of the dictionary entries, and dictionary compression methods are used to reduce the complexity of the matching problem. In this article, we propose a work flow to facilitate the dictionary matching process. First, the full dictionary is divided into subdictionaries that are separately compressed. The error introduced by the dictionary compression is handled in the Bayesian framework as a modeling error. Furthermore, we propose a new Bayesian data-driven group sparsity coding method to help identify subdictionaries that are not relevant for the dictionary matching. After discarding irrelevant subdictionaries, the dictionary matching is addressed as a deflated problem using sparse coding. The compression and deflation steps can lead to substantial decreases of the computational complexity. The effectiveness of compensating for the dictionary compression error and using the novel group sparsity promotion to deflate the original dictionary are illustrated by applying the methodology to real world problems, the glitch detection in the LIGO experiment and hyperspectral remote sensing.


Minimalistic Collective Perception with Imperfect Sensors

arXiv.org Artificial Intelligence

Collective perception is a foundational problem in swarm robotics, in which the swarm must reach consensus on a coherent representation of the environment. An important variant of collective perception casts it as a best-of-$n$ decision-making process, in which the swarm must identify the most likely representation out of a set of alternatives. Past work on this variant primarily focused on characterizing how different algorithms navigate the speed-vs-accuracy tradeoff in a scenario where the swarm must decide on the most frequent environmental feature. Crucially, past work on best-of-$n$ decision-making assumes the robot sensors to be perfect (noise- and fault-less), limiting the real-world applicability of these algorithms. In this paper, we derive from first principles an optimal, probabilistic framework for minimalistic swarm robots equipped with flawed sensors. Then, we validate our approach in a scenario where the swarm collectively decides the frequency of a certain environmental feature. We study the speed and accuracy of the decision-making process with respect to several parameters of interest. Our approach can provide timely and accurate frequency estimates even in presence of severe sensory noise.


Learning multi-modal generative models with permutation-invariant encoders and tighter variational bounds

arXiv.org Machine Learning

Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research. Multi-modal Variational Autoencoders (VAEs) have been a popular generative model class that learns latent representations which jointly explain multiple modalities. Various objective functions for such models have been suggested, often motivated as lower bounds on the multi-modal data log-likelihood or from information-theoretic considerations. In order to encode latent variables from different modality subsets, Product-of-Experts (PoE) or Mixture-of-Experts (MoE) aggregation schemes have been routinely used and shown to yield different trade-offs, for instance, regarding their generative quality or consistency across multiple modalities. In this work, we consider a variational bound that can tightly lower bound the data log-likelihood. We develop more flexible aggregation schemes that generalise PoE or MoE approaches by combining encoded features from different modalities based on permutation-invariant neural networks. Our numerical experiments illustrate trade-offs for multi-modal variational bounds and various aggregation schemes. We show that tighter variational bounds and more flexible aggregation models can become beneficial when one wants to approximate the true joint distribution over observed modalities and latent variables in identifiable models.


Generalised Bayesian Inference for Discrete Intractable Likelihood

arXiv.org Machine Learning

Discrete state spaces represent a major computational challenge to statistical inference, since the computation of normalisation constants requires summation over large or possibly infinite sets, which can be impractical. This paper addresses this computational challenge through the development of a novel generalised Bayesian inference procedure suitable for discrete intractable likelihood. Inspired by recent methodological advances for continuous data, the main idea is to update beliefs about model parameters using a discrete Fisher divergence, in lieu of the problematic intractable likelihood. The result is a generalised posterior that can be sampled from using standard computational tools, such as Markov chain Monte Carlo, circumventing the intractable normalising constant. The statistical properties of the generalised posterior are analysed, with sufficient conditions for posterior consistency and asymptotic normality established. In addition, a novel and general approach to calibration of generalised posteriors is proposed. Applications are presented on lattice models for discrete spatial data and on multivariate models for count data, where in each case the methodology facilitates generalised Bayesian inference at low computational cost.


Adaptive Uncertainty-Guided Model Selection for Data-Driven PDE Discovery

arXiv.org Artificial Intelligence

We propose a new parameter-adaptive uncertainty-penalized Bayesian information criterion (UBIC) to prioritize the parsimonious partial differential equation (PDE) that sufficiently governs noisy spatial-temporal observed data with few reliable terms. Since the naive use of the BIC for model selection has been known to yield an undesirable overfitted PDE, the UBIC penalizes the found PDE not only by its complexity but also the quantified uncertainty, derived from the model supports' coefficient of variation in a probabilistic view. We also introduce physics-informed neural network learning as a simulation-based approach to further validate the selected PDE flexibly against the other discovered PDE. Numerical results affirm the successful application of the UBIC in identifying the true governing PDE. Additionally, we reveal an interesting effect of denoising the observed data on improving the trade-off between the BIC score and model complexity. Code is available at https://github.com/Pongpisit-Thanasutives/UBIC.


Information Theoretically Optimal Sample Complexity of Learning Dynamical Directed Acyclic Graphs

arXiv.org Machine Learning

In this article, the optimal sample complexity of learning the underlying interaction/dependencies of a Linear Dynamical System (LDS) over a Directed Acyclic Graph (DAG) is studied. The sample complexity of learning a DAG's structure is well-studied for static systems, where the samples of nodal states are independent and identically distributed (i.i.d.). However, such a study is less explored for DAGs with dynamical systems, where the nodal states are temporally correlated. We call such a DAG underlying an LDS as \emph{dynamical} DAG (DDAG). In particular, we consider a DDAG where the nodal dynamics are driven by unobserved exogenous noise sources that are wide-sense stationary (WSS) in time but are mutually uncorrelated, and have the same {power spectral density (PSD)}. Inspired by the static settings, a metric and an algorithm based on the PSD matrix of the observed time series are proposed to reconstruct the DDAG. The equal noise PSD assumption can be relaxed such that identifiability conditions for DDAG reconstruction are not violated. For the LDS with WSS (sub) Gaussian exogenous noise sources, it is shown that the optimal sample complexity (or length of state trajectory) needed to learn the DDAG is $n=\Theta(q\log(p/q))$, where $p$ is the number of nodes and $q$ is the maximum number of parents per node. To prove the sample complexity upper bound, a concentration bound for the PSD estimation is derived, under two different sampling strategies. A matching min-max lower bound using generalized Fano's inequality also is provided, thus showing the order optimality of the proposed algorithm.


Scalable and adaptive variational Bayes methods for Hawkes processes

arXiv.org Machine Learning

Hawkes processes are often applied to model dependence and interaction phenomena in multivariate event data sets, such as neuronal spike trains, social interactions, and financial transactions. In the nonparametric setting, learning the temporal dependence structure of Hawkes processes is generally a computationally expensive task, all the more with Bayesian estimation methods. In particular, for generalised nonlinear Hawkes processes, Monte-Carlo Markov Chain methods applied to compute the doubly intractable posterior distribution are not scalable to high-dimensional processes in practice. Recently, efficient algorithms targeting a mean-field variational approximation of the posterior distribution have been proposed. In this work, we first unify existing variational Bayes approaches under a general nonparametric inference framework, and analyse the asymptotic properties of these methods under easily verifiable conditions on the prior, the variational class, and the nonlinear model. Secondly, we propose a novel sparsity-inducing procedure, and derive an adaptive mean-field variational algorithm for the popular sigmoid Hawkes processes. Our algorithm is parallelisable and therefore computationally efficient in high-dimensional setting. Through an extensive set of numerical simulations, we also demonstrate that our procedure is able to adapt to the dimensionality of the parameter of the Hawkes process, and is partially robust to some type of model mis-specification. Keywords: temporal point processes, bayesian nonparametrics, connectivity graph, variational approximation.


Hypernetwork approach to Bayesian MAML

arXiv.org Artificial Intelligence

The main goal of Few-Shot learning algorithms is to enable learning from small amounts of data. One of the most popular and elegant Few-Shot learning approaches is Model-Agnostic Meta-Learning (MAML). The main idea behind this method is to learn the shared universal weights of a meta-model, which are then adapted for specific tasks. However, the method suffers from over-fitting and poorly quantifies uncertainty due to limited data size. Bayesian approaches could, in principle, alleviate these shortcomings by learning weight distributions in place of point-wise weights. Unfortunately, previous modifications of MAML are limited due to the simplicity of Gaussian posteriors, MAML-like gradient-based weight updates, or by the same structure enforced for universal and adapted weights. In this paper, we propose a novel framework for Bayesian MAML called BayesianHMAML, which employs Hypernetworks for weight updates. It learns the universal weights point-wise, but a probabilistic structure is added when adapted for specific tasks. In such a framework, we can use simple Gaussian distributions or more complicated posteriors induced by Continuous Normalizing Flows.


Minimal Assumptions for Optimal Serology Classification: Theory and Implications for Multidimensional Settings and Impure Training Data

arXiv.org Machine Learning

Minimizing error in prevalence estimates and diagnostic classifiers remains a challenging task in serology. In theory, these problems can be reduced to modeling class-conditional probability densities (PDFs) of measurement outcomes, which control all downstream analyses. However, this task quickly succumbs to the curse of dimensionality, even for assay outputs with only a few dimensions (e.g. target antigens). To address this problem, we propose a technique that uses empirical training data to classify samples and estimate prevalence in arbitrary dimension without direct access to the conditional PDFs. We motivate this method via a lemma that relates relative conditional probabilities to minimum-error classification boundaries. This leads us to formulate an optimization problem that: (i) embeds the data in a parameterized, curved space; (ii) classifies samples based on their position relative to a coordinate axis; and (iii) subsequently optimizes the space by minimizing the empirical classification error of pure training data, for which the classes are known. Interestingly, the solution to this problem requires use of a homotopy-type method to stabilize the optimization. We then extend the analysis to the case of impure training data, for which the classes are unknown. We find that two impure datasets suffice for both prevalence estimation and classification, provided they satisfy a linear independence property. Lastly, we discuss how our analysis unifies discriminative and generative learning techniques in a common framework based on ideas from set and measure theory. Throughout, we validate our methods in the context of synthetic data and a research-use SARS-CoV-2 enzyme-linked immunosorbent (ELISA) assay.


Likelihood-based inference and forecasting for trawl processes: a stochastic optimization approach

arXiv.org Machine Learning

We consider trawl processes, which are stationary and infinitely divisible stochastic processes and can describe a wide range of statistical properties, such as heavy tails and long memory. In this paper, we develop the first likelihood-based methodology for the inference of real-valued trawl processes and introduce novel deterministic and probabilistic forecasting methods. Being non-Markovian, with a highly intractable likelihood function, trawl processes require the use of composite likelihood functions to parsimoniously capture their statistical properties. We formulate the composite likelihood estimation as a stochastic optimization problem for which it is feasible to implement iterative gradient descent methods. We derive novel gradient estimators with variances that are reduced by several orders of magnitude. We analyze both the theoretical properties and practical implementation details of these estimators and release a Python library which can be used to fit a large class of trawl processes. In a simulation study, we demonstrate that our estimators outperform the generalized method of moments estimators in terms of both parameter estimation error and out-of-sample forecasting error. Finally, we formalize a stochastic chain rule for our gradient estimators. We apply the new theory to trawl processes and provide a unified likelihood-based methodology for the inference of both real-valued and integer-valued trawl processes.