Collaborating Authors

Bayesian Inference

Maximum Entropy competes with Maximum Likelihood Machine Learning

Maximum entropy (MAXENT) method has a large number of applications in theoretical and applied machine learning, since it provides a convenient non-parametric tool for estimating unknown probabilities. The method is a major contribution of statistical physics to probabilistic inference. However, a systematic approach towards its validity limits is currently missing. Here we study MAXENT in a Bayesian decision theory set-up, i.e. assuming that there exists a well-defined prior Dirichlet density for unknown probabilities, and that the average Kullback-Leibler (KL) distance can be employed for deciding on the quality and applicability of various estimators. These allow to evaluate the relevance of various MAXENT constraints, check its general applicability, and compare MAXENT with estimators having various degrees of dependence on the prior, viz. the regularized maximum likelihood (ML) and the Bayesian estimators. We show that MAXENT applies in sparse data regimes, but needs specific types of prior information. In particular, MAXENT can outperform the optimally regularized ML provided that there are prior rank correlations between the estimated random quantity and its probabilities.

Are we Forgetting about Compositional Optimisers in Bayesian Optimisation? Machine Learning

Bayesian optimisation presents a sample-efficient methodology for global optimisation. Within this framework, a crucial performance-determining subroutine is the maximisation of the acquisition function, a task complicated by the fact that acquisition functions tend to be non-convex and thus nontrivial to optimise. In this paper, we undertake a comprehensive empirical study of approaches to maximise the acquisition function. Additionally, by deriving novel, yet mathematically equivalent, compositional forms for popular acquisition functions, we recast the maximisation task as a compositional optimisation problem, allowing us to benefit from the extensive literature in this field. We highlight the empirical advantages of the compositional approach to acquisition function maximisation across 3958 individual experiments comprising synthetic optimisation tasks as well as tasks from Bayesmark. Given the generality of the acquisition function maximisation subroutine, we posit that the adoption of compositional optimisers has the potential to yield performance improvements across all domains in which Bayesian optimisation is currently being applied.

A connection between the pattern classification problem and the General Linear Model for statistical inference Machine Learning

A connection between the General Linear Model (GLM) in combination with classical statistical inference and the machine learning (MLE)-based inference is described in this paper. Firstly, the estimation of the GLM parameters is expressed as a Linear Regression Model (LRM) of an indicator matrix, that is, in terms of the inverse problem of regressing the observations. In other words, both approaches, i.e. GLM and LRM, apply to different domains, the observation and the label domains, and are linked by a normalization value at the least-squares solution. Subsequently, from this relationship we derive a statistical test based on a more refined predictive algorithm, i.e. the (non)linear Support Vector Machine (SVM) that maximizes the class margin of separation, within a permutation analysis. The MLE-based inference employs a residual score and includes the upper bound to compute a better estimation of the actual (real) error. Experimental results demonstrate how the parameter estimations derived from each model resulted in different classification performances in the equivalent inverse problem. Moreover, using real data the aforementioned predictive algorithms within permutation tests, including such model-free estimators, are able to provide a good trade-off between type I error and statistical power.

L\'evy walks derived from a Bayesian decision-making model in non-stationary environments Artificial Intelligence

L\'evy walks are found in the migratory behaviour patterns of various organisms, and the reason for this phenomenon has been much discussed. We use simulations to demonstrate that learning causes the changes in confidence level during decision-making in non-stationary environments, and results in L\'evy-walk-like patterns. One inference algorithm involving confidence is Bayesian inference. We propose an algorithm that introduces the effects of learning and forgetting into Bayesian inference, and simulate an imitation game in which two decision-making agents incorporating the algorithm estimate each other's internal models from their opponent's observational data. For forgetting without learning, agent confidence levels remained low due to a lack of information on the counterpart and Brownian walks occurred for a wide range of forgetting rates. Conversely, when learning was introduced, high confidence levels occasionally occurred even at high forgetting rates, and Brownian walks universally became L\'evy walks through a mixture of high- and low-confidence states.

Semantic Annotation for Tabular Data Artificial Intelligence

Detecting semantic concept of columns in tabular data is of particular interest to many applications ranging from data integration, cleaning, search to feature engineering and model building in machine learning. Recently, several works have proposed supervised learning-based or heuristic pattern-based approaches to semantic type annotation. Both have shortcomings that prevent them from generalizing over a large number of concepts or examples. Many neural network based methods also present scalability issues. Additionally, none of the known methods works well for numerical data. We propose $C^2$, a column to concept mapper that is based on a maximum likelihood estimation approach through ensembles. It is able to effectively utilize vast amounts of, albeit somewhat noisy, openly available table corpora in addition to two popular knowledge graphs to perform effective and efficient concept prediction for structured data. We demonstrate the effectiveness of $C^2$ over available techniques on 9 datasets, the most comprehensive comparison on this topic so far.

Variational Beam Search for Online Learning with Distribution Shifts Machine Learning

We consider the problem of online learning in the presence of sudden distribution shifts as frequently encountered in applications such as autonomous navigation. Distribution shifts require constant performance monitoring and re-training. They may also be hard to detect and can lead to a slow but steady degradation in model performance. To address this problem we propose a new Bayesian meta-algorithm that can both (i) make inferences about subtle distribution shifts based on minimal sequential observations and (ii) accordingly adapt a model in an online fashion. The approach uses beam search over multiple change point hypotheses to perform inference on a hierarchical sequential latent variable modeling framework. Our proposed approach is model-agnostic, applicable to both supervised and unsupervised learning, and yields significant improvements over state-of-the-art Bayesian online learning approaches.

Calibrated Adaptive Probabilistic ODE Solvers Machine Learning

Probabilistic solvers for ordinary differential equations (ODEs) assign a posterior measure to the solution of an initial value problem. The joint covariance of this distribution provides an estimate of the (global) approximation error. The contraction rate of this error estimate as a function of the solver's step size identifies it as a well-calibrated worst-case error. But its explicit numerical value for a certain step size, which depends on certain parameters of this class of solvers, is not automatically a good estimate of the explicit error. Addressing this issue, we introduce, discuss, and assess several probabilistically motivated ways to calibrate the uncertainty estimate. Numerical experiments demonstrate that these calibration methods interact efficiently with adaptive step-size selection, resulting in descriptive, and efficiently computable posteriors. We demonstrate the efficiency of the methodology by benchmarking against the classic, widely used Dormand-Prince 4/5 Runge-Kutta method.

Bayes Meets Entailment and Prediction: Commonsense Reasoning with Non-monotonicity, Paraconsistency and Predictive Accuracy Artificial Intelligence

The recent success of Bayesian methods in neuroscience and artificial intelligence gives rise to the hypothesis that the brain is a Bayesian machine. Since logic and learning are both practices of the human brain, it leads to another hypothesis that there is a Bayesian interpretation underlying both logical reasoning and machine learning. In this paper, we introduce a generative model of logical consequence relations. It formalises the process of how the truth value of a sentence is probabilistically generated from the probability distribution over states of the world. We show that the generative model characterises a classical consequence relation, paraconsistent consequence relation and nonmonotonic consequence relation. In particular, the generative model gives a new consequence relation that outperforms them in reasoning with inconsistent knowledge. We also show that the generative model gives a new classification algorithm that outperforms several representative algorithms in predictive accuracy and complexity on the Kaggle Titanic dataset.

Quantum d-separation and quantum belief propagation Artificial Intelligence

The goal of this paper is to generalize classical d-separation and classical Belief Propagation (BP) to the quantum realm. Classical d-separation is an essential ingredient of most of Judea Pearl's work. It is crucial to all 3 rungs of what Pearl calls the 3 rungs of Causation. So having a quantum version of d-separation and BP probably implies that most of Pearl's Bayesian networks work, including his theory of causality, can be translated in a straightforward manner to the quantum realm.

Learning Energy-Based Models by Diffusion Recovery Likelihood Machine Learning

While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained by maximizing the recovery likelihood: the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. The recovery likelihood objective is more tractable than the marginal likelihood objective, since it only requires MCMC sampling from a relatively concentrated conditional distribution. Moreover, we show that this estimation method is theoretically consistent: it learns the correct conditional and marginal distributions at each noise level, given sufficient data. After training, synthesized images can be generated efficiently by a sampling process that initializes from a spherical Gaussian distribution and progressively samples the conditional distributions at decreasingly lower noise levels. Our method generates high fidelity samples on various image datasets. On unconditional CIFAR-10 our method achieves FID 9.60 and inception score 8.58, superior to the majority of GANs. Moreover, we demonstrate that unlike previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets.