"AI systems–like people–must often act despite partial and uncertain information. First, the information received may be unreliable (e.g., a patient may mis-remember when a disease started, or may not have noticed a symptom that is important to a diagnosis). In addition, rules connecting real-world events can never include all the factors that might determine whether their conclusions really apply (e.g., the correctness of basing a diagnosis on a lab test depends whether there were conditions that might have caused a false positive, on the test being done correctly, on the results being associated with the right patient, etc.) Thus in order to draw useful conclusions, AI systems must be able to reason about the probability of events, given their current knowledge."
– from David Leake, Reasoning Under Uncertainty
I find myself coming back to the same few pictures when explaining basic machine learning concepts. Below is a list I find most illuminating. Plots of polynomials having various orders M, shown as red curves, fitted to the data set generated by the green curve. Why Bayesian inference embodies Occam's razor. This figure gives the basic intuition for why complex models can turn out to be less probable.
We introduce Tempered Geodesic Markov Chain Monte Carlo (TG-MCMC) algorithm for initializing pose graph optimization problems, arising in various scenarios such as SFM (structure from motion) or SLAM (simultaneous localization and mapping). TG-MCMC is first of its kind as it unites global non-convex optimization on the spherical manifold of quaternions with posterior sampling, in order to provide both reliable initial poses and uncertainty estimates that are informative about the quality of solutions. We devise theoretical convergence guarantees and extensively evaluate our method on synthetic and real benchmarks. Besides its elegance in formulation and theory, we show that our method is robust to missing data, noise and the estimated uncertainties capture intuitive properties of the data.
We present the very first robust Bayesian Online Changepoint Detection algorithm through General Bayesian Inference (GBI) with $\beta$-divergences. The resulting inference procedure is doubly robust for both the predictive and the changepoint (CP) posterior, with linear time and constant space complexity. We provide a construction for exponential models and demonstrate it on the Bayesian Linear Regression model. In so doing, we make two additional contributions: Firstly, we make GBI scalable using Structural Variational approximations that are exact as $\beta \to 0$. Secondly, we give a principled way of choosing the divergence parameter $\beta$ by minimizing expected predictive loss on-line. Reducing False Discovery Rates of \CPs from up to 99\% to 0\% on real world data, this offers the state of the art.
Did Richard Feynman help seed a key machine learning technique in the 60s? I struggled to learn machine learning. I was used to variational tricks, MCMC samplers, and discreet Taylor expansions from years of physics training. Now the concepts were mixed up. The intuitive models of physical systems were replaced by abstract models of'data' and amechanical patterns of cause and effect. I had to fit these fields together. Physics and machine learning are intricately connected, but it is taking me years to make the overlaps precise. This process requires representing the new with the familiar, mapping jargon from one field to another. A simple model of magnets--the Ising model--will help illustrate the rich connection between these fields. We first analyze this model with physics intuition. Then we derive the variational principle in physics and show that it recovers the same solution. We then discover how that very same variational principle in physics opens a window into machine learning.
We study the inference of a model of dynamic networks in which both communities and links keep memory of previous network states. By considering maximum likelihood inference from single snapshot observations of the network, we show that link persistence makes the inference of communities harder, decreasing the detectability threshold, while community persistence tends to make it easier. We analytically show that communities inferred from single network snapshot can share a maximum overlap with the underlying communities of a specific previous instant in time. This leads to time-lagged inference: the identification of past communities rather than present ones. Finally we compute the time lag and propose a corrected algorithm, the Lagged Snapshot Dynamic (LSD) algorithm, for community detection in dynamic networks. We analytically and numerically characterize the detectability transitions of such algorithm as a function of the memory parameters of the model and we make a comparison with a full dynamic inference.
Bayesian inference as applied in a legal setting is about belief transfer and involves a plurality of agents and communication protocols. A forensic expert (FE) may communicate to a trier of fact (TOF) first its value of a certain likelihood ratio with respect to FE's belief state as represented by a probability function on FE's proposition space. Subsequently FE communicates its recently acquired confirmation that a certain evidence proposition is true. Then TOF performs likelihood ratio transfer mediated reasoning thereby revising their own belief state. The logical principles involved in likelihood transfer mediated reasoning are discussed in a setting where probabilistic arithmetic is done within a meadow, and with Adams conditioning placed in a central role.
Learning-to-learn or meta-learning leverages data-driven inductive bias to increase the efficiency of learning on a novel task. This approach encounters difficulty when transfer is not mutually beneficial, for instance, when tasks are sufficiently dissimilar or change over time. Here, we use the connection between gradient-based meta-learning and hierarchical Bayes (Grant et al., 2018) to propose a mixture of hierarchical Bayesian models over the parameters of an arbitrary function approximator such as a neural network. Generalizing the model-agnostic meta-learning (MAML) algorithm (Finn et al., 2017), we present a stochastic expectation maximization procedure to jointly estimate parameter initializations for gradient descent as well as a latent assignment of tasks to initializations. This approach better captures the diversity of training tasks as opposed to consolidating inductive biases into a single set of hyperparameters. Our experiments demonstrate better generalization performance on the standard miniImageNet benchmark for 1-shot classification. We further derive a novel and scalable non-parametric variant of our method that captures the evolution of a task distribution over time as demonstrated on a set of few-shot regression tasks.
Binary data matrices can represent many types of data such as social networks, votes or gene expression. In some cases, the analysis of binary matrices can be tackled with nonnegative matrix factorization (NMF), where the observed data matrix is approximated by the product of two smaller nonnegative matrices. In this context, probabilistic NMF assumes a generative model where the data is usually Bernoulli-distributed. Often, a link function is used to map the factorization to the $[0,1]$ range, ensuring a valid Bernoulli mean parameter. However, link functions have the potential disadvantage to lead to uninterpretable models. Mean-parameterized NMF, on the contrary, overcomes this problem. We propose a unified framework for Bayesian mean-parameterized nonnegative binary matrix factorization models (NBMF). We analyze three models which correspond to three possible constraints that respect the mean-parametrization without the need for link functions. Furthermore, we derive a novel collapsed Gibbs sampler and a collapsed variational algorithm to infer the posterior distribution of the factors. Next, we extend the proposed models to a nonparametric setting where the number of used latent dimensions is automatically driven by the observed data. We analyze the performance of our NBMF methods in multiple datasets for different tasks such as dictionary learning and prediction of missing data. Experiments show that our methods provide similar or superior results than the state of the art, while automatically detecting the number of relevant components.
We study the connections between spectral clustering and the problems of maximum margin clustering, and estimation of the components of level sets of a density function. Specifically, we obtain bounds on the eigenvectors of graph Laplacian matrices in terms of the between cluster separation, and within cluster connectivity. These bounds ensure that the spectral clustering solution converges to the maximum margin clustering solution as the scaling parameter is reduced towards zero. The sensitivity of maximum margin clustering solutions to outlying points is well known, but can be mitigated by first removing such outliers, and applying maximum margin clustering to the remaining points. If outliers are identified using an estimate of the underlying probability density, then the remaining points may be seen as an estimate of a level set of this density function. We show that such an approach can be used to consistently estimate the components of the level sets of a density function under very mild assumptions.
We focus on variational inference in dynamical systems where the discrete time transition function (or evolution rule) is modelled by a Gaussian process. The dominant approach so far has been to use a factorised posterior distribution, decoupling the transition function from the system states. This is not exact in general and can lead to an overconfident posterior over the transition function as well as an overestimation of the intrinsic stochasticity of the system (process noise). We propose a new method that addresses these issues and incurs no additional computational costs.