Bayesian Inference
Reviews: Deep Generative Markov State Models
This paper proposes a novel learning framework for Markov state models of real-valued vectors. The model can handle metastable processes, i.e. processes that evolve locally on short time-scales but switch between a few clusters after very long periods. The proposed framework is based on a nice idea: decompose the transition from x1 to x2 into the probability that x1 belongs to a long-lived state and a distribution of x2 given that state. The first conditional probability is modeled using a decoding deep network, whereas the second can be represented either using a network that assigns weights to x2 or using a generative neural network. This is a very interesting manuscript.
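The decomposition described in this review can be written out as a mixture over long-lived states. The symbols below (M states, membership probabilities \chi_m, per-state densities q_m) are assumed notation for illustration and are not taken from the review itself:

```latex
% Assumed notation: M long-lived states, \chi_m(x_1) = probability that x_1
% belongs to state m, q_m(x_2) = density of x_2 given state m.
p(x_2 \mid x_1) \;=\; \sum_{m=1}^{M} \chi_m(x_1)\, q_m(x_2),
\qquad \chi_m(x_1) \ge 0,\quad \sum_{m=1}^{M} \chi_m(x_1) = 1,
\quad \int q_m(x_2)\, \mathrm{d}x_2 = 1 .
```

Under these constraints the right-hand side is a valid conditional density, since it is a convex combination of normalized densities.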
Reviews: Cluster Variational Approximations for Structure Learning of Continuous-Time Bayesian Networks from Incomplete Data
The paper introduces a generalization of previous variational methods for inference with jump processes; here, the proposed approximating measure for the posterior relies on a star approximation. In application to continuous-time Bayesian networks, this means isolating clusters of nodes across children and parents, in order to build an efficient approximation to the traditional variational lower bound. The paper further presents examples and experiments that show how the proposed approach can be adapted to structure learning tasks in continuous-time settings. This is an interesting and topical contribution likely to appeal to the statistical and probabilistic community within NIPS. The paper is, overall, well written and reasonably well structured. It offers a good background on previous work, helps the reader understand its relevance, and puts its results in context within the existing literature.
Reviews: Clone MCMC: Parallel High-Dimensional Gaussian Gibbs Sampling
This paper proposes a new parallel approximate sampler for high-dimensional Gaussian distributions. The algorithm is a special case of a larger class of iterative samplers based on a transition equation (2) and matrix splitting that is analysed in [9]. The algorithm is similar to the Hogwild sampler in terms of the update formula and the way the bias is analysed, but it is more flexible in the sense that it has a scalar parameter to trade off the bias and variance of the proposed sampler. I appreciate the detailed introduction to the mathematical background of this family of sampling algorithms and to related work. The paper is also easy to follow, and the merit of the proposed algorithm is easy to understand. The illustration of the decomposition of the variance and bias in Figure 1 gives a clear explanation of the role of \eta.
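To make the "matrix splitting" family mentioned in this review concrete, here is a minimal sketch of one well-known member: the Gauss-Seidel splitting, which reproduces ordinary sequential Gibbs sampling for a Gaussian with precision matrix J and potential vector h. This is not the paper's clone sampler and does not include its \eta parameter; the function name `splitting_sampler` and the choice of splitting are assumptions made for illustration.

```python
import numpy as np

def splitting_sampler(J, h, n_iter, rng):
    """Sample from N(J^{-1} h, J^{-1}) via the splitting J = M - N.

    Assumed splitting (Gauss-Seidel): M is the lower triangle of J
    including the diagonal. The injected noise then has covariance
    M^T + N = diag(J), and each sweep equals one sequential Gibbs sweep.
    """
    M = np.tril(J)                       # lower-triangular part of J
    N = M - J                            # so that J = M - N
    noise_std = np.sqrt(np.diag(M.T + N))  # M^T + N is diagonal here
    d = len(h)
    x = np.zeros(d)
    samples = np.empty((n_iter, d))
    for k in range(n_iter):
        c = noise_std * rng.standard_normal(d)
        # Transition: M x_new = N x_old + h + c
        x = np.linalg.solve(M, N @ x + h + c)
        samples[k] = x
    return samples

# Small usage example on a 2-dimensional target.
J_demo = np.array([[2.0, -1.0], [-1.0, 2.0]])
h_demo = np.array([1.0, 0.0])
demo_samples = splitting_sampler(J_demo, h_demo, 20000, np.random.default_rng(0))[100:]
```

With this splitting the chain targets the exact distribution; approximate parallel schemes of the kind the review discusses use other splittings, which are easier to parallelize but introduce bias.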
Reviews: Nonparametric learning from Bayesian models with randomized objective functions
The idea: You want to do Bayesian inference on a parameter theta, with prior pi(theta) and parametric likelihood f_theta, but you're not sure if the likelihood is correctly specified. So put a nonparametric prior on the sampling distribution: a mixture of Dirichlet processes centered at f_theta with mixing distribution pi(theta). The concentration parameter of the DP provides a sliding scale between vanilla Bayesian inference (total confidence in the parametric model) and Bayesian bootstrap (no confidence at all, use the empirical distribution). This is a simple idea, but the paper presents it lucidly and compellingly, beginning with a diverse list of potential applications: the method may be viewed as regularization of a nonparametric Bayesian model towards a parametric one; as robustification of a parametric Bayesian model to misspecification; as a means of correcting a variational approximation; or as nonparametric decision theory, when the log-likelihood is swapped out for an arbitrary utility function. As for implementation, the procedure requires (1) sampling from the parametric Bayesian posterior distribution and (2) performing a p-dimensional maximization, where p is the dimension of theta.
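The two implementation steps named in this review can be sketched for a toy model. The example below assumes a Gaussian likelihood with unit variance and a zero-mean Gaussian prior on theta, so that the posterior is available in closed form and the weighted maximization reduces to a weighted mean. The use of pseudo-samples drawn from the parametric model and of Dirichlet weights is one reading of the procedure, not something the review states; `posterior_bootstrap`, `n_pseudo`, and `prior_var` are hypothetical names.

```python
import numpy as np

def posterior_bootstrap(x, c, n_draws, rng, n_pseudo=50, prior_var=10.0):
    """Toy sketch: model x_i ~ N(theta, 1), prior theta ~ N(0, prior_var).

    c plays the role of the concentration parameter: c = 0 ignores the
    parametric model (Bayesian bootstrap), large c leans on it heavily.
    """
    n = len(x)
    post_var = 1.0 / (n + 1.0 / prior_var)   # conjugate Gaussian posterior
    post_mean = post_var * x.sum()
    draws = np.empty(n_draws)
    for i in range(n_draws):
        if c > 0:
            # Step (1): sample theta from the parametric posterior, then
            # draw pseudo-observations from the model at that theta.
            theta = rng.normal(post_mean, np.sqrt(post_var))
            pseudo = rng.normal(theta, 1.0, size=n_pseudo)
            points = np.concatenate([x, pseudo])
            alpha = np.concatenate([np.ones(n), np.full(n_pseudo, c / n_pseudo)])
        else:
            points, alpha = x, np.ones(n)
        w = rng.dirichlet(alpha)
        # Step (2): maximize the weighted log-likelihood over theta.
        # For this model the maximizer is the weighted mean (weights sum to 1).
        draws[i] = np.sum(w * points)
    return draws

rng_demo = np.random.default_rng(1)
x_demo = rng_demo.normal(2.0, 1.0, size=200)
bb_draws = posterior_bootstrap(x_demo, c=0.0, n_draws=500, rng=rng_demo)
mixed_draws = posterior_bootstrap(x_demo, c=20.0, n_draws=500, rng=rng_demo)
```

For a model without a closed-form weighted maximizer, step (2) would call a numerical optimizer over the p-dimensional parameter instead.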
Reviews: Generalizing Tree Probability Estimation via Bayesian Networks
In this paper the authors propose an efficient method for tree probability estimation (given a collection of trees) that relies on the description of trees as subsplit Bayesian networks. Through this representation, the authors relax the classic conditional clade distribution, which assumes that sister clades are independent given their parent, and assume instead that sister subsplits are independent given their parent subsplit, thus allowing more dependence structure between sister clades. The authors first present a simple maximum likelihood estimation algorithm for rooted trees, and then propose two alternatives to generalize their work to unrooted trees. Finally, they illustrate their method in both simulated and real-data experiments. I think this paper is very well written; in particular, I greatly appreciated the Background and SBN description sections, which use a simple though non-trivial example to introduce new notions and provide useful insights into the assumptions.
A New Architecture for Neural Enhanced Multiobject Tracking
Wei, Shaoxiu, Liang, Mingchao, Meyer, Florian
Multiobject tracking (MOT) is an important task in robotics, autonomous driving, and maritime surveillance. Traditional work on MOT is model-based and aims to establish algorithms in the framework of sequential Bayesian estimation. More recent methods are fully data-driven and rely on the training of neural networks. The two approaches have demonstrated advantages in certain scenarios. In particular, in problems where plenty of labeled data for the training of neural networks is available, data-driven MOT tends to have advantages compared to traditional methods. A natural thought is whether a general and efficient framework can integrate the two approaches. This paper advances a recently introduced hybrid model-based and data-driven method called neural-enhanced belief propagation (NEBP). Compared to existing work on NEBP for MOT, it introduces a novel neural architecture that can improve data association and new object initialization, two critical aspects of MOT. The proposed tracking method is leading the nuScenes LiDAR-only tracking challenge at the time of submission of this paper.
Robust Domain Generalisation with Causal Invariant Bayesian Neural Networks
Gendron, Gaël, Witbrock, Michael, Dobbie, Gillian
Deep neural networks can obtain impressive performance on various tasks under the assumption that their training domain is identical to their target domain. Performance can drop dramatically when this assumption does not hold. One explanation for this discrepancy is the presence of spurious domain-specific correlations in the training data that the network exploits. Causal mechanisms, on the other hand, can be made invariant under distribution changes as they allow disentangling the factors of distribution underlying the data generation. Yet, learning causal mechanisms to improve out-of-distribution generalisation remains an under-explored area. We propose a Bayesian neural architecture that disentangles the learning of the data distribution from the inference process mechanisms. We show theoretically and experimentally that our model approximates reasoning under causal interventions. We demonstrate the performance of our method, outperforming point-estimate counterparts, on out-of-distribution image recognition tasks where the data distribution acts as a strong adversarial confounder.
Compositional Risk Minimization
Mahajan, Divyat, Pezeshki, Mohammad, Mitliagkas, Ioannis, Ahuja, Kartik, Vincent, Pascal
In this work, we tackle a challenging and extreme form of subpopulation shift, which is termed compositional shift. Under compositional shifts, some combinations of attributes are totally absent from the training distribution but present in the test distribution. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.
RL, but don't do anything I wouldn't do
Cohen, Michael K., Hutter, Marcus, Bengio, Yoshua, Russell, Stuart
In reinforcement learning, if the agent's reward differs from the designers' true utility, even only rarely, the state distribution resulting from the agent's policy can be very bad, in theory and in practice. When RL policies would devolve into undesired behavior, a common countermeasure is KL regularization to a trusted policy ("Don't do anything I wouldn't do"). All current cutting-edge language models are RL agents that are KL-regularized to a "base policy" that is purely predictive. Unfortunately, we demonstrate that when this base policy is a Bayesian predictive model of a trusted policy, the KL constraint is no longer reliable for controlling the behavior of an advanced RL agent. We demonstrate this theoretically using algorithmic information theory, and while systems today are too weak to exhibit this theorized failure precisely, we RL-finetune a language model and find evidence that our formal results are plausibly relevant in practice. We also propose a theoretical alternative that avoids this problem by replacing the "Don't do anything I wouldn't do" principle with "Don't do anything I mightn't do".
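As a small illustration of KL regularization to a base policy, the sketch below uses the standard closed-form solution of a single-step problem over a finite action set: maximizing expected reward minus beta times KL(pi || base) gives pi proportional to base * exp(reward / beta). This is a textbook toy setting, not the paper's construction or experiments; `kl_regularized_policy` and the example numbers are hypothetical.

```python
import numpy as np

def kl_regularized_policy(base, reward, beta):
    """Maximizer of E_pi[reward] - beta * KL(pi || base) over a finite action set.

    Closed form: pi(a) is proportional to base(a) * exp(reward(a) / beta).
    Actions with base probability exactly zero keep probability zero.
    """
    base = np.asarray(base, dtype=float)
    reward = np.asarray(reward, dtype=float)
    safe_log = np.log(np.where(base > 0, base, 1.0))  # avoid log(0) warnings
    logits = np.where(base > 0, safe_log + reward / beta, -np.inf)
    logits = logits - logits.max()                    # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Action 2 has very high reward. A base policy that rules it out keeps it out.
strict = kl_regularized_policy([0.5, 0.5, 0.0], [0.0, 1.0, 100.0], beta=1.0)
# A base policy that merely considers it unlikely does not.
hedged = kl_regularized_policy([0.499, 0.499, 0.002], [0.0, 1.0, 100.0], beta=1.0)
```

In this toy example, a base policy that assigns even a small nonzero probability to the high-reward action lets the regularized policy concentrate almost all of its mass there, which is the kind of gap between "wouldn't do" and "mightn't do" that the abstract points to.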