Bayesian Inference
Frequentist Consistency of Generalized Variational Inference
This paper investigates Frequentist consistency properties of the posterior distributions constructed via Generalized Variational Inference (GVI). A number of generic and novel strategies are given for proving consistency, relying on the theory of $\Gamma$-convergence. Specifically, this paper shows that under minimal regularity conditions, the sequence of GVI posteriors is consistent and collapses to a point mass at the population-optimal parameter value as the number of observations goes to infinity. The results extend to the latent variable case without additional assumptions and hold under misspecification. Lastly, the paper explains how to apply the results to a selection of GVI posteriors with especially popular variational families. For example, consistency is established for GVI methods using the mean field normal variational family, normal mixtures, Gaussian process variational families as well as neural networks indexing a normal (mixture) distribution.
Connections: Log Likelihood, Cross Entropy, KL Divergence, Logistic Regression, and Neural Networks
Maximizing the (log) likelihood is equivalent to minimizing the binary cross entropy. There is literally no difference between the two objective functions, so there can be no difference between the resulting model or its characteristics. This of course, can be extended quite simply to the multiclass case using softmax cross-entropy and the so-called multinoulli likelihood, so there is no difference when doing this for multiclass cases as is typical in, say, neural networks. The difference between MLE and cross-entropy is that MLE represents a structured and principled approach to modeling and training, and binary/softmax cross-entropy simply represent special cases of that applied to problems that people typically care about. After that aside on maximum likelihood estimation, let's delve more into the relationship between negative log likelihood and cross entropy.
Expert-guided Regularization via Distance Metric Learning
Mani, Shouvik, Maasoumy, Mehdi, Pakazad, Sina, Ohlsson, Henrik
High-dimensional prediction is a challenging problem setting for traditional statistical models. Although regularization improves model performance in high dimensions, it does not sufficiently leverage knowledge on feature importances held by domain experts. As an alternative to standard regularization techniques, we propose Distance Metric Learning Regularization (DMLreg), an approach for eliciting prior knowledge from domain experts and integrating that knowledge into a regularized linear model. First, we learn a Mahalanobis distance metric between observations from pairwise similarity comparisons provided by an expert. Then, we use the learned distance metric to place prior distributions on coefficients in a linear model. Through experimental results on a simulated high-dimensional prediction problem, we show that DMLreg leads to improvements in model performance when the domain expert is knowledgeable.
Is AI different for SE?
Agrawal, Amritanshu, Menzies, Tim
What AI tools are needed for SE? Ideally, we should have simple rules that peek at data, then say "use this tool" or "use that tool". To find such a rule, we explored 120 different data sets addressing numerous problems, including bad smell detection, predicting Github issue close time, bug report analysis, defect prediction and dozens of other non-SE problems. To this data, we apply a SE-based tool that (a)~out-performs the state-of-the-art for these SE problems yet (b)~fails very badly on standard AI problems. In those results, we can find a simple rule for when to use/avoid the SE-based tool. SE data is often about infrequent issues, like the occasional defect, or the rarely exploited security violation, or the requirement that holds for one special case. But as we show, standard AI tools work best when the target is relatively more frequent. Also, we can exploit these special properties of SE, to great effect (to rapidly find better optimizations for SE tasks via a tactic called "dodging", explained in this paper). More generally, this result says we need a new kind of SE research for developing new AI tools that are more suited to SE problems.
Nonparametric Bayesian Structure Adaptation for Continual Learning
Kumar, Abhishek, Chatterjee, Sunabha, Rai, Piyush
Continual Learning is a learning paradigm where machine learning mode ls are trained with sequential or streaming tasks. Two notable directions among the recent adva nces in continual learning with neural networks are ( i) variational Bayes based regularization by learning priors from pre vious tasks, and, ( ii) learning the structure of deep networks to adapt to new tasks. S o far, these two approaches have been orthogonal. We present a principled nonparametric Bayesian appr oach for learning the structure of feed-forward neural networks, addressing the shortcomings o f both these approaches. In our model, the number of nodes in each hidden layer can automatically grow with the in troduction of each new task, and inter-task transfer occurs through the overlapping of differ ent sparse subsets of weights learned by different tasks. On benchmark datasets, our model performs comparably or better than the state-of-the-art approaches, while also being able to adaptively infer the evolving network structure in the continual learning setting.
Improved PAC-Bayesian Bounds for Linear Regression
Shalaeva, Vera, Esfahani, Alireza Fakhrizadeh, Germain, Pascal, Petreczky, Mihaly
In this paper, we improve the PAC-Bayesian error bound for linear regression derived in Germain et al. [10]. The improvements are twofold. First, the proposed error bound is tighter, and converges to the generalization loss with a well-chosen temperature parameter. Second, the error bound also holds for training data that are not independently sampled. In particular, the error bound applies to certain time series generated by well-known classes of dynamical models, such as ARX models.
Sampling-Free Learning of Bayesian Quantized Neural Networks
Su, Jiahao, Cvitkovic, Milan, Huang, Furong
Bayesian learning of model parameters in neural networks is important in scenarios where estimates with well-calibrated uncertainty are important. In this paper, we propose Bayesian quantized networks (BQNs), quantized neural networks (QNNs) for which we learn a posterior distribution over their discrete parameters. We provide a set of efficient algorithms for learning and prediction in BQNs without the need to sample from their parameters or activations, which not only allows for differentiable learning in QNNs, but also reduces the variance in gradients. We demonstrate BQNs achieve both lower predictive errors and better-calibrated uncertainties than E-QNN (with less than 20% of the negative log-likelihood). A Bayesian approach to deep learning considers the network's parameters to be random variables and seeks to infer their posterior distribution given the training data. Models trained this way, called Bayesian neural networks (BNNs) (Wang & Y eung, 2016), in principle have well-calibrated uncertainties when they make predictions, which is important in scenarios such as active learning and reinforcement learning (Gal, 2016). Furthermore, the posterior distribution over the model parameters provides valuable information for evaluation and compression of neural networks. There are three main challenges in using BNNs: (1) Intractable posterior: Computing and storing the exact posterior distribution over the network weights is intractable due to the complexity and high-dimensionality of deep networks. These challenges are typically addressed either by making simplifying assumptions about the distributions of the parameters and activations, or by using sampling-based approaches, which are expensive and unreliable (likely to overestimate the uncertainties in predictions). Our goal is to propose a sampling-free method which uses probabilistic propagation to deterministically learn BNNs. A seemingly unrelated area of deep learning research is that of quantized neural networks (QNNs), which offer advantages of computational and memory efficiency compared to continuous-valued models.
Scalable Variational Bayesian Kernel Selection for Sparse Gaussian Process Regression
Teng, Tong, Chen, Jie, Zhang, Yehong, Low, Kian Hsiang
This paper presents a variational Bayesian kernel selection (VBKS) algorithm for sparse Gaussian process regression (SGPR) models. In contrast to existing GP kernel selection algorithms that aim to select only one kernel with the highest model evidence, our proposed VBKS algorithm considers the kernel as a random variable and learns its belief from data such that the uncertainty of the kernel can be interpreted and exploited to avoid overconfident GP predictions. To achieve this, we represent the probabilistic kernel as an additional variational variable in a variational inference (VI) framework for SGPR models where its posterior belief is learned together with that of the other variational variables (i.e., inducing variables and kernel hyperparameters). In particular, we transform the discrete kernel belief into a continuous parametric distribution via reparameterization in order to apply VI. Though it is computationally challenging to jointly optimize a large number of hyperparameters due to many kernels being evaluated simultaneously by our VBKS algorithm, we show that the variational lower bound of the log-marginal likelihood can be decomposed into an additive form such that each additive term depends only on a disjoint subset of the variational variables and can thus be optimized independently. Stochastic optimization is then used to maximize the variational lower bound by iteratively improving the variational approximation of the exact posterior belief via stochastic gradient ascent, which incurs constant time per iteration and hence scales to big data. We empirically evaluate the performance of our VBKS algorithm on synthetic and massive real-world datasets.
Learning undirected models via query training
Lazaro-Gredilla, Miguel, Lehrach, Wolfgang, George, Dileep
Typical amortized inference in variational autoencoders is specialized for a single probabilistic query. Here we propose an inference network architecture that generalizes to unseen probabilistic queries. Instead of an encoder-decoder pair, we can train a single inference network directly from data, using a cost function that is stochastic not only over samples, but also over queries. We can use this network to perform the same inference tasks as we would in an undirected graphical model with hidden variables, without having to deal with the intractable partition function. The results can be mapped to the learning of an actual undirected model, which is a notoriously hard problem. Our network also marginalizes nuisance variables as required. We show that our approach generalizes to unseen probabilistic queries on also unseen test data, providing fast and flexible inference. Experiments show that this approach outperforms or matches PCD and AdVIL on 9 benchmark datasets.
Normalizing Flows for Probabilistic Modeling and Inference
Papamakarios, George, Nalisnick, Eric, Rezende, Danilo Jimenez, Mohamed, Shakir, Lakshminarayanan, Balaji
Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations. There has been much recent work on normalizing flows, ranging from improving their expressive power to expanding their application. We believe the field has now matured and is in need of a unified perspective. In this review, we attempt to provide such a perspective by describing flows through the lens of probabilistic modeling and inference. We place special emphasis on the fundamental principles of flow design, and discuss foundational topics such as expressive power and computational trade-offs. We also broaden the conceptual framing of flows by relating them to more general probability transformations. Lastly, we summarize the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.