Uncertainty
Frank-Wolfe Stein Sampling
Futami, Futoshi, Cui, Zhenghang, Sato, Issei, Sugiyama, Masashi
In Bayesian inference, the posterior distributions are difficult to obtain analytically for complex models such as neural networks. Variational inference usually uses a parametric distribution to approximate, from which we can easily draw samples. Recently discrete approximation by particles has attracted attention because of its expressive ability. An example is Stein variational gradient descent (SVGD), which iteratively optimizes particles. Although SVGD has been shown to be computationally efficient empirically, its theoretical properties have not been clarified yet and no finite sample bound of a convergence rate is known. Another example is Stein points (SP), which minimizes kernelized Stein discrepancy directly. The finite sample bound of SP is $\mathcal{O}(\sqrt{\log{N}/N})$ for $N$ particles, which is computationally inefficient empirically, especially in high-dimensional problems. In this paper, we propose a novel method named \emph{Frank-Wolfe Stein sampling}, which minimizes the maximum mean discrepancy in a greedy way. Our method is computationally efficient empirically and theoretically achieves a faster convergence rate, $\mathcal{O}(e^{-N})$. Numerical experiments show the superiority of our method.
Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting
Ritter, Hippolyt, Botev, Aleksandar, Barber, David
We introduce the Kronecker factored online Laplace approximation for overcoming catastrophic forgetting in neural networks. The method is grounded in a Bayesian online learning framework, where we recursively approximate the posterior after every task with a Gaussian, leading to a quadratic penalty on changes to the weights. The Laplace approximation requires calculating the Hessian around a mode, which is typically intractable for modern architectures. In order to make our method scalable, we leverage recent block-diagonal Kronecker factored approximations to the curvature. Our algorithm achieves over 90% test accuracy across a sequence of 50 instantiations of the permuted MNIST dataset, substantially outperforming related methods for overcoming catastrophic forgetting.
Minimax Lower Bounds for Cost Sensitive Classification
Kamalaruban, Parameswaran, Williamson, Robert C.
The central problem of this paper is the cost-sensitive binary classification problem, where different costs are associated with different types of mistakes. Several important machine learning applications such as medical decision making, targeted marketing, and intrusion detection can be naturally formalized as costsensitive classification setup ([1]). In these domains, the cost of missing a target is much higher than that of a false-positive, and classifiers that do not take misclassification costs into account do not perform well. The cost-sensitive classification problem has been extensively studied, and people have developed efficient algorithms with provable guarantees on the (generalization) error [6, 9, 26, 27, 11, 4]. These methods primarily take existing classification methods based on empirical risk minimization and try to adapt them in various ways to be sensitive to these misclassification costs. Despite all these efforts, the understanding of the fundamental limits of this problem is still missing. In this paper, we study the hardness of this problem by obtaining minimax lower bounds. In particular, we are interested in understanding how the cost parameter influences the hardness or complexity of the cost-sensitive classification. Minimax Lower Bounds Understanding the hardness or fundamental limits of a learning problem is important for practice for the following reasons: - They give an estimate on the number of samples required for a good performance of a learning algorithm.
An Online RFID Localization in the Manufacturing Shopfloor
Ashfahani, Andri, Pratama, Mahardhika, Lughofer, Edwin, Cai, Qing, Sheng, Huang
Radio Frequency Identification technology has gained popularity for cheap and easy deployment. In the realm of manufacturing shopfloor, it can be used to track the location of manufacturing objects to achieve better efficiency. The underlying challenge of localization lies in the non-stationary characteristics of manufacturing shopfloor which calls for an adaptive life-long learning strategy in order to arrive at accurate localization results. This paper presents an evolving model based on a novel evolving intelligent system, namely evolving Type-2 Quantum Fuzzy Neural Network (eT2QFNN), which features an interval type-2 quantum fuzzy set with uncertain jump positions. The quantum fuzzy set possesses a graded membership degree which enables better identification of overlaps between classes. The eT2QFNN works fully in the evolving mode where all parameters including the number of rules are automatically adjusted and generated on the fly. The parameter adjustment scenario relies on decoupled extended Kalman filter method. Our numerical study shows that eT2QFNN is able to deliver comparable accuracy compared to state-of-the-art algorithms.
r/MachineLearning - [R] To Build Truly Intelligent Machines, Teach Them Cause and Effect
Just FYI: "Judea Pearl created the representational and computational foundation for the processing of information under uncertainty. He is credited with the invention of Bayesian networks,..." His wiki page doesn't list his contributions, it links to other people's summaries of his contributions. The guy you say is stuck in an old paradigm is one of the main inventors of the new paradigm.
Heterogeneous Multi-output Gaussian Process Prediction
Moreno-Muรฑoz, Pablo, Artรฉs-Rodrรญguez, Antonio, รlvarez, Mauricio A.
Multi-output Gaussian processes (MOGP) generalise the powerful Gaussian process (GP) predictive model to the vector-valued random field setup (Alvarez et al., 2012). It has been experimentally shown that by simultaneously exploiting correlations between multiple outputs and across the input space, it is possible to provide better predictions, particularly in scenarios with missing or noisy data (Bonilla et al., 2008; Dai et al., 2017). The main focus in the literature for MOGP has been on the definition of a suitable cross-covariance function between the multiple outputs that allows for the treatment of outputs as a single GP with a properly defined covariance function (Alvarez et al., 2012). The two classical alternatives to define such cross-covariance functions are the linear model of coregionalisation (LMC) (Journel and Huijbregts, 1978) and process convolutions (Higdon, 2002). In the former case, each output corresponds to a weighted sum of shared latent random functions.
Distributionally Robust Inverse Covariance Estimation: The Wasserstein Shrinkage Estimator
Nguyen, Viet Anh, Kuhn, Daniel, Esfahani, Peyman Mohajerin
We introduce a distributionally robust maximum likelihood estimation model with a Wasserstein ambiguity set to infer the inverse covariance matrix of a $p$-dimensional Gaussian random vector from $n$ independent samples. The proposed model minimizes the worst case (maximum) of Stein's loss across all normal reference distributions within a prescribed Wasserstein distance from the normal distribution characterized by the sample mean and the sample covariance matrix. We prove that this estimation problem is equivalent to a semidefinite program that is tractable in theory but beyond the reach of general purpose solvers for practically relevant problem dimensions $p$. In the absence of any prior structural information, the estimation problem has an analytical solution that is naturally interpreted as a nonlinear shrinkage estimator. Besides being invertible and well-conditioned even for $p>n$, the new shrinkage estimator is rotation-equivariant and preserves the order of the eigenvalues of the sample covariance matrix. These desirable properties are not imposed ad hoc but emerge naturally from the underlying distributionally robust optimization model. Finally, we develop a sequential quadratic approximation algorithm for efficiently solving the general estimation problem subject to conditional independence constraints typically encountered in Gaussian graphical models.
Approximate Model Counting by Partial Knowledge Compilation
Model counting is the problem of computing the number of satisfying assignments of a given propositional formula. Although exact model counters can be naturally furnished by most of the knowledge compilation (KC) methods, in practice, they fail to generate the compiled results for the exact counting of models for certain formulas due to the explosion in sizes. Decision-DNNF is an important KC language that captures most of the practical compilers. We propose a generalized Decision-DNNF (referred to as partial Decision-DNNF) via introducing a class of new leaf vertices (called unknown vertices), and then propose an algorithm called PartialKC to generate randomly partial Decision-DNNF formulas from the given formulas. An unbiased estimate of the model number can be computed via a randomly partial Decision-DNNF formula. Each calling of PartialKC consists of multiple callings of MicroKC, while each of the latter callings is a process of importance sampling equipped with KC technologies. The experimental results show that PartialKC is more accurate than both SampleSearch and SearchTreeSampler, PartialKC scales better than SearchTreeSampler, and the KC technologies can obviously accelerate sampling.
PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits
Dumitrascu, Bianca, Feng, Karen, Engelhardt, Barbara E
We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards. Using a fast inference procedure with Polya-Gamma distributed augmentation variables, we propose an improved version of Thompson Sampling, a Bayesian formulation of contextual bandits with near-optimal performance. Our approach, Polya-Gamma augmented Thompson Sampling (PG-TS), achieves state-of-the-art performance on simulated and real data. PG-TS explores the action space efficiently and exploits high-reward arms, quickly converging to solutions of low regret. Its explicit estimation of the posterior distribution of the context feature covariance leads to substantial empirical gains over approximate approaches. PG-TS is the first approach to demonstrate the benefits of Polya-Gamma augmentation in bandits and to propose an efficient Gibbs sampler for approximating the analytically unsolvable integral of logistic contextual bandits.
Accurate Kernel Learning for Linear Gaussian Markov Processes using a Scalable Likelihood Computation
We report an exact likelihood computation for Linear Gaussian Markov processes that is more scalable than existing algorithms for complex models and sparsely sampled signals. Better scaling is achieved through elimination of repeated computations in the Kalman likelihood, and by using the diagonalized form of the state transition equation. Using this efficient computation, we study the accuracy of kernel learning using maximum likelihood and the posterior mean in a simulation experiment. The posterior mean with a reference prior is more accurate for complex models and sparse sampling. Because of its lower computation load, the maximum likelihood estimator is an attractive option for more densely sampled signals and lower order models. We confirm estimator behavior in experimental data through their application to speleothem data.