Bayesian Inference
Bayesian methods for low-rank matrix estimation: short survey and theoretical study
The problem of low-rank matrix estimation recently received a lot of attention due to challenging applications. A lot of work has been done on rank-penalized methods and convex relaxation, both on the theoretical and applied sides. However, only a few papers considered Bayesian estimation. In this paper, we review the different type of priors considered on matrices to favour low-rank. We also prove that the obtained Bayesian estimators, under suitable assumptions, enjoys the same optimality properties as the ones based on penalization.
Asymptotic Properties of Recursive Maximum Likelihood Estimation in Non-Linear State-Space Models
Tadic, Vladislav Z. B., Doucet, Arnaud
Using stochastic gradient search and the optimal filter derivative, it is possible to perform recursive (i.e., online) maximum likelihood estimation in a non-linear state-space model. As the optimal filter and its derivative are analytically intractable for such a model, they need to be approximated numerically. In [Poyiadjis, Doucet and Singh, Biometrika 2018], a recursive maximum likelihood algorithm based on a particle approximation to the optimal filter derivative has been proposed and studied through numerical simulations. Here, this algorithm and its asymptotic behavior are analyzed theoretically. We show that the algorithm accurately estimates maxima to the underlying (average) log-likelihood when the number of particles is sufficiently large. We also derive (relatively) tight bounds on the estimation error. The obtained results hold under (relatively) mild conditions and cover several classes of non-linear state-space models met in practice.
Why Interpretability in Machine Learning? An Answer Using Distributed Detection and Data Fusion Theory
Varshney, Kush R., Khanduri, Prashant, Sharma, Pranay, Zhang, Shan, Varshney, Pramod K.
As artificial intelligence is increasingly affecting all parts of society and life, there is growing recognition that human interpretability of machine learning models is important. It is often argued that accuracy or other similar generalization performance metrics must be sacrificed in order to gain interpretability. Such arguments, however, fail to acknowledge that the overall decision-making system is composed of two entities: the learned model and a human who fuses together model outputs with his or her own information. As such, the relevant performance criteria should be for the entire system, not just for the machine learning component. In this work, we characterize the performance of such two-node tandem data fusion systems using the theory of distributed detection. In doing so, we work in the population setting and model interpretable learned models as multi-level quantizers. We prove that under our abstraction, the overall system of a human with an interpretable classifier outperforms one with a black box classifier.
Learning dynamical systems with particle stochastic approximation EM
Svensson, Andreas, Lindsten, Fredrik
Learning of dynamical systems, or state-space models, is central to many machine learning problems, such as reinforcement learning, sequence modeling, and autonomous systems. Furthermore, state-space models are at the core of recent model developments within the machine learning area, such as Gaussian process state-space models (Frigola et al. 2014a; Mattos et al. 2016; etc.), infinite factorial dynamical models (Gael et al., 2009; Valera et al., 2015), and stochastic recurrent neural networks (Fraccaro et al., 2016, for example). A strategy to learn state-space models, independently suggested by Digalakis et al. (1993) and Ghahramani and Hinton (1996), is the use of the Expectation Maximization (EM, Dempster et al. 1977) method. Even though originally proposed only for maximum likelihood estimation of linear models with Gaussian noise, the strategy can be generalized to the more challenging nonlinear and non-Gaussian cases, as well as the empirical Bayes setting. Many contributions have been made during the last decade, and this paper takes another step along the path towards a more computationally efficient method with a solid theoretical ground for learning of nonlinear dynamical systems.
Fundamental limits of detection in the spiked Wigner model
Alaoui, Ahmed El, Krzakala, Florent, Jordan, Michael I.
We study the fundamental limits of detecting the presence of an additive rank-one perturbation, or spike, to a Wigner matrix. When the spike comes from a prior that is i.i.d. across coordinates, we prove that the log-likelihood ratio of the spiked model against the non-spiked one is asymptotically normal below a certain reconstruction threshold which is not necessarily of a "spectral" nature, and that it is degenerate above. This establishes the maximal region of contiguity between the planted and null models. It is known that this threshold also marks a phase transition for estimating the spike: the latter task is possible above the threshold and impossible below. Therefore, both estimation and detection undergo the same transition in this random matrix model. We also provide further information about the performance of the optimal test. Our proofs are based on Gaussian interpolation methods and a rigorous incarnation of the cavity method, as devised by Guerra and Talagrand in their study of the Sherrington--Kirkpatrick spin-glass model.
Accelerating likelihood optimization for ICA on real signals
Ablin, Pierre, Cardoso, Jean-Franรงois, Gramfort, Alexandre
We study optimization methods for solving the maximum likelihood formulation of independent component analysis (ICA). We consider both the the problem constrained to white signals and the unconstrained problem. The Hessian of the objective function is costly to compute, which renders Newton's method impractical for large data sets. Many algorithms proposed in the literature can be rewritten as quasi-Newton methods, for which the Hessian approximation is cheap to compute. These algorithms are very fast on simulated data where the linear mixture assumption really holds. However, on real signals, we observe that their rate of convergence can be severely impaired. In this paper, we investigate the origins of this behavior, and show that the recently proposed Preconditioned ICA for Real Data (Picard) algorithm overcomes this issue on both constrained and unconstrained problems.
On consistent estimation of the missing mass
Ayed, Fadhel, Battiston, Marco, Camerlenghi, Federico, Favaro, Stefano
Given $n$ samples from a population of individuals belonging to different types with unknown proportions, how do we estimate the probability of discovering a new type at the $(n+1)$-th draw? This is a classical problem in statistics, commonly referred to as the missing mass estimation problem. Recent results by Ohannessian and Dahleh \citet{Oha12} and Mossel and Ohannessian \citet{Mos15} showed: i) the impossibility of estimating (learning) the missing mass without imposing further structural assumptions on the type proportions; ii) the consistency of the Good-Turing estimator for the missing mass under the assumption that the tail of the type proportions decays to zero as a regularly varying function with parameter $\alpha\in(0,1)$. In this paper we rely on tools from Bayesian nonparametrics to provide an alternative, and simpler, proof of the impossibility of a distribution-free estimation of the missing mass. Up to our knowledge, the use of Bayesian ideas to study large sample asymptotics for the missing mass is new, and it could be of independent interest. Still relying on Bayesian nonparametric tools, we then show that under regularly varying type proportions the convergence rate of the Good-Turing estimator is the best rate that any estimator can achieve, up to a slowly varying function, and that minimax rate must be at least $n^{-\alpha/2}$. We conclude with a discussion of our results, and by conjecturing that the Good-Turing estimator is an rate optimal minimax estimator under regularly varying type proportions.
This Week's Top Stocks FB, DDD, AMZN, & TWTR Stock Forecasts Quantifying Uncertainty and Bayesian Inference
The U.S. cotton market has remained stable since its spike in 2011, when China executed its cotton reserving and fiber hoarding plan. It is believed that U.S. cotton demand and price were artificially kept low because there are always worries that China would unexpectedly unleash its cotton stockpile, about half of the global storage. However, U.S. cotton price finally showed a revival in recent days. The ICE July cotton futures closed at 95.21 cents a pound on Tuesday, June 12, the highest level for a front-month future contract in the last 6 years. The revival could be attributed to multiple factors, with an emphasis on the worries about insufficient rain in the cotton-growing areas and the newly issued import quotas from China.
Probabilistic Inference Using Generators - The Statues Algorithm
We present here a new probabilistic inference algorithm that gives exact results in the domain of discrete probability distributions. This algorithm, named the Statues algorithm, calculates the marginal probability distribution on probabilistic models defined as direct acyclic graphs. These models are made up of well-defined primitives that allow to express, in particular, joint probability distributions, Bayesian networks, discrete Markov chains, conditioning and probabilistic arithmetic. The Statues algorithm relies on a variable binding mechanism based on the generator construct, a special form of coroutine; being related to the enumeration algorithm, this new algorithm brings important improvements in terms of efficiency, which makes it valuable in regard to other exact marginalization algorithms. After introduction of several definitions, primitives and compositional rules, we present in details the Statues algorithm. Then, we briefly discuss the interest of this algorithm compared to others and we present possible extensions. Finally, we introduce Lea and MicroLea, two Python libraries implementing the Statues algorithm, along with several use cases.
Constructing Deep Neural Networks by Bayesian Network Structure Learning
Rohekar, Raanan Y. Yehezkel, Nisimov, Shami, Koren, Guy, Gurwicz, Yaniv, Novik, Gal
We introduce a principled approach for unsupervised structure learning of deep neural networks. We propose a new interpretation for depth and inter-layer connectivity where conditional independencies in the input distribution are encoded hierarchically in the network structure. Thus, the depth of the network is determined inherently (equal to the maximal order of independence in the input distribution). The proposed method casts the problem of neural network structure learning as a problem of Bayesian network structure learning. Then, instead of directly learning the discriminative structure, it learns a generative graph, constructs its stochastic inverse, and then constructs a discriminative graph. We prove that conditional-dependency relations among the latent variables in the generative graph are preserved in the class-conditional discriminative graph. We demonstrate on image classification benchmarks that the deepest layers (convolutional and dense) of common networks can be replaced by significantly smaller learned structures, while maintaining classification accuracy---state-of-the-art on tested benchmarks. Our structure learning algorithm requires a small computational cost and runs efficiently on a standard desktop CPU.