Uncertainty
Infinite Sparse Structured Factor Analysis
Pearce, Matthew C., White, Simon R.
Matrix factorisation methods decompose multivariate observations as linear combinations of latent feature vectors. The Indian Buffet Process (IBP) provides a way to model the number of latent features required for a good approximation in terms of regularised reconstruction error. Previous work has focussed on latent feature vectors with independent entries. We extend the model to include nondiagonal latent covariance structures representing characteristics such as smoothness. This is done by . Using simulations we demonstrate that under appropriate conditions a smoothness prior helps to recover the true latent features, while denoising more accurately. We demonstrate our method on a real neuroimaging dataset, where computational tractability is a sufficient challenge that the efficient strategy presented here is essential.
Beyond Uniform Priors in Bayesian Network Structure Learning
Bayesian network structure learning is often performed in a Bayesian setting, evaluating candidate structures using their posterior probabilities for a given data set. Score-based algorithms then use those posterior probabilities as an objective function and return the maximum a posteriori network as the learned model. For discrete Bayesian networks, the canonical choice for a posterior score is the Bayesian Dirichlet equivalent uniform (BDeu) marginal likelihood with a uniform (U) graph prior, which assumes a uniform prior both on the network structures and on the parameters of the networks. In this paper, we investigate the problems arising from these assumptions, focusing on those caused by small sample sizes and sparse data. We then propose an alternative posterior score: the Bayesian Dirichlet sparse (BDs) marginal likelihood with a marginal uniform (MU) graph prior. Like U BDeu, MU BDs does not require any prior information on the probabilistic structure of the data and can be used as a replacement noninformative score. We study its theoretical properties and evaluate its performance in an extensive simulation study, showing that MU BDs is both more accurate than U BDeu in learning the structure of the network and competitive in predictive power, while not being computationally more complex to estimate.
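As a rough illustration of the BDeu marginal likelihood mentioned in this abstract, the sketch below computes the log score of a single node from its contingency counts using the standard textbook BDeu formula. The function name `bdeu_local_score`, the count layout, and the default imaginary sample size of 1.0 are assumptions made for this example; it does not implement the BDs score or the MU graph prior proposed in the paper.

```python
from math import lgamma, log


def bdeu_local_score(counts, alpha=1.0):
    """Log BDeu marginal likelihood for one node (hypothetical helper).

    counts[j][k] is the number of samples with parent configuration j
    and node state k. alpha is the imaginary sample size (assumed 1.0).
    """
    q = len(counts)         # number of parent configurations
    r = len(counts[0])      # number of states of the node
    a_j = alpha / q         # prior mass per parent configuration
    a_jk = alpha / (q * r)  # prior mass per (configuration, state) cell
    score = 0.0
    for row in counts:
        n_j = sum(row)
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for n_jk in row:
            score += lgamma(a_jk + n_jk) - lgamma(a_jk)
    return score


# A binary node without parents and a single observation: under the
# symmetric prior the marginal probability of that observation is 1/2.
single_observation = bdeu_local_score([[1, 0]])
```

Because the prior mass `alpha` is spread uniformly over all cells, every parent configuration receives some of it even when it is never observed in a sparse data set.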
Sampling-based speech parameter generation using moment-matching networks
Takamichi, Shinnosuke, Koriyama, Tomoki, Saruwatari, Hiroshi
This paper presents sampling-based speech parameter generation using moment-matching networks for Deep Neural Network (DNN)-based speech synthesis. Although people never produce exactly the same speech twice, even when they try to express the same linguistic and para-linguistic information, typical statistical speech synthesis always produces exactly the same speech, i.e., there is no inter-utterance variation in synthetic speech. To give synthetic speech natural inter-utterance variation, this paper builds DNN acoustic models that make it possible to randomly sample speech parameters. The DNNs are trained so that they make the moments of generated speech parameters close to those of natural speech parameters. Since the variation of speech parameters is compressed into a low-dimensional simple prior noise vector, our algorithm has a lower computation cost than direct sampling of speech parameters. As a first step towards generating synthetic speech that has natural inter-utterance variation, this paper investigates whether the proposed sampling-based generation degrades synthetic speech quality. In the evaluation, we compare the speech quality of conventional maximum likelihood-based generation with that of the proposed sampling-based generation. The results demonstrate that the proposed generation causes no degradation in speech quality.
The Stochastic complexity of spin models: How simple are simple spin models?
Beretta, Alberto, Battistin, Claudia, de Mulatier, Clélia, Mastromatteo, Iacopo, Marsili, Matteo
Affiliations: (1) The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, I-34014 Trieste, Italy (Beretta, de Mulatier, Marsili); (2) Kavli Institute for Systems Neuroscience and Centre for Neural Computation, Olav Kyrres gate 9, 7030 Trondheim, Norway (Battistin); (3) Capital Fund Management, 23 rue de l'Université, 75007 Paris, France (Mastromatteo).
Simple models, in information theoretic terms, are those with a small stochastic complexity. We study the stochastic complexity of spin models with interactions of arbitrary order. Invariance with respect to bijections within the space of operators allows us to classify models in complexity classes. This invariance also shows that simplicity is not related to the order of the interactions, but rather to their mutual arrangement.
Approximate Kernel-based Conditional Independence Tests for Fast Non-Parametric Causal Discovery
Strobl, Eric V., Zhang, Kun, Visweswaran, Shyam
Constraint-based causal discovery (CCD) algorithms require fast and accurate conditional independence (CI) testing. The Kernel Conditional Independence Test (KCIT) is currently one of the most popular CI tests in the non-parametric setting, but many investigators cannot use KCIT with large datasets because the test scales cubically with sample size. We therefore devise two relaxations called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT), which both approximate KCIT by utilizing random Fourier features. In practice, both of the proposed tests scale linearly with sample size and return accurate p-values much faster than KCIT in the large sample size context. CCD algorithms run with RCIT or RCoT also return graphs at least as accurate as the same algorithms run with KCIT but with large reductions in run time.
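The random Fourier features mentioned in this abstract can be illustrated with a minimal sketch, assuming a Gaussian (RBF) kernel. The helper name `random_fourier_features`, the bandwidth of 1.5, and the feature count are choices made for this example only. The code shows the kernel approximation itself, not the RCIT or RCoT test statistics.

```python
import numpy as np


def random_fourier_features(x, n_features, bandwidth, rng):
    """Map x (n_samples, n_dims) to random features whose inner
    products approximate an RBF kernel (hypothetical helper)."""
    n_dims = x.shape[1]
    # Frequencies drawn from the Fourier transform of the RBF kernel.
    w = rng.normal(scale=1.0 / bandwidth, size=(n_dims, n_features))
    # Random phase shifts, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
bandwidth = 1.5
z = random_fourier_features(x, n_features=2000, bandwidth=bandwidth, rng=rng)

# Exact RBF kernel exp(-||xi - xj||^2 / (2 * bandwidth^2)) for comparison.
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
k_exact = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
k_approx = z @ z.T
max_abs_error = np.abs(k_exact - k_approx).max()
```

The feature matrix `z` has one row per sample, so statistics built from it can avoid forming and decomposing the full n-by-n kernel matrix, which is the intuition behind the move from cubic to linear scaling in sample size.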
Weak Adaptive Submodularity and Group-Based Active Diagnosis with Applications to State Estimation with Persistent Sensor Faults
Yong, Sze Zheng, Gao, Lingyun, Ozay, Necmiye
In this paper, we consider adaptive decision-making problems for stochastic state estimation with partial observations. First, we introduce the concept of weak adaptive submodularity, a generalization of adaptive submodularity, which has found great success in solving challenging adaptive state estimation problems. Then, for the problem of active diagnosis, i.e., discrete state estimation via active sensing, we show that an adaptive greedy policy has a near-optimal performance guarantee when the reward function possesses this property. We further show that the reward function for group-based active diagnosis, which arises in applications such as medical diagnosis and state estimation with persistent sensor faults, is also weakly adaptive submodular. Finally, in experiments on state estimation for an aircraft electrical system with persistent sensor faults, we observe that an adaptive greedy policy performs as well as an exhaustive search.
Gaussian variational approximation with sparse precision matrices
Tan, Linda S. L., Nott, David J.
The stochastic gradients constructed in this manner are "doubly stochastic" as they are built upon two sources of stochasticity, which come from sampling from the variational distribution and from the full data set. This approach is very general in that it can be applied to any model where the joint density is differentiable. Unlike variational Bayes, it does not assume independence relationships among blocks of an appropriate partition of θ. Such independence assumptions have been shown to result in underestimation of the posterior variance (Wang and Titterington, 2005; Bishop, 2006). The quality of the resulting approximation is thus limited only by how well the form of q(θ) matches the true posterior. Using this approach, Kucukelbir et al. (2016) develop an automatic differentiation variational inference (ADVI) algorithm in Stan, where q(θ) is assumed to be either a diagonal (mean-field) or unrestricted Gaussian variational approximation. Constrained variables are transformed to the real line via Stan's library of transformations, and the gradients are computed using Monte Carlo integration. They note that while unrestricted ADVI is able to capture posterior correlations and hence produces more accurate marginal variance estimates than mean-field ADVI, it can be prohibitively slow for large data since the number of variational parameters scales as the square of the length of θ. In this article, we consider variational approximations that take the form of a multivariate Gaussian distribution N(µ, Σ) for models with high-dimensional parameters (µ denotes the mean and Σ the covariance matrix).
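To make the quadratic scaling concrete, the sketch below counts the free parameters of the two Gaussian families described above. It assumes the unrestricted covariance is parameterised by a lower-triangular Cholesky factor, which is a common choice but an assumption here; the function name `n_variational_params` is hypothetical.

```python
def n_variational_params(d, family):
    """Number of free parameters in a Gaussian q(theta) of dimension d."""
    if family == "mean_field":
        # d means plus d variances (diagonal covariance).
        return 2 * d
    if family == "unrestricted":
        # d means plus the d * (d + 1) / 2 entries of a lower-triangular
        # Cholesky factor of the covariance (assumed parameterisation).
        return d + d * (d + 1) // 2
    raise ValueError(f"unknown family: {family}")


for d in (10, 1_000, 100_000):
    print(d,
          n_variational_params(d, "mean_field"),
          n_variational_params(d, "unrestricted"))
```

The mean-field count grows linearly in the length of θ while the unrestricted count grows quadratically, which is why structure in the covariance or precision matrix matters for high-dimensional models.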
Bay Area Probabilistic Programming Meetup
Are probabilistic programming and Bayesian reasoning algorithms the next big thing in machine learning? The idea behind the probabilistic programming approach to machine learning is that the model of the data can be separated from the algorithms that do inference on the model. This allows you to devote your energy to building models tailored to your decision problem, as opposed to constraining your problem so that it works with some machine learning tool. This idea opens machine learning to domain experts. Indeed, probabilistic programming grew out of probabilistic graphical models, which revolutionized AI by enabling expert knowledge to be built into graphs powered by Bayesian inference.
Dense Distributions from Sparse Samples: Improved Gibbs Sampling Parameter Estimators for LDA
Papanikolaou, Yannis, Foulds, James R., Rubin, Timothy N., Tsoumakas, Grigorios
We introduce a novel approach for estimating Latent Dirichlet Allocation (LDA) parameters from collapsed Gibbs samples (CGS), by leveraging the full conditional distributions over the latent variable assignments to efficiently average over multiple samples, for little more computational cost than drawing a single additional collapsed Gibbs sample. Our approach can be understood as adapting the soft clustering methodology of Collapsed Variational Bayes (CVB0) to CGS parameter estimation, in order to get the best of both techniques. Our estimators can straightforwardly be applied to the output of any existing implementation of CGS, including modern accelerated variants. We perform extensive empirical comparisons of our estimators with those of standard collapsed inference algorithms on real-world data for both unsupervised LDA and Prior-LDA, a supervised variant of LDA for multi-label classification. Our results show a consistent advantage of our approach over traditional CGS under all experimental conditions, and over CVB0 inference in the majority of conditions. More broadly, our results highlight the importance of averaging over multiple samples in LDA parameter estimation, and the use of efficient computational techniques to do so.
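As background for this abstract, the sketch below shows the standard point estimate of the topic-word distributions from the count matrix of one collapsed Gibbs sample, followed by a plain average of that estimate over several samples. It is not the estimator proposed in the paper, which uses the full conditional distributions over the assignments. The function names, the toy counts, and the smoothing value `beta=0.1` are assumptions, and the average presumes that topic indices stay aligned across samples (for example, samples taken from a single chain).

```python
import numpy as np


def phi_from_counts(topic_word_counts, beta):
    """Standard estimate of topic-word distributions from one sample.

    topic_word_counts[k, w] is the number of tokens of word w assigned
    to topic k; beta is the symmetric Dirichlet hyperparameter.
    """
    smoothed = topic_word_counts + beta
    return smoothed / smoothed.sum(axis=1, keepdims=True)


def averaged_phi(count_samples, beta):
    """Average the per-sample estimates over several Gibbs samples."""
    return np.mean([phi_from_counts(c, beta) for c in count_samples], axis=0)


# Two toy samples with 2 topics and a 3-word vocabulary (made-up counts).
samples = [
    np.array([[3, 0, 1], [0, 4, 0]]),
    np.array([[2, 1, 1], [1, 3, 0]]),
]
phi = averaged_phi(samples, beta=0.1)
```

Each row of `phi` remains a probability distribution because an average of normalised rows is itself normalised. A single sample gives a sparse, high-variance estimate, while the average is denser.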
10 Free Must-Read Books for Machine Learning and Data Science
This book provides an introduction to statistical learning methods. It is aimed at upper-level undergraduate students, master's students and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations of how to implement the various methods in real-life settings, and should be a valuable resource for a practicing data scientist.