 Bayesian Inference


Stein Variational Gaussian Processes

arXiv.org Machine Learning

We show how to use Stein variational gradient descent (SVGD) to carry out inference in Gaussian process (GP) models with non-Gaussian likelihoods and large data volumes. Markov chain Monte Carlo (MCMC) is extremely computationally intensive for these situations, but the parametric assumptions required for efficient variational inference (VI) result in incorrect inference when they encounter the multi-modal posterior distributions that are common for such models. SVGD provides a non-parametric alternative to variational inference which is substantially faster than MCMC but unhindered by parametric assumptions. We prove that for GP models with Lipschitz gradients the SVGD algorithm monotonically decreases the Kullback-Leibler divergence from the sampling distribution to the true posterior. Our method is demonstrated on benchmark problems in both regression and classification, and a real air quality example with 11440 spatiotemporal observations, showing substantial performance improvements over MCMC and VI.
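For readers unfamiliar with SVGD, the particle update that the abstract refers to can be sketched in a few lines. This is a minimal one-dimensional illustration, not the authors' GP implementation: the Gaussian target, the fixed RBF bandwidth, the step size, and the names `svgd_step`, `run_svgd`, and `gaussian_score` are all assumptions made for the example.

```python
import math
import random


def svgd_step(particles, score, step_size=0.1, bandwidth=1.0):
    """Apply one SVGD update to a list of 1-D particles (illustrative only)."""
    n = len(particles)
    updated = []
    for x_i in particles:
        phi = 0.0
        for x_j in particles:
            diff = x_j - x_i
            k = math.exp(-diff * diff / (2.0 * bandwidth ** 2))
            # First term: kernel-weighted score, pulls x_i toward high density.
            # Second term: derivative of k w.r.t. x_j, pushes particles apart.
            phi += k * score(x_j) - (diff / bandwidth ** 2) * k
        updated.append(x_i + step_size * phi / n)
    return updated


def gaussian_score(x, mean=2.0, variance=1.0):
    """Gradient of the log density of an assumed N(mean, variance) target."""
    return -(x - mean) / variance


def run_svgd(score, n_particles=30, n_steps=500, seed=0):
    """Move particles from a uniform start toward the target distribution."""
    rng = random.Random(seed)
    particles = [rng.uniform(-1.0, 1.0) for _ in range(n_particles)]
    for _ in range(n_steps):
        particles = svgd_step(particles, score)
    return particles


if __name__ == "__main__":
    final = run_svgd(gaussian_score)
    mean = sum(final) / len(final)
    print("particle mean:", round(mean, 3))
```

Because the particles are updated deterministically and no parametric family is imposed on them, the same loop can in principle follow a multi-modal target, which is the property the abstract contrasts with parametric VI.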


Multilevel Gibbs Sampling for Bayesian Regression

arXiv.org Machine Learning

Bayesian regression remains a simple but effective tool based on Bayesian inference techniques. For large-scale applications with complicated posterior distributions, Markov chain Monte Carlo methods are applied. To reduce the well-known computational burden of the Markov chain Monte Carlo approach for Bayesian regression, we developed a multilevel Gibbs sampler for Bayesian regression of linear mixed models. The level hierarchy of data matrices is created by clustering the features and/or samples of data matrices. Additionally, the use of correlated samples is investigated for variance reduction to improve the convergence of the Markov chain. In tests on a diverse set of data sets, a speed-up is achieved for almost all of them without significant loss in predictive performance.
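As background for the abstract above, the following is a minimal sketch of an ordinary single-level Gibbs sampler for a one-coefficient Bayesian linear regression. It is not the multilevel sampler the paper proposes and it does not cover mixed models; the model (Gaussian prior on the slope, inverse-gamma prior on the noise variance), the hyperparameters `prior_var`, `a0`, `b0`, and the function names are assumptions chosen for illustration.

```python
import math
import random


def simulate_data(n=200, true_beta=1.5, noise_sd=0.5, seed=1):
    """Generate synthetic data y = true_beta * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [true_beta * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def gibbs_linear_regression(xs, ys, n_samples=2000, burn_in=500,
                            prior_var=100.0, a0=2.0, b0=1.0, seed=0):
    """Alternate draws of the slope and the noise variance from their
    full conditionals; returns post-burn-in (beta, sigma2) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    beta, sigma2 = 0.0, 1.0
    draws = []
    for it in range(burn_in + n_samples):
        # beta | sigma2, data is Gaussian (conjugate to the N(0, prior_var) prior).
        post_var = 1.0 / (sum_xx / sigma2 + 1.0 / prior_var)
        post_mean = post_var * sum_xy / sigma2
        beta = rng.gauss(post_mean, math.sqrt(post_var))
        # sigma2 | beta, data is inverse-gamma(shape, rate).
        sse = sum((y - beta * x) ** 2 for x, y in zip(xs, ys))
        shape = a0 + 0.5 * n
        rate = b0 + 0.5 * sse
        # gammavariate takes a scale parameter, so pass 1 / rate.
        precision = rng.gammavariate(shape, 1.0 / rate)
        sigma2 = 1.0 / precision
        if it >= burn_in:
            draws.append((beta, sigma2))
    return draws


if __name__ == "__main__":
    xs, ys = simulate_data()
    draws = gibbs_linear_regression(xs, ys)
    print("posterior mean of beta:", sum(b for b, _ in draws) / len(draws))
```

Each sweep touches every observation when recomputing the residual sum of squares, which is the per-iteration cost that motivates cheaper schemes on large data sets.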


Resource-Constrained On-Device Learning by Dynamic Averaging

arXiv.org Machine Learning

Communication between data-generating devices is partially responsible for a growing portion of the world's power consumption. Thus, reducing communication is vital from both an economic and an ecological perspective. For machine learning, on-device learning avoids sending raw data, which can reduce communication substantially. Furthermore, not centralizing the data protects privacy-sensitive data. However, most learning algorithms require hardware with high computation power and thus high energy consumption. In contrast, ultra-low-power processors, like FPGAs or micro-controllers, allow for energy-efficient learning of local models. Combined with communication-efficient distributed learning strategies, this reduces the overall energy consumption and enables applications that were previously impossible due to limited energy on local devices. The major challenge is then that low-power processors typically have only integer processing capabilities. This paper investigates an approach to communication-efficient on-device learning of integer exponential families that can be executed on low-power processors, is privacy-preserving, and effectively minimizes communication. The empirical evaluation shows that the approach can reach a model quality comparable to a centrally learned regular model with an order of magnitude less communication. In terms of overall energy consumption, this significantly reduces the energy required to solve the machine learning task.


Finite mixture models do not reliably learn the number of components

arXiv.org Machine Learning

Scientists and engineers are often interested in learning the number of subpopulations (or components) present in a data set. A common suggestion is to use a finite mixture model (FMM) with a prior on the number of components. Past work has shown the resulting FMM component-count posterior is consistent; that is, the posterior concentrates on the true generating number of components. But existing results crucially depend on the assumption that the component likelihoods are perfectly specified. In practice, this assumption is unrealistic, and empirical evidence suggests that the FMM posterior on the number of components is sensitive to the likelihood choice. In this paper, we add rigor to data-analysis folk wisdom by proving that under even the slightest model misspecification, the FMM component-count posterior diverges: the posterior probability of any particular finite number of latent components converges to 0 in the limit of infinite data. We illustrate practical consequences of our theory on simulated and real data sets.


A Rigorous Link Between Self-Organizing Maps and Gaussian Mixture Models

arXiv.org Machine Learning

This work presents a mathematical treatment of the relation between Self-Organizing Maps (SOMs) and Gaussian Mixture Models (GMMs). We show that energy-based SOM models can be interpreted as performing gradient descent, minimizing an approximation to the GMM log-likelihood that is particularly valid for high data dimensionalities. The SOM-like decrease of the neighborhood radius can be understood as an annealing procedure ensuring that gradient descent does not get stuck in undesirable local minima. This link allows SOMs to be treated as generative probabilistic models, giving a formal justification for using SOMs, e.g., to detect outliers or for sampling.


Bayesian Topological Learning for Classifying the Structure of Biological Networks

arXiv.org Machine Learning

Actin cytoskeleton networks generate local topological signatures due to the natural variations in the number, size, and shape of holes of the networks. Persistent homology is a method that explores these topological properties of data and summarizes them as persistence diagrams. In this work, we analyze and classify these filament networks by transforming them into persistence diagrams whose variability is quantified via a Bayesian framework on the space of persistence diagrams. The proposed generalized Bayesian framework adopts an independent and identically distributed cluster point process characterization of persistence diagrams and relies on a substitution likelihood argument. This framework provides the flexibility to estimate the posterior cardinality distribution of points in a persistence diagram and the posterior spatial distribution simultaneously. We present a closed form of the posteriors under the assumption of Gaussian mixtures and binomials for prior intensity and cardinality respectively. Using this posterior calculation, we implement a Bayes factor algorithm to classify the actin filament networks and benchmark it against several state-of-the-art classification methods.


Bandit Change-Point Detection for Real-Time Monitoring High-Dimensional Data Under Sampling Control

arXiv.org Machine Learning

In many real-time monitoring applications, one can often observe or use only selected components of the data for decision-making due to capacity limitations in data acquisition, transmission, processing, or storage. For instance, sensor devices might have limited battery power; thus, one might want to use a subset of sensors per time step over a long period instead of using all sensors simultaneously over a short period. Likewise, while sensing is usually cheap, the communication bandwidth from remote sensors to the fusion center that makes a global decision is often limited. The fusion center might prioritize certain local sensors to send local information for decision-making. Also, in many applications such as quality engineering or biosurveillance, one faces a design issue and needs to decide which variables or patients to measure in order to detect a defect or disease outbreak more efficiently. This paper investigates how to efficiently monitor high-dimensional streaming data in real time under resource constraints.


Parsimonious Feature Extraction Methods: Extending Robust Probabilistic Projections with Generalized Skew-t

arXiv.org Machine Learning

The study focuses on an extension of the approach of Principal Component Analysis (PCA), as defined in [1], [2] or [3]. PCA and related matrix factorisation methodologies are widely used in data-rich environments for dimensionality reduction, data compression, feature extraction or data de-noising. The methodologies identify a lower-dimensional linear subspace to represent the data, which captures the dominant second-order information contained in high-dimensional data sets. PCA can be viewed as a matrix factorisation problem which aims to learn the lower-dimensional representation of the data, preserving its Euclidean structure. However, when the data-generating distribution is non-Gaussian or when outliers corrupt the data, the standard PCA methodology provides biased information about the lower-rank representation. In many applications, the stochastic noise or observation errors in the data set are assumed to be, in some sense, "well-behaved"; for instance, additive, light-tailed, symmetric and zero-mean. When non-robust feature extraction methods are naively utilised in the presence of violations of these implicit statistical assumptions, the information contained in the extracted features cannot be relied upon, resulting in misleading inference. Therefore, it is critical to ensure that the feature extraction captures information about the correct characteristics of the process generating the data. In the following study, we relax the inherent assumption of "well-behaved" observation noise by developing a class of robust estimators that can withstand violations of such assumptions, which routinely arise in real data sets.
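The sensitivity of standard PCA to outliers mentioned above can be illustrated with a small sketch. It computes the leading principal direction of a sample covariance matrix by power iteration, first on clean synthetic data and then with one gross outlier appended. This shows only standard, non-robust PCA, not the robust estimators the study develops; the synthetic data, the outlier value, and the function names are assumptions made for the example.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observations."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def leading_component(cov, n_iters=200):
    """Leading eigenvector of cov by power iteration (unit length).
    Assumes the start vector is not orthogonal to that eigenvector."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def make_data(n=100, seed=0):
    """Synthetic 2-D data whose variance lies mostly along the first axis."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 2.0), rng.gauss(0.0, 0.2)] for _ in range(n)]


if __name__ == "__main__":
    clean = make_data()
    corrupted = clean + [[0.0, 50.0]]  # one hypothetical gross outlier
    print("clean direction:    ", leading_component(covariance_matrix(clean)))
    print("corrupted direction:", leading_component(covariance_matrix(corrupted)))
```

In this setup a single extreme observation is enough to rotate the leading direction from roughly the first axis to roughly the second, which is the kind of distortion that motivates robust alternatives.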


Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases

arXiv.org Artificial Intelligence

Equipping machines with comprehensive knowledge of the world's entities and their relationships has been a long-standing goal of AI. Over the last decade, large-scale knowledge bases, also known as knowledge graphs, have been automatically constructed from web contents and text sources, and have become a key asset for search engines. This machine knowledge can be harnessed to semantically interpret textual phrases in news, social media and web tables, and contributes to question answering, natural language processing and data analytics. This article surveys fundamental concepts and practical methods for creating and curating large knowledge bases. It covers models and methods for discovering and canonicalizing entities and their semantic types and organizing them into clean taxonomies. On top of this, the article discusses the automatic extraction of entity-centric properties. To support the long-term life-cycle and the quality assurance of machine knowledge, the article presents methods for constructing open schemas and for knowledge curation. Case studies on academic projects and industrial knowledge graphs complement the survey of concepts and methods.


Representation Learning from Limited Educational Data with Crowdsourced Labels

arXiv.org Artificial Intelligence

Representation learning has been proven to play an important role in the unprecedented success of machine learning models in numerous tasks, such as machine translation, face recognition and recommendation. The majority of existing representation learning approaches require a large number of consistent and noise-free labels. However, for reasons such as budget constraints and privacy concerns, labels are very limited in many real-world scenarios. Directly applying standard representation learning approaches to small labeled data sets easily runs into over-fitting problems and leads to sub-optimal solutions. Even worse, in some domains such as education, the limited labels are usually annotated by multiple workers with diverse expertise, which yields noise and inconsistency in such crowdsourcing settings. In this paper, we propose a novel framework which aims to learn effective representations from limited data with crowdsourced labels. Specifically, we design a grouping-based deep neural network to learn embeddings from a limited number of training samples and present a Bayesian confidence estimator to capture the inconsistency among crowdsourced labels. Furthermore, to expedite the training process, we develop a hard example selection procedure to adaptively pick training examples that are misclassified by the model. Extensive experiments conducted on three real-world data sets demonstrate the superiority of our framework in learning representations from limited data with crowdsourced labels, compared with various state-of-the-art baselines. In addition, we provide a comprehensive analysis of each of the main components of our proposed framework and report the promising results it achieved in our production environment.