Goto

Collaborating Authors

 Bayesian Inference


Latent Distribution Assumption for Unbiased and Consistent Consensus Modelling

arXiv.org Machine Learning

We study the problem of aggregation noisy labels. Usually, it is solved by proposing a stochastic model for the process of generating noisy labels and then estimating the model parameters using the observed noisy labels. A traditional assumption underlying previously introduced generative models is that each object has one latent true label. In contrast, we introduce a novel latent distribution assumption, implying that a unique true label for an object might not exist, but rather each object might have a specific distribution generating a latent subjective label each time the object is observed. Our experiments showed that the novel assumption is more suitable for difficult tasks, when there is an ambiguity in choosing a "true" label for certain objects.


Bayesian inverse regression for supervised dimension reduction with small datasets

arXiv.org Machine Learning

We consider supervised dimension reduction problems, namely to identify a low dimensional projection of the predictors $\-x$ which can retain the statistical relationship between $\-x$ and the response variable $y$. We follow the idea of the sliced inverse regression (SIR) class of methods, which is to use the statistical information of the conditional distribution $\pi(\-x|y)$ to identify the dimension reduction (DR) space and in particular we focus on the task of computing this conditional distribution. We propose a Bayesian framework to compute the conditional distribution where the likelihood function is obtained using the Gaussian process regression model. The conditional distribution $\pi(\-x|y)$ can then be obtained directly by assigning weights to the original data points. We then can perform DR by considering certain moment functions (e.g. the first moment) of the samples of the posterior distribution. With numerical examples, we demonstrate that the proposed method is especially effective for small data problems.


Adversarial Self-Paced Learning for Mixture Models of Hawkes Processes

arXiv.org Machine Learning

We propose a novel adversarial learning strategy for mixture models of Hawkes processes, leveraging data augmentation techniques of Hawkes process in the framework of self-paced learning. Instead of learning a mixture model directly from a set of event sequences drawn from different Hawkes processes, the proposed method learns the target model iteratively, which generates "easy" sequences and uses them in an adversarial and self-paced manner. In each iteration, we first generate a set of augmented sequences from original observed sequences. Based on the fact that an easy sample of the target model can be an adversarial sample of a misspecified model, we apply a maximum likelihood estimation with an adversarial self-paced mechanism. In this manner the target model is updated, and the augmented sequences that obey it are employed for the next learning iteration. Experimental results show that the proposed method outperforms traditional methods consistently.


The Broad Optimality of Profile Maximum Likelihood

arXiv.org Machine Learning

We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size $k$ and desired accuracy $\varepsilon$: $\textbf{Distribution estimation}$ Under $\ell_1$ distance, PML yields optimal $\Theta(k/(\varepsilon^2\log k))$ sample complexity for sorted-distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual distribution; $\textbf{Additive property estimation}$ For a broad class of additive properties, the PML plug-in estimator uses just four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence; $\boldsymbol{\alpha}\textbf{-R\'enyi entropy estimation}$ For integer $\alpha>1$, the PML plug-in estimator has optimal $k^{1-1/\alpha}$ sample complexity; for non-integer $\alpha>3/4$, the PML plug-in estimator has sample complexity lower than the state of the art; $\textbf{Identity testing}$ In testing whether an unknown distribution is equal to or at least $\varepsilon$ far from a given distribution in $\ell_1$ distance, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of $k$. With minor modifications, most of these results also hold for a near-linear-time computable variant of PML.


Minimum Stein Discrepancy Estimators

arXiv.org Machine Learning

When maximum likelihood estimation is infeasible, one often turns to score matching, contrastive divergence, or minimum probability flow learning to obtain tractable parameter estimates. We provide a unifying perspective of these techniques as minimum Stein discrepancy estimators and use this lens to design new diffusion kernel Stein discrepancy (DKSD) and diffusion score matching (DSM) estimators with complementary strengths. We establish the consistency, asymptotic normality, and robustness of DKSD and DSM estimators, derive stochastic Riemannian gradient descent algorithms for their efficient optimization, and demonstrate their advantages over score matching in models with non-smooth densities or heavy tailed distributions.


Variational Gaussian Processes with Signature Covariances

arXiv.org Machine Learning

We introduce a Bayesian approach to learn from stream-valued data by using Gaussian processes with the recently introduced signature kernel as covariance function. To cope with the computational complexity in time and memory that arises with long streams that evolve in large state spaces, we develop a variational Bayes approach with sparse inducing tensors. We provide an implementation based on GPFlow and benchmark this variational Gaussian process model on supervised classification tasks for time series and text (a stream of words).


Efficient Algorithms for Set-Valued Prediction in Multi-Class Classification

arXiv.org Machine Learning

In cases of uncertainty, a multi-class classifier preferably returns a set of candidate classes instead of predicting a single class label with little guarantee. More precisely, the classifier should strive for an optimal balance between the correctness (the true class is among the candidates) and the precision (the candidates are not too many) of its prediction. We formalize this problem within a general decision-theoretic framework that unifies most of the existing work in this area. In this framework, uncertainty is quantified in terms of conditional class probabilities, and the quality of a predicted set is measured in terms of a utility function. We then address the problem of finding the Bayes-optimal prediction, i.e., the subset of class labels with highest expected utility. For this problem, which is computationally challenging as there are exponentially (in the number of classes) many predictions to choose from, we propose efficient algorithms that can be applied to a broad family of utility scores. Two of these algorithms make use of structural information in the form of a class hierarchy, which is often available in prediction problems with many classes. Our theoretical results are complemented by experimental studies, in which we analyze the proposed algorithms in terms of predictive accuracy and runtime efficiency.


On The Radon--Nikodym Spectral Approach With Optimal Clustering

arXiv.org Machine Learning

Problems of interpolation, classification, and clustering are considered. In the tenets of Radon--Nikodym approach $\langle f(\mathbf{x})\psi^2 \rangle / \langle\psi^2\rangle$, where the $\psi(\mathbf{x})$ is a linear function on input attributes, all the answers are obtained from a generalized eigenproblem $|f|\psi^{[i]}\rangle = \lambda^{[i]} |\psi^{[i]}\rangle$. The solution to the interpolation problem is a regular Radon-Nikodym derivative. The solution to the classification problem requires prior and posterior probabilities that are obtained using the Lebesgue quadrature[1] technique. Whereas in a Bayesian approach new observations change only outcome probabilities, in the Radon-Nikodym approach not only outcome probabilities but also the probability space $|\psi^{[i]}\rangle$ change with new observations. This is a remarkable feature of the approach: both the probabilities and the probability space are constructed from the data. The Lebesgue quadrature technique can be also applied to the optimal clustering problem. The problem is solved by constructing a Gaussian quadrature on the Lebesgue measure. A distinguishing feature of the Radon-Nikodym approach is the knowledge of the invariant group: all the answers are invariant relatively any non-degenerated linear transform of input vector $\mathbf{x}$ components. A software product implementing the algorithms of interpolation, classification, and optimal clustering is available from the authors.


Differentiable probabilistic models of scientific imaging with the Fourier slice theorem

arXiv.org Machine Learning

Scientific imaging techniques such as optical and electron microscopy and computed tomography (CT) scanning are used to study the 3D structure of an object through 2D observations. These observations are related to the original 3D object through orthogonal integral projections. For common 3D reconstruction algorithms, computational efficiency requires the modeling of the 3D structures to take place in Fourier space by applying the Fourier slice theorem. At present, it is unclear how to differentiate through the projection operator, and hence current learning algorithms can not rely on gradient based methods to optimize 3D structure models. In this paper we show how back-propagation through the projection operator in Fourier space can be achieved. We demonstrate the validity of the approach with experiments on 3D reconstruction of proteins. We further extend our approach to learning probabilistic models of 3D objects. This allows us to predict regions of low sampling rates or estimate noise. A higher sample efficiency can be reached by utilizing the learned uncertainties of the 3D structure as an unsupervised estimate of the model fit. Finally, we demonstrate how the reconstruction algorithm can be extended with an amortized inference scheme on unknown attributes such as object pose. Through empirical studies we show that joint inference of the 3D structure and the object pose becomes more difficult when the ground truth object contains more symmetries. Due to the presence of for instance (approximate) rotational symmetries, the pose estimation can easily get stuck in local optima, inhibiting a fine-grained high-quality estimate of the 3D structure.


Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models

arXiv.org Machine Learning

Consider semi-supervised learning for classification, where both labeled and unlabeled data are available for training. The goal is to exploit both datasets to achieve higher prediction accuracy than just using labeled data alone. We develop a semi-supervised logistic learning method based on exponential tilt mixture models, by extending a statistical equivalence between logistic regression and exponential tilt modeling. We study maximum nonparametric likelihood estimation and derive novel objective functions which are shown to be Fisher consistent. We also propose regularized estimation and construct simple and highly interpretable EM algorithms. Finally, we present numerical results which demonstrate the advantage of the proposed methods compared with existing methods.