Goto

Collaborating Authors

 Statistical Learning


Learning Mixed Membership Mallows Models from Pairwise Comparisons

arXiv.org Machine Learning

We propose a novel parameterized family of Mixed Membership Mallows Models (M4) to account for variability in pairwise comparisons generated by a heterogeneous population of noisy and inconsistent users. M4 models individual preferences as a user-specific probabilistic mixture of shared latent Mallows components. Our key algorithmic insight for estimation is to establish a statistical connection between M4 and topic models by viewing pairwise comparisons as words, and users as documents. This key insight leads us to explore Mallows components with a separable structure and leverage recent advances in separable topic discovery. While separability appears to be overly restrictive, we nevertheless show that it is an inevitable outcome of a relatively small number of latent Mallows components in a world of large number of items. We then develop an algorithm based on robust extreme-point identification of convex polygons to learn the reference rankings, and is provably consistent with polynomial sample complexity guarantees. We demonstrate that our new model is empirically competitive with the current state-of-the-art approaches in predicting real-world preferences.


On model misspecification and KL separation for Gaussian graphical models

arXiv.org Machine Learning

We establish bounds on the KL divergence between two multivariate Gaussian distributions in terms of the Hamming distance between the edge sets of the corresponding graphical models. We show that the KL divergence is bounded below by a constant when the graphs differ by at least one edge; this is essentially the tightest possible bound, since classes of graphs exist for which the edge discrepancy increases but the KL divergence remains bounded above by a constant. As a natural corollary to our KL lower bound, we also establish a sample size requirement for correct model selection via maximum likelihood estimation. Our results rigorize the notion that it is essential to estimate the edge structure of a Gaussian graphical model accurately in order to approximate the true distribution to close precision.


Beta diffusion trees and hierarchical feature allocations

arXiv.org Machine Learning

We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.


The Approximation of the Dissimilarity Projection

arXiv.org Machine Learning

Diffusion magnetic resonance imaging (dMRI) data allow to reconstruct the 3D pathways of axons within the white matter of the brain as a tractography. The analysis of tractographies has drawn attention from the machine learning and pattern recognition communities providing novel challenges such as finding an appropriate representation space for the data. Many of the current learning algorithms require the input to be from a vectorial space. This requirement contrasts with the intrinsic nature of the tractography because its basic elements, called streamlines or tracks, have different lengths and different number of points and for this reason they cannot be directly represented in a common vectorial space. In this work we propose the adoption of the dissimilarity representation which is an Euclidean embedding technique defined by selecting a set of streamlines called prototypes and then mapping any new streamline to the vector of distances from prototypes. We investigate the degree of approximation of this projection under different prototype selection policies and prototype set sizes in order to characterise its use on tractography data. Additionally we propose the use of a scalable approximation of the most effective prototype selection policy that provides fast and accurate dissimilarity approximations of complete tractographies.


Gradient-based Hyperparameter Optimization through Reversible Learning

arXiv.org Machine Learning

Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.


Local Identification of Overcomplete Dictionaries

arXiv.org Machine Learning

This paper presents the first theoretical results showing that stable identification of overcomplete $\mu$-coherent dictionaries $\Phi \in \mathbb{R}^{d\times K}$ is locally possible from training signals with sparsity levels $S$ up to the order $O(\mu^{-2})$ and signal to noise ratios up to $O(\sqrt{d})$. In particular the dictionary is recoverable as the local maximum of a new maximisation criterion that generalises the K-means criterion. For this maximisation criterion results for asymptotic exact recovery for sparsity levels up to $O(\mu^{-1})$ and stable recovery for sparsity levels up to $O(\mu^{-2})$ as well as signal to noise ratios up to $O(\sqrt{d})$ are provided. These asymptotic results translate to finite sample size recovery results with high probability as long as the sample size $N$ scales as $O(K^3dS \tilde \varepsilon^{-2})$, where the recovery precision $\tilde \varepsilon$ can go down to the asymptotically achievable precision. Further, to actually find the local maxima of the new criterion, a very simple Iterative Thresholding and K (signed) Means algorithm (ITKM), which has complexity $O(dKN)$ in each iteration, is presented and its local efficiency is demonstrated in several experiments.


Strong oracle optimality of folded concave penalized estimation

arXiv.org Machine Learning

Folded concave penalization methods have been shown to enjoy the strong oracle property for high-dimensional sparse estimation. However, a folded concave penalization problem usually has multiple local solutions and the oracle property is established only for one of the unknown local solutions. A challenging fundamental issue still remains that it is not clear whether the local optimum computed by a given optimization algorithm possesses those nice theoretical properties. To close this important theoretical gap in over a decade, we provide a unified theory to show explicitly how to obtain the oracle solution via the local linear approximation algorithm. For a folded concave penalized estimation problem, we show that as long as the problem is localizable and the oracle estimator is well behaved, we can obtain the oracle estimator by using the one-step local linear approximation. In addition, once the oracle estimator is obtained, the local linear approximation algorithm converges, namely it produces the same estimator in the next iteration. The general theory is demonstrated by using four classical sparse estimation problems, that is, sparse linear regression, sparse logistic regression, sparse precision matrix estimation and sparse quantile regression.


Signatures of Infinity: Nonergodicity and Resource Scaling in Prediction, Complexity, and Learning

arXiv.org Machine Learning

Truly complex stochastic processes--the infinitary processes [1] whose mutual information between past and future diverges--arise in many physical and biological systems [2-5], such as those in critical states. They are implicated in many natural phenomena, from the geophysics of earthquakes [6] and physiological measurements of neural avalanches [7] to semantics in natural language [8] and cascading failure in power transmission grids [9]. Their apparent infinite memory makes empirical estimation and modeling particularly challenging. The difficulty is reflected in the computational complexity of inference [10]: the resources required to predict and model them diverge in sample size, in memory for storing model parameters, and in memory required for prediction. Resource scaling, an analog of the venerable technique of finite-size scaling in statistical mechanics, suggests that for infinitary processes we look for statistical signatures that track divergences. Since resource divergences are sensitive to a process's inherent randomness and organization, one hopes that their scaling forms are uniquely revealing indicators of process complexity and can guide the selection of appropriate models. To date, though, there are few tractable constructions with which to explore possible general relationships between prediction, complexity, and learning for infinitary processes.


Bayesian Clustering of Shapes of Curves

arXiv.org Machine Learning

The general goal here is to choose groups (clusters) of objects so as to maximize homogeneity within clusters and minimize homogeneity across clusters. The clustering problem has been addressed by researchers in many disciplines. A few well-known methods are metric based e.g. K-means (MacQueen et al., 1967), hierarchical clustering (Ward, 1963), clustering based on principal components, spectral clustering (Ng et al., 2002) and so on (Jain and Dubes, 1988; Ozawa, 1985). Traditional clustering methods are complemented by methods based on a probability model where one assumes a data generating distribution (e.g., Gaussian) and infers clustering configurations that maximize certain objective function (Banfield and Raftery, 1993; Fraley and Raftery, 1998, 2002, 2006; MacCullagh and Yang, 2008). A modelbased clustering can be useful in addressing challenges posed by traditional clustering methods. This is because a probability model allows the number of clusters to be treated as a parameter in the model, and can be embedded in a Bayesian framework providing quantification of uncertainty in the number of clusters and clustering configurations.


Iterative Regularization for Learning with Convex Loss Functions

arXiv.org Machine Learning

We consider the problem of supervised learning with convex loss functions and propose a new form of iterative regularization based on the subgradient method. Unlike other regularization approaches, in iterative regularization no constraint or penalization is considered, and generalization is achieved by (early) stopping an empirical iteration. We consider a nonparametric setting, in the framework of reproducing kernel Hilbert spaces, and prove finite sample bounds on the excess risk under general regularity conditions. Our study provides a new class of efficient regularized learning algorithms and gives insights on the interplay between statistics and optimization in machine learning.