Goto

Collaborating Authors

 Statistical Learning


Localized Data Fusion for Kernel k-Means Clustering with Application to Cancer Biology

Neural Information Processing Systems

In many modern applications from, for example, bioinformatics and computer vision, samples have multiple feature representations coming from different data sources. Multiview learning algorithms try to exploit all these available information to obtain a better learner in such scenarios. In this paper, we propose a novel multiple kernel learning algorithm that extends kernel k-means clustering to the multiview setting, which combines kernels calculated on the views in a localized way to better capture sample-specific characteristics of the data. We demonstrate the better performance of our localized data fusion approach on a human colon and rectal cancer data set by clustering patients. Our method finds more relevant prognostic patient groups than global data fusion methods when we evaluate the results with respect to three commonly used clinical biomarkers.


DFacTo: Distributed Factorization of Tensors

Neural Information Processing Systems

We present a technique for significantly speeding up Alternating Least Squares (ALS) and Gradient Descent (GD), two widely used algorithms for tensor factorization. Byexploiting properties of the Khatri-Rao product, we show how to efficiently address a computationally challenging sub-step of both algorithms. Our algorithm, DFacTo, only requires two sparse matrix-vector products and is easy to parallelize. DFacTo is not only scalable but also on average 4 to 10 times faster than competing algorithms on a variety of datasets. For instance, DFacTo only takes 480 seconds on 4 machines to perform one iteration of the ALS algorithm and 1,143 seconds to perform one iteration of the GD algorithm on a 6.5 million 2.5 million 1.5 million dimensional tensor with 1.2 billion nonzero entries.


The Large Margin Mechanism for Differentially Private Maximization

Neural Information Processing Systems

A basic problem in the design of privacy-preserving algorithms is the \emph{private maximization problem}: the goal is to pick an item from a universe that (approximately) maximizes a data-dependent function, all under the constraint of differential privacy. This problem has been used as a sub-routine in many privacy-preserving algorithms for statistics and machine learning. Previous algorithms for this problem are either range-dependent---i.e., their utility diminishes with the size of the universe---or only apply to very restricted function classes. This work provides the first general purpose, range-independent algorithm for private maximization that guarantees approximate differential privacy. Its applicability is demonstrated on two fundamental tasks in data mining and machine learning.


Extended and Unscented Gaussian Processes

Neural Information Processing Systems

We present two new methods for inference in Gaussian process (GP) models with general nonlinear likelihoods. Inference is based on a variational framework where a Gaussian posterior is assumed and the likelihood is linearized about the variational posterior mean using either a Taylor series expansion or statistical linearization. We show that the parameter updates obtained by these algorithms are equivalent to the state update equations in the iterative extended and unscented Kalman filters respectively, hence we refer to our algorithms as extended and unscented GPs. The unscented GP treats the likelihood as a 'black-box' by not requiring its derivative for inference, so it also applies to non-differentiable likelihood models. We evaluate the performance of our algorithms on a number of synthetic inversion problems and a binary classification dataset.


Divide-and-Conquer Learning by Anchoring a Conical Hull

Neural Information Processing Systems

We reduce a broad class of machine learning problems, usually addressed by EM or sampling, to the problem of finding the $k$ extremal rays spanning the conical hull of a data point set. These $k$ ``anchors'' lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the $k$ anchors, we propose a novel divide-and-conquer learning scheme ``DCA'' that distributes the problem to $\mathcal O(k\log k)$ same-type sub-problems on different low-D random hyperplanes, each can be solved by any solver. For the 2D sub-problem, we present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other methods to check whether a point is covered in a conical hull, which improves algorithm design in multiple dimensions and brings significant speedup to learning. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, then show its competitive performance and scalability over other methods on rich datasets.


Recovery of Coherent Data via Low-Rank Dictionary Pursuit

Neural Information Processing Systems

The recently established RPCA method provides a convenient way to restore low-rank matrices from grossly corrupted observations. While elegant in theory and powerful in reality, RPCA is not an ultimate solution to the low-rank matrix recovery problem. Indeed, its performance may not be perfect even when data are strictly low-rank. This is because RPCA ignores clustering structures of the data which are ubiquitous in applications. As the number of cluster grows, the coherence of data keeps increasing, and accordingly, the recovery performance of RPCA degrades. We show that the challenges raised by coherent data (i.e., data with high coherence) could be alleviated by Low-Rank Representation (LRR)~\cite{tpami_2013_lrr}, provided that the dictionary in LRR is configured appropriately. More precisely, we mathematically prove that if the dictionary itself is low-rank then LRR is immune to the coherence parameter which increases with the underlying cluster number. This provides an elementary principle for dealing with coherent data and naturally leads to a practical algorithm for obtaining proper dictionaries in unsupervised environments. Experiments on randomly generated matrices and real motion sequences verify our claims. See the full paper at arXiv:1404.4032.


From Stochastic Mixability to Fast Rates

Neural Information Processing Systems

Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems where the data is generated according to some unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting, depending upon $(\ell, \mathcal{F},\mathsf{P})$ ERM can have slow $(1/\sqrt{n})$ or fast $(1/n)$ rates of convergence of the excess risk as a function of the sample size $n$. There exist several results that give sufficient conditions for fast rates in terms of joint properties of $\ell$, $\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss $\ell$ (there being no role there for $\mathcal{F}$ or $\mathsf{P}$). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of $(\ell,\mathcal{F}, \mathsf{P})$, and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the general moment problem. We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible.


Clustering from Labels and Time-Varying Graphs

Neural Information Processing Systems

We present a general framework for graph clustering where a label is observed to each pair of nodes. This allows a very rich encoding of various types of pairwise interactions between nodes. We propose a new tractable approach to this problem based on maximum likelihood estimator and convex optimization. We analyze our algorithm under a general generative model, and provide both necessary and sufficient conditions for successful recovery of the underlying clusters. Our theoretical results cover and subsume a wide range of existing graph clustering results including planted partition, weighted clustering and partially observed graphs. Furthermore, the result is applicable to novel settings including time-varying graphs such that new insights can be gained on solving these problems. Our theoretical findings are further supported by empirical results on both synthetic and real data.


Reducing the Rank in Relational Factorization Models by Including Observable Patterns

Neural Information Processing Systems

Tensor factorizations have become popular methods for learning from multi-relational data. In this context, the rank of a factorization is an important parameter that determines runtime as well as generalization ability. To determine conditions under which factorization is an efficient approach for learning from relational data, we derive upper and lower bounds on the rank required to recover adjacency tensors. Based on our findings, we propose a novel additive tensor factorization model for learning from latent and observable patterns in multi-relational data and present a scalable algorithm for computing the factorization. Experimentally, we show that the proposed approach does not only improve the predictive performance over pure latent variable methods but that it also reduces the required rank --- and therefore runtime and memory complexity --- significantly.


Dependent nonparametric trees for dynamic hierarchical clustering

Neural Information Processing Systems

Hierarchical clustering methods offer an intuitive and powerful way to model a wide variety of data sets. However, the assumption of a fixed hierarchy is often overly restrictive when working with data generated over a period of time: We expect both the structure of our hierarchy, and the parameters of the clusters, to evolve with time. In this paper, we present a distribution over collections of time-dependent, infinite-dimensional trees that can be used to model evolving hierarchies, and present an efficient and scalable algorithm for performing approximate inference in such a model. We demonstrate the efficacy of our model and inference algorithm on both synthetic data and real-world document corpora.