Collaborating Authors

 Caramanis, Constantine


Robust compressed sensing of generative models

arXiv.org Machine Learning

The goal of compressed sensing is to estimate a high-dimensional vector from an underdetermined system of noisy linear equations. In analogy to classical compressed sensing, here we assume a generative model as a prior: that is, we assume the vector is represented by a deep generative model $G: \mathbb{R}^k \rightarrow \mathbb{R}^n$. Classical recovery approaches such as empirical risk minimization (ERM) are guaranteed to succeed when the measurement matrix is sub-Gaussian. However, when the measurement matrix and measurements are heavy-tailed or have outliers, recovery may fail dramatically. In this paper we propose an algorithm inspired by the Median-of-Means (MOM). Our algorithm guarantees recovery for heavy-tailed data, even in the presence of outliers. Theoretically, our results show that our MOM-based algorithm enjoys the same sample complexity guarantees as ERM under sub-Gaussian assumptions. Our experiments validate both aspects of our claims: other algorithms are indeed fragile and fail under heavy-tailed and/or corrupted data, while our approach exhibits the predicted robustness.
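As a rough illustration (not the paper's actual algorithm), a minimal NumPy sketch of the median-of-means idea behind the estimator: split the measurements into batches, average the squared residuals within each batch, and take the median of the batch averages, so that a few heavy-tailed rows or outliers cannot dominate the loss. The linear "generator", batch count, and all sizes below are placeholders.

```python
import numpy as np

def mom_loss(z, G, A, y, n_batches=10, rng=None):
    """Median-of-means surrogate for the recovery loss ||A G(z) - y||^2.

    Rather than averaging squared residuals over all measurements (which a
    few corrupted rows can dominate), split the measurements into disjoint
    batches, average within each batch, and return the median batch mean.
    """
    rng = np.random.default_rng(rng)
    residuals = (A @ G(z) - y) ** 2            # per-measurement squared errors
    idx = rng.permutation(len(residuals))      # random batch assignment
    batches = np.array_split(idx, n_batches)
    return np.median([residuals[b].mean() for b in batches])

# Toy usage with a linear stand-in for a deep generative model.
k, n, m = 5, 50, 200
rng = np.random.default_rng(0)
W = rng.normal(size=(n, k))
G = lambda z: W @ z                            # placeholder generator
z_true = rng.normal(size=k)
A = rng.normal(size=(m, n))
y = A @ G(z_true) + 0.01 * rng.normal(size=m)
y[:3] += 100.0                                 # gross outliers in a few measurements
print(mom_loss(z_true, G, A, y))               # remains small despite the outliers
```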


Robust Estimation of Tree Structured Ising Models

arXiv.org Machine Learning

We consider the task of learning Ising models when the signs of different random variables are flipped independently with possibly unequal, unknown probabilities. In this paper, we focus on the problem of robust estimation of tree-structured Ising models. Without any additional assumptions or side information, this is an open problem. We first prove that this problem is unidentifiable; however, the unidentifiability is limited to a small equivalence class of trees formed by leaf nodes exchanging positions with their neighbors. Next, we propose an algorithm to solve the above problem with logarithmic sample complexity in the number of nodes and polynomial run-time complexity. Lastly, we empirically demonstrate that, as expected, existing algorithms are not inherently robust in the proposed setting, whereas our algorithm correctly recovers the underlying equivalence class.
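For context, a sketch of the standard (non-robust) Chow-Liu-style baseline that such robustness results are contrasted with: for tree-structured Ising models, a maximum-weight spanning tree on the absolute empirical correlations recovers the tree, but this baseline is not robust when signs are independently flipped. This is a generic baseline, not the paper's algorithm.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_tree(samples):
    """Estimate a tree Ising structure from an (n_samples, n_nodes) array of +/-1 samples.

    Maximum-weight spanning tree on |empirical correlation|, implemented as a
    minimum spanning tree on (2 - |correlation|) so all edge weights are positive.
    """
    corr = np.corrcoef(samples, rowvar=False)
    weights = 2.0 - np.abs(corr)
    np.fill_diagonal(weights, 0.0)              # zero entries are treated as absent edges
    mst = minimum_spanning_tree(weights)
    rows, cols = mst.nonzero()
    return sorted({tuple(sorted((int(i), int(j)))) for i, j in zip(rows, cols)})
```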


On the Minimax Optimality of the EM Algorithm for Learning Two-Component Mixed Linear Regression

arXiv.org Machine Learning

We study the convergence rates of the EM algorithm for learning two-component mixed linear regression under all regimes of signal-to-noise ratio (SNR). We resolve a long-standing question that many recent results have attempted to tackle: we completely characterize the convergence behavior of EM, and show that the EM algorithm achieves minimax optimal sample complexity under all SNR regimes. In particular, when the SNR is sufficiently large, the EM updates converge to the true parameter $\theta^{*}$ at the standard parametric convergence rate $\mathcal{O}((d/n)^{1/2})$ after $\mathcal{O}(\log(n/d))$ iterations. In the regime where the SNR is above $\mathcal{O}((d/n)^{1/4})$ and below some constant, the EM iterates converge to a $\mathcal{O}({\rm SNR}^{-1} (d/n)^{1/2})$ neighborhood of the true parameter, when the number of iterations is of the order $\mathcal{O}({\rm SNR}^{-2} \log(n/d))$. In the low SNR regime where the SNR is below $\mathcal{O}((d/n)^{1/4})$, we show that EM converges to a $\mathcal{O}((d/n)^{1/4})$ neighborhood of the true parameter, after $\mathcal{O}((n/d)^{1/2})$ iterations. Notably, these results are achieved under mild conditions of either random initialization or an efficiently computable local initialization. By providing tight convergence guarantees of the EM algorithm in middle-to-low SNR regimes, we fill the remaining gap in the literature and, significantly, reveal that in low SNR, EM changes rate, matching the $n^{-1/4}$ rate of the MLE, a behavior that previous work had been unable to show.
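A compact NumPy sketch of the EM iteration in question, written for the symmetric, balanced model $y_i = R_i \langle \theta^{*}, x_i \rangle + \varepsilon_i$ with $R_i = \pm 1$ equally likely and Gaussian noise: the E-step posterior collapses to a tanh weight and the M-step is an ordinary least-squares solve against the reweighted responses. The noise level and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def em_mlr(X, y, theta0, sigma=1.0, n_iters=50):
    """EM for symmetric two-component mixed linear regression
    y_i = R_i * <theta*, x_i> + noise, with R_i = +/-1 w.p. 1/2 each."""
    theta = theta0.copy()
    XtX_inv = np.linalg.inv(X.T @ X)
    for _ in range(n_iters):
        w = np.tanh(y * (X @ theta) / sigma**2)   # E-step: E[R_i | x_i, y_i; theta]
        theta = XtX_inv @ (X.T @ (w * y))         # M-step: least squares on reweighted y
    return theta

# Toy run: EM recovers theta* up to a global sign (the sign is unidentifiable).
rng = np.random.default_rng(1)
d, n = 5, 2000
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
signs = rng.choice([-1.0, 1.0], size=n)
y = signs * (X @ theta_star) + 0.1 * rng.normal(size=n)
theta_hat = em_mlr(X, y, theta0=rng.normal(size=d), sigma=0.1)
print(min(np.linalg.norm(theta_hat - theta_star),
          np.linalg.norm(theta_hat + theta_star)))
```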


Fast Algorithms for Robust PCA via Gradient Descent

Neural Information Processing Systems

We consider the problem of Robust PCA in the fully and partially observed settings. Without corruptions, this is the well-known matrix completion problem. From a statistical standpoint this problem has recently been well studied, and the conditions under which recovery is possible via polynomial-time algorithms (how many observations we need, how many corruptions we can tolerate) are by now understood. This paper presents and analyzes a non-convex optimization approach that greatly reduces the computational complexity of the above problems, compared to the best available algorithms. In particular, in the fully observed case, with $r$ denoting rank and $d$ dimension, we reduce the complexity from $O(r^2 d^2 \log(1/\epsilon))$ to $O(r d^2 \log(1/\epsilon))$ -- a significant saving when the rank is large.
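A simplified sketch of this non-convex template (not the paper's exact algorithm): alternate between estimating the sparse corruptions by hard-thresholding the largest residual entries and taking gradient steps on the low-rank factors $U, V$ with $M \approx UV^\top + S$. The corruption fraction, step size, and random initialization below are placeholders; the actual method uses a more careful per-row/column thresholding rule and initialization.

```python
import numpy as np

def hard_threshold(R, frac):
    """Keep only the largest-magnitude entries of R (a crude sparse estimate)."""
    S = np.zeros_like(R)
    k = int(frac * R.size)
    if k > 0:
        idx = np.unravel_index(np.argsort(-np.abs(R), axis=None)[:k], R.shape)
        S[idx] = R[idx]
    return S

def robust_pca_gd(M, rank, frac=0.05, eta=1e-3, n_iters=500):
    """Non-convex robust PCA sketch: fit M ~ U V^T + S with S sparse."""
    d1, d2 = M.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=1.0 / np.sqrt(d1), size=(d1, rank))
    V = rng.normal(scale=1.0 / np.sqrt(d2), size=(d2, rank))
    for _ in range(n_iters):
        S = hard_threshold(M - U @ V.T, frac)   # estimate the sparse corruptions
        R = U @ V.T + S - M                     # residual of the current fit
        U, V = U - eta * (R @ V), V - eta * (R.T @ U)   # gradient steps on the factors
    return U, V, S
```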


EM Algorithm is Sample-Optimal for Learning Mixtures of Well-Separated Gaussians

arXiv.org Machine Learning

We consider the problem of learning spherical Gaussian mixture models with $k \geq 3$ components when the components are well separated. A fundamental previous result established that separation of $\Omega(\sqrt{\log k})$ is necessary and sufficient for identifiability of the parameters with polynomial sample complexity (Regev and Vijayaraghavan, 2017). We show that $\tilde{O}(kd/\epsilon^2)$ samples suffice, closing the gap from polynomial to linear, and thus giving the first sample-optimal upper bound for the parameter estimation of well-separated Gaussian mixtures (up to logarithmic factors). We accomplish this by proving a new result for the Expectation-Maximization (EM) algorithm: we show that EM converges locally under separation $\Omega(\sqrt{\log k})$. The previous best-known guarantee required $\Omega(\sqrt{k})$ separation (Yan et al., 2017). Unlike prior work, our results do not assume or use prior knowledge of the (potentially different) mixing weights or variances of the Gaussian components. Furthermore, our results show that the finite-sample error of EM does not depend on non-universal quantities such as pairwise distances between the means of the Gaussian components.
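For reference, the classical EM iteration for spherical Gaussian mixtures whose local behavior is analyzed here, with per-component means, variances, and mixing weights all updated (matching the setting where these are unknown). This sketch does not reproduce the initialization scheme or separation conditions that the guarantees rely on.

```python
import numpy as np
from scipy.special import logsumexp

def em_spherical_gmm(X, mus, sigmas, weights, n_iters=100):
    """Classical EM for a mixture of spherical Gaussians in R^d.

    X: (n, d) data; mus: (k, d) initial means; sigmas: (k,) standard
    deviations; weights: (k,) mixing weights.  Returns updated parameters.
    """
    n, d = X.shape
    for _ in range(n_iters):
        # E-step: log responsibilities log p(component j | x_i), up to a shared constant
        sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)          # (n, k)
        log_r = np.log(weights) - d * np.log(sigmas) - sq / (2 * sigmas**2)
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        r = np.exp(log_r)                                              # (n, k)
        # M-step: reweighted means, spherical variances, and mixing weights
        nk = r.sum(0)
        mus = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
        sigmas = np.sqrt((r * sq).sum(0) / (d * nk))
        weights = nk / n
    return mus, sigmas, weights
```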


Communication-Efficient Asynchronous Stochastic Frank-Wolfe over Nuclear-norm Balls

arXiv.org Machine Learning

Large-scale machine learning training suffers from two challenges, especially for nuclear-norm constrained problems on distributed systems: the synchronization slowdown caused by straggling workers, and high communication costs. In this work, we propose an asynchronous Stochastic Frank-Wolfe (SFW-asyn) method which, for the first time, addresses both problems simultaneously while maintaining the same convergence rate as the vanilla SFW. We implement our algorithm in Python (with MPI) to run on Amazon EC2, and demonstrate that SFW-asyn yields speed-ups that are almost linear in the number of machines compared to the vanilla SFW.
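A minimal, synchronous sketch of the vanilla stochastic Frank-Wolfe step this builds on: over a nuclear-norm ball, the linear minimization oracle only needs the top singular vector pair of the (stochastic) gradient, which is what makes FW attractive compared to a full projection. The asynchronous, MPI-based scheduling of the paper is not reproduced, and `grad_fn` is a user-supplied stochastic gradient routine assumed for illustration.

```python
import numpy as np
from scipy.sparse.linalg import svds

def nuclear_lmo(G, radius):
    """Linear minimization oracle over {X : ||X||_* <= radius}.

    The minimizer of <S, G> over the nuclear-norm ball is -radius * u v^T,
    where (u, v) is the top singular vector pair of the gradient G.
    """
    u, _, vt = svds(G, k=1)
    return -radius * (u @ vt)

def stochastic_frank_wolfe(grad_fn, shape, radius, n_iters=200, batch=64):
    """Vanilla (synchronous) stochastic Frank-Wolfe over a nuclear-norm ball."""
    X = np.zeros(shape)
    for t in range(n_iters):
        G = grad_fn(X, batch)                 # stochastic gradient estimate
        S = nuclear_lmo(G, radius)            # cheap LMO: one top singular pair
        gamma = 2.0 / (t + 2.0)               # standard FW step size
        X = (1 - gamma) * X + gamma * S
    return X
```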


Mix and Match: An Optimistic Tree-Search Approach for Learning Models from Mixture Distributions

arXiv.org Machine Learning

We consider a covariate shift problem where one has access to several marginally different training datasets for the same learning problem and a small validation set which possibly differs from all the individual training distributions. This covariate shift is caused, in part, by unobserved features in the datasets. The objective, then, is to find the best mixture distribution over the training datasets (with only observed features) such that training a learning algorithm using this mixture has the best validation performance. Our proposed algorithm, ${\sf Mix\&Match}$, combines stochastic gradient descent (SGD) with optimistic tree search and model re-use (evolving partially trained models with samples from different mixture distributions) over the space of mixtures for this task. We prove simple regret guarantees for our algorithm with respect to recovering the optimal mixture, given a total budget of SGD evaluations. Finally, we validate our algorithm on two real-world datasets.
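To make the setup concrete, a toy sketch of the inner routine such a mixture search would query: given mixture weights over the training datasets, run SGD on minibatches drawn according to those weights and report validation loss. The linear model, squared loss, and hyperparameters are illustrative stand-ins; the paper's algorithm additionally searches over the weights with an optimistic tree search and re-uses partially trained models.

```python
import numpy as np

def sgd_on_mixture(alpha, datasets, val_X, val_y, lr=0.1, steps=500, batch=32, rng=None):
    """Train a linear least-squares model with SGD on minibatches drawn from a
    mixture of datasets with weights alpha; return its validation loss.

    datasets: list of (X, y) pairs sharing the same feature dimension.
    """
    rng = np.random.default_rng(rng)
    d = datasets[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(steps):
        k = rng.choice(len(datasets), p=alpha)       # pick a source dataset
        X, y = datasets[k]
        idx = rng.integers(len(y), size=batch)       # minibatch from that source
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch          # squared-loss gradient
        w -= lr * grad
    return np.mean((val_X @ w - val_y) ** 2), w
```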


More Supervision, Less Computation: Statistical-Computational Tradeoffs in Weakly Supervised Learning

arXiv.org Machine Learning

We consider the weakly supervised binary classification problem where the labels are randomly flipped with probability $1-\alpha$. Although there exist numerous algorithms for this problem, it remains theoretically unexplored how the statistical accuracy and computational efficiency of these algorithms depend on the degree of supervision, which is quantified by $\alpha$. In this paper, we characterize the effect of $\alpha$ by establishing the information-theoretic and computational boundaries, namely, the minimax-optimal statistical accuracy that can be achieved by all algorithms, and by polynomial-time algorithms under an oracle computational model. For small $\alpha$, our result shows a gap between these two boundaries, which represents the computational price of achieving the information-theoretic boundary due to the lack of supervision. Interestingly, we also show that this gap narrows as $\alpha$ increases. In other words, having more supervision, i.e., more correct labels, not only improves the optimal statistical accuracy as expected, but also enhances the computational efficiency for achieving such accuracy.


Disentangling Mixtures of Epidemics on Graphs

arXiv.org Machine Learning

We consider the problem of learning the weighted edges of a mixture of two graphs from epidemic cascades. This is a natural setting in the context of social networks, where a post created by one user spreads on different graphs depending on whether it is about basketball or about politics. However, very little is known about whether this problem is solvable. To the best of our knowledge, we establish the first conditions under which this problem can be solved, and provide conditions under which the problem is provably not solvable. When the conditions are met, i.e., when the graphs are connected, with distinct edges, and have at least three edges, we give an efficient algorithm for learning the weights of both graphs with almost optimal sample complexity (up to log factors). We extend the results to the setting in which the priors of the mixture are unknown and obtain similar guarantees.
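To illustrate the data-generating setting (not the recovery algorithm), a small simulation sketch: each cascade first picks which of the two weighted graphs it spreads on, then spreads from a random seed in an independent-cascade style process where edge weights act as infection probabilities. The exact epidemic model used in the paper may differ; this is only an assumed illustration.

```python
import numpy as np

def simulate_cascade(graph, seed, rng):
    """One independent-cascade epidemic: each newly infected node tries to
    infect each neighbor once, with the edge weight as the probability.

    graph: dict node -> {neighbor: infection probability}.
    """
    infected, frontier, order = {seed}, [seed], [seed]
    while frontier:
        new_frontier = []
        for u in frontier:
            for v, p in graph.get(u, {}).items():
                if v not in infected and rng.random() < p:
                    infected.add(v)
                    new_frontier.append(v)
                    order.append(v)
        frontier = new_frontier
    return order

def sample_mixture_cascades(graph_a, graph_b, n_nodes, n_cascades, prior=0.5, rng=None):
    """Cascades from a mixture of two graphs: each cascade first picks which
    graph it spreads on (e.g. 'basketball' vs 'politics'), then a random seed."""
    rng = np.random.default_rng(rng)
    cascades = []
    for _ in range(n_cascades):
        graph = graph_a if rng.random() < prior else graph_b
        cascades.append(simulate_cascade(graph, seed=int(rng.integers(n_nodes)), rng=rng))
    return cascades
```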


Primal-Dual Block Frank-Wolfe

arXiv.org Machine Learning

In particular, we are interested in problems whose solution has special "simple" structure, like low rank or sparsity. The sparsity constraint applies to large-scale multiclass/multi-label classification, low-degree polynomial data mapping [5], random feature kernel machines [32], and Elastic Net [39]. Motivated by recent applications in low-rank multi-class SVM, phase retrieval, matrix completion, affine rank minimization and other problems (e.g., [9, 31, 2, 3]), we also consider settings where the constraint set $x \in C$ (e.g., the trace norm ball), while convex, may be difficult to project onto. A wish-list for this class of problems would include an algorithm that (1) exploits the finite-sum form of the function and the simple structure of the solution, (2) achieves linear convergence for smooth and strongly convex problems, and (3) does not pay a heavy price for the projection step. We propose a Frank-Wolfe (FW) type method that attains these three goals. This does not come without challenges: although it is currently well-appreciated that FW type algorithms avoid the cost of projection [14, 1], the benefits are limited to constraints that are hard to project onto, like the trace norm ball. For problems like phase retrieval and ERM for multi-label/multi-class classification, the gradient computation requires large matrix multiplications. This dominates the per-iteration cost, and the existing FW type methods do not asymptotically reduce the time complexity per iteration, even without paying for the expensive projection step.