Goto

Collaborating Authors

 Jordan


Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences

Neural Information Processing Systems

We provide two fundamental results on the population (infinite-sample) likelihood function of Gaussian mixture models with $M \geq 3$ components. Our first main result shows that the population likelihood function has bad local maxima even in the special case of equally-weighted mixtures of well-separated and spherical Gaussians. We prove that the log-likelihood value of these bad local maxima can be arbitrarily worse than that of any global optimum, thereby resolving an open question of Srebro (2007). Our second main result shows that the EM algorithm (or a first-order variant of it) with random initialization will converge to bad critical points with probability at least $1-e^{-\Omega(M)}$. We further establish that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points. Overall, our results highlight the necessity of careful initialization when using the EM algorithm in practice, even when applied in highly favorable settings.


Cyclades: Conflict-free Asynchronous Machine Learning

Neural Information Processing Systems

We present Cyclades, a general framework for parallelizing stochastic optimization algorithms in a shared memory setting. Cyclades is asynchronous during model updates, and requires no memory locking mechanisms, similar to Hogwild!-type algorithms. Unlike Hogwild!, Cyclades introduces no conflicts during parallel execution, and offers a black-box analysis for provable speedups across a large family of algorithms. Due to its inherent cache locality and conflict-free nature, our multi-core implementation of Cyclades consistently outperforms Hogwild!-type algorithms on sufficiently sparse datasets, leading to up to 40% speedup gains compared to Hogwild!, and up to 5\times gains over asynchronous implementations of variance reduction algorithms.


Stochastic Cubic Regularization for Fast Nonconvex Optimization

Neural Information Processing Systems

This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only $\mathcal{\tilde{O}}(\epsilon^{-3.5})$ stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the $\mathcal{\tilde{O}}(\epsilon^{-4})$ rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.


On the Local Minima of the Empirical Risk

Neural Information Processing Systems

Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex non-smooth losses (such as modern deep networks), the population risk is generally significantly more well behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework which aims to optimize a smooth nonconvex function $F$ (population risk) given only access to an approximation $f$ (empirical risk) that is pointwise close to $F$ (i.e., $\norm{F-f}_{\infty} \le \nu$). Our objective is to find the $\epsilon$-approximate local minima of the underlying function $F$ while avoiding the shallow local minima---arising because of the tolerance $\nu$---which exist only in $f$. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of $f$ that is guaranteed to achieve our goal as long as $\nu \le O(\epsilon^{1.5}/d)$. We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance $\nu$ among all algorithms making a polynomial number of queries of $f$. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.


Is Q-Learning Provably Efficient?

Neural Information Processing Systems

Model-free reinforcement learning (RL) algorithms directly parameterize and update value functions or policies, bypassing the modeling of the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that they require large numbers of samples to learn. The theoretical question of whether not model-free algorithms are in fact \emph{sample efficient} is one of the most fundamental questions in RL. The problem is unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tlO(\sqrt{H^3 SAT})$ where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. Our regret matches the optimal regret up to a single $\sqrt{H}$ factor. Thus we establish the sample efficiency of a classical model-free approach. Moreover, to the best of our knowledge, this is the first model-free analysis to establish $\sqrt{T}$ regret \emph{without} requiring access to a ``simulator.''


Theoretical guarantees for EM under misspecified Gaussian mixture models

Neural Information Processing Systems

Recent years have witnessed substantial progress in understanding the behavior of EM for mixture models that are correctly specified. Given that model misspecification is common in practice, it is important to understand EM in this more general setting. We provide non-asymptotic guarantees for population and sample-based EM for parameter estimation under a few specific univariate settings of misspecified Gaussian mixture models. Due to misspecification, the EM iterates no longer converge to the true model and instead converge to the projection of the true model over the set of models being searched over. We provide two classes of theoretical guarantees: first, we characterize the bias introduced due to the misspecification; and second, we prove that population EM converges at a geometric rate to the model projection under a suitable initialization condition. This geometric convergence rate for population EM imply a statistical complexity of order $1/\sqrt{n}$ when running EM with $n$ samples.


Conditional Adversarial Domain Adaptation

Neural Information Processing Systems

Adversarial learning has been embedded into deep networks to learn disentangled and transferable representations for domain adaptation. Existing adversarial domain adaptation methods may struggle to align different domains of multimodal distributions that are native in classification problems. In this paper, we present conditional adversarial domain adaptation, a principled framework that conditions the adversarial adaptation models on discriminative information conveyed in the classifier predictions. Conditional domain adversarial networks (CDANs) are designed with two novel conditioning strategies: multilinear conditioning that captures the cross-covariance between feature representations and classifier predictions to improve the discriminability, and entropy conditioning that controls the uncertainty of classifier predictions to guarantee the transferability. Experiments testify that the proposed approach exceeds the state-of-the-art results on five benchmark datasets.


Information Constraints on Auto-Encoding Variational Bayes

Neural Information Processing Systems

Parameterizing the approximate posterior of a generative model with neural networks has become a common theme in recent machine learning research. While providing appealing flexibility, this approach makes it difficult to impose or assess structural constraints such as conditional independence. We propose a framework for learning representations that relies on Auto-Encoding Variational Bayes and whose search space is constrained via kernel-based measures of independence. In particular, our method employs the $d$-variable Hilbert-Schmidt Independence Criterion (dHSIC) to enforce independence between the latent representations and arbitrary nuisance factors. We show how to apply this method to a range of problems, including the problems of learning invariant representations and the learning of interpretable representations. We also present a full-fledged application to single-cell RNA sequencing (scRNA-seq).



Variational PDEs for Acceleration on Manifolds and Application to Diffeomorphisms

Neural Information Processing Systems

We consider the optimization of cost functionals on manifolds and derive a variational approach to accelerated methods on manifolds. We demonstrate the methodology on the infinite-dimensional manifold of diffeomorphisms, motivated by registration problems in computer vision. We build on the variational approach to accelerated optimization by Wibisono, Wilson and Jordan, which applies in finite dimensions, and generalize that approach to infinite dimensional manifolds. We derive the continuum evolution equations, which are partial differential equations (PDE), and relate them to simple mechanical principles. Our approach can also be viewed as a generalization of the $L^2$ optimal mass transport problem. Our approach evolves an infinite number of particles endowed with mass, represented as a mass density.