Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences

arXiv.org Machine Learning

We provide two fundamental results on the population (infinite-sample) likelihood function of Gaussian mixture models with $M \geq 3$ components. Our first main result shows that the population likelihood function has bad local maxima even in the special case of equally-weighted mixtures of well-separated and spherical Gaussians. We prove that the log-likelihood value of these bad local maxima can be arbitrarily worse than that of any global optimum, thereby resolving an open question of Srebro (2007). Our second main result shows that the EM algorithm (or a first-order variant of it) with random initialization will converge to bad critical points with probability at least $1-e^{-\Omega(M)}$. We further establish that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points. Overall, our results highlight the necessity of careful initialization when using the EM algorithm in practice, even when applied in highly favorable settings.

Statistical Guarantees for Estimating the Centers of a Two-component Gaussian Mixture by EM

arXiv.org Machine Learning

Recently, a general method for analyzing the statistical accuracy of the EM algorithm has been developed and applied to some simple latent variable models [Balakrishnan et al. 2016]. In that method, the basin of attraction for valid initialization is required to be a ball around the truth. Using Stein's Lemma, we extend these results in the case of estimating the centers of a two-component Gaussian mixture in $d$ dimensions. In particular, we significantly expand the basin of attraction to be the intersection of a half space and a ball around the origin. If the signal-to-noise ratio is at least a constant multiple of $ \sqrt{d\log d} $, we show that a random initialization strategy is feasible.

On the Behavior of the Expectation-Maximization Algorithm for Mixture Models

arXiv.org Machine Learning

Finite mixture models are among the most popular statistical models used in different data science disciplines. Despite their broad applicability, inference under these models typically leads to computationally challenging non-convex problems. While the Expectation-Maximization (EM) algorithm is the most popular approach for solving these non-convex problems, the behavior of this algorithm is not well understood. In this work, we focus on the case of mixture of Laplacian (or Gaussian) distribution. We start by analyzing a simple equally weighted mixture of two single dimensional Laplacian distributions and show that every local optimum of the population maximum likelihood estimation problem is globally optimal. Then, we prove that the EM algorithm converges to the ground truth parameters almost surely with random initialization. Our result extends the existing results for Gaussian distribution to Laplacian distribution. Then we numerically study the behavior of mixture models with more than two components. Motivated by our extensive numerical experiments, we propose a novel stochastic method for estimating the mean of components of a mixture model. Our numerical experiments show that our algorithm outperforms the Naive EM algorithm in almost all scenarios.

Robust EM algorithm for model-based curve clustering

arXiv.org Machine Learning

Model-based clustering approaches concern the paradigm of exploratory data analysis relying on the finite mixture model to automatically find a latent structure governing observed data. They are one of the most popular and successful approaches in cluster analysis. The mixture density estimation is generally performed by maximizing the observed-data log-likelihood by using the expectation-maximization (EM) algorithm. However, it is well-known that the EM algorithm initialization is crucial. In addition, the standard EM algorithm requires the number of clusters to be known a priori. Some solutions have been provided in [31, 12] for model-based clustering with Gaussian mixture models for multivariate data. In this paper we focus on model-based curve clustering approaches, when the data are curves rather than vectorial data, based on regression mixtures. We propose a new robust EM algorithm for clustering curves. We extend the model-based clustering approach presented in [31] for Gaussian mixture models, to the case of curve clustering by regression mixtures, including polynomial regression mixtures as well as spline or B-spline regressions mixtures. Our approach both handles the problem of initialization and the one of choosing the optimal number of clusters as the EM learning proceeds, rather than in a two-fold scheme. This is achieved by optimizing a penalized log-likelihood criterion. A simulation study confirms the potential benefit of the proposed algorithm in terms of robustness regarding initialization and funding the actual number of clusters.

Benefits of over-parameterization with EM

Neural Information Processing Systems

Expectation Maximization (EM) is among the most popular algorithms for maximum likelihood estimation, but it is generally only guaranteed to find its stationary points of the log-likelihood objective. The goal of this article is to present theoretical and empirical evidence that over-parameterization can help EM avoid spurious local optima in the log-likelihood. We consider the problem of estimating the mean vectors of a Gaussian mixture model in a scenario where the mixing weights are known. Our study shows that the global behavior of EM, when one uses an over-parameterized model in which the mixing weights are treated as unknown, is better than that when one uses the (correct) model with the mixing weights fixed to the known values. For symmetric Gaussians mixtures with two components, we prove that introducing the (statistically redundant) weight parameters enables EM to find the global maximizer of the log-likelihood starting from almost any initial mean parameters, whereas EM without this over-parameterization may very often fail. For other Gaussian mixtures, we provide empirical evidence that shows similar behavior. Our results corroborate the value of over-parameterization in solving non-convex optimization problems, previously observed in other domains.