Montanari, Andrea
Contextual Stochastic Block Models
Deshpande, Yash, Sen, Subhabrata, Montanari, Andrea, Mossel, Elchanan
We provide the first information theoretical tight analysis for inference of latent community structure given a sparse graph along with high dimensional node covariates, correlated with the same latent communities. Our work bridges recent theoretical breakthroughs in detection of latent community structure without nodes covariates and a large body of empirical work using diverse heuristics for combining node covariates with graphs for inference. The tightness of our analysis implies in particular, the information theoretic necessity of combining the different sources of information. Our analysis holds for networks of large degrees as well as for a Gaussian version of the model.
A Mean Field View of the Landscape of Two-Layers Neural Networks
Mei, Song, Montanari, Andrea, Nguyen, Phan-Minh
Multi-layer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires to optimize a non-convex high-dimensional objective (risk function), a problem which is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case, does this happen because local minima are absent, or because SGD somehow avoids them? In the second, why do local minima reached by SGD have good generalization properties? In this paper we consider a simple case, namely two-layers neural networks, and prove that -in a suitable scaling limit- SGD dynamics is captured by a certain non-linear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples, and show how DD can be used to prove convergence of SGD to networks with nearlyideal generalization error. This description allows to 'average-out' some of the complexities of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.
On the Connection Between Learning Two-Layers Neural Networks and Tensor Decomposition
Mondelli, Marco, Montanari, Andrea
We establish connections between the problem of learning a two-layers neural network with good generalization error and tensor decomposition. We consider a model with input $\boldsymbol x \in \mathbb R^d$, $r$ hidden units with weights $\{\boldsymbol w_i\}_{1\le i \le r}$ and output $y\in \mathbb R$, i.e., $y=\sum_{i=1}^r \sigma(\left \langle \boldsymbol x, \boldsymbol w_i\right \rangle)$, where $\sigma$ denotes the activation function. First, we show that, if we cannot learn the weights $\{\boldsymbol w_i\}_{1\le i \le r}$ accurately, then the neural network does not generalize well. More specifically, the generalization error is close to that of a trivial predictor with access only to the norm of the input. This result holds for any activation function, and it requires that the weights are roughly isotropic and the input distribution is Gaussian, which is a typical assumption in the theoretical literature. Then, we show that the problem of learning the weights $\{\boldsymbol w_i\}_{1\le i \le r}$ is at least as hard as the problem of tensor decomposition. This result holds for any input distribution and assumes that the activation function is a polynomial whose degree is related to the order of the tensor to be decomposed. By putting everything together, we prove that learning a two-layers neural network that generalizes well is at least as hard as tensor decomposition. It has been observed that neural network models with more parameters than training samples often generalize well, even if the problem is highly underdetermined. This means that the learning algorithm does not estimate the weights accurately and yet is able to yield a good generalization error. This paper shows that such a phenomenon cannot occur when the input distribution is Gaussian and the weights are roughly isotropic. We also provide numerical evidence supporting our theoretical findings.
An Instability in Variational Inference for Topic Models
Ghorbani, Behrooz, Javadi, Hamid, Montanari, Andrea
Topic models are Bayesian models that are frequently used to capture the latent structure of certain corpora of documents or images. Each data element in such a corpus (for instance each item in a collection of scientific articles) is regarded as a convex combination of a small number of vectors corresponding to `topics' or `components'. The weights are assumed to have a Dirichlet prior distribution. The standard approach towards approximating the posterior is to use variational inference algorithms, and in particular a mean field approximation. We show that this approach suffers from an instability that can produce misleading conclusions. Namely, for certain regimes of the model parameters, variational inference outputs a non-trivial decomposition into topics. However --for the same parameter values-- the data contain no actual information about the true decomposition, and hence the output of the algorithm is uncorrelated with the true topic decomposition. Among other consequences, the estimated posterior mean is significantly wrong, and estimated Bayesian credible regions do not achieve the nominal coverage. We discuss how this instability is remedied by more accurate mean field approximations.
The landscape of the spiked tensor model
Arous, Gerard Ben, Mei, Song, Montanari, Andrea, Nica, Mihai
We consider the problem of estimating a large rank-one tensor ${\boldsymbol u}^{\otimes k}\in({\mathbb R}^{n})^{\otimes k}$, $k\ge 3$ in Gaussian noise. Earlier work characterized a critical signal-to-noise ratio $\lambda_{Bayes}= O(1)$ above which an ideal estimator achieves strictly positive correlation with the unknown vector of interest. Remarkably no polynomial-time algorithm is known that achieved this goal unless $\lambda\ge C n^{(k-2)/4}$ and even powerful semidefinite programming relaxations appear to fail for $1\ll \lambda\ll n^{(k-2)/4}$. In order to elucidate this behavior, we consider the maximum likelihood estimator, which requires maximizing a degree-$k$ homogeneous polynomial over the unit sphere in $n$ dimensions. We compute the expected number of critical points and local maxima of this objective function and show that it is exponential in the dimensions $n$, and give exact formulas for the exponential growth rate. We show that (for $\lambda$ larger than a constant) critical points are either very close to the unknown vector ${\boldsymbol u}$, or are confined in a band of width $\Theta(\lambda^{-1/(k-1)})$ around the maximum circle that is orthogonal to ${\boldsymbol u}$. For local maxima, this band shrinks to be of size $\Theta(\lambda^{-1/(k-2)})$. These `uninformative' local maxima are likely to cause the failure of optimization algorithms.
Learning Combinations of Sigmoids Through Gradient Estimation
Ioannidis, Stratis, Montanari, Andrea
We develop a new approach to learn the parameters of regression models with hidden variables. In a nutshell, we estimate the gradient of the regression function at a set of random points, and cluster the estimated gradients. The centers of the clusters are used as estimates for the parameters of hidden units. We justify this approach by studying a toy model, whereby the regression function is a linear combination of sigmoids. We prove that indeed the estimated gradients concentrate around the parameter vectors of the hidden units, and provide non-asymptotic bounds on the number of required samples. To the best of our knowledge, no comparable guarantees have been proven for linear combinations of sigmoids.
Estimation of Low-Rank Matrices via Approximate Message Passing
Montanari, Andrea, Venkataramanan, Ramji
Consider the problem of estimating a low-rank symmetric matrix when its entries are perturbed by Gaussian noise, a setting that is also known as `spiked model' or `deformed Wigner matrix.' If the empirical distribution of the entries of the spikes is known, optimal estimators that exploit this knowledge can substantially outperform simple spectral approaches. Recent work characterizes the asymptotic accuracy of Bayes-optimal estimators in the high-dimensional limit. In this paper we present a practical algorithm that can achieve Bayes-optimal accuracy above the spectral threshold. A bold conjecture from statistical physics posits that no polynomial-time algorithm achieves optimal error below the same threshold (unless the best estimator is trivial). Our approach uses Approximate Message Passing (AMP) in conjunction with a spectral initialization. AMP algorithms have proved successful in a variety of statistical estimation tasks, and are amenable to exact asymptotic analysis via state evolution. Unfortunately, state evolution is uninformative when the algorithm is initialized near an unstable fixed point, as is often happens in low-rank matrix estimation problems. We develop a a new analysis of AMP that allows for spectral initializations, and builds on a decoupling between the outlier eigenvectors and the bulk in the spiked random matrix model. Our main theorem is general and applies beyond matrix estimation. However, we use it to derive detailed predictions for the problem of estimating a rank-one matrix in noise. Special cases of these problem are closely related -- via universality arguments -- to the network community detection problem for two asymmetric communities. As a further illustration, we consider the example of a block-constant low-rank matrix with symmetric blocks, which we refer to as `Gaussian Block Model'.
Inference in Graphical Models via Semidefinite Programming Hierarchies
Erdogdu, Murat A., Deshpande, Yash, Montanari, Andrea
Maximum A posteriori Probability (MAP) inference in graphical models amounts to solving a graph-structured combinatorial optimization problem. Popular inference algorithms such as belief propagation (BP) and generalized belief propagation (GBP) are intimately related to linear programming (LP) relaxation within the Sherali-Adams hierarchy. Despite the popularity of these algorithms, it is well understood that the Sum-of-Squares (SOS) hierarchy based on semidefinite programming (SDP) can provide superior guarantees. Unfortunately, SOS relaxations for a graph with $n$ vertices require solving an SDP with $n^{\Theta(d)}$ variables where $d$ is the degree in the hierarchy. In practice, for $d\ge 4$, this approach does not scale beyond a few tens of variables. In this paper, we propose binary SDP relaxations for MAP inference using the SOS hierarchy with two innovations focused on computational efficiency. Firstly, in analogy to BP and its variants, we only introduce decision variables corresponding to contiguous regions in the graphical model. Secondly, we solve the resulting SDP using a non-convex Burer-Monteiro style method, and develop a sequential rounding procedure. We demonstrate that the resulting algorithm can solve problems with tens of thousands of variables within minutes, and outperforms BP and GBP on practical problems such as image denoising and Ising spin glasses. Finally, for specific graph types, we establish a sufficient condition for the tightness of the proposed partial SOS relaxation.
Fundamental Limits of Weak Recovery with Applications to Phase Retrieval
Mondelli, Marco, Montanari, Andrea
In phase retrieval we want to recover an unknown signal $\boldsymbol x\in\mathbb C^d$ from $n$ quadratic measurements of the form $y_i = |\langle{\boldsymbol a}_i,{\boldsymbol x}\rangle|^2+w_i$ where $\boldsymbol a_i\in \mathbb C^d$ are known sensing vectors and $w_i$ is measurement noise. We ask the following weak recovery question: what is the minimum number of measurements $n$ needed to produce an estimator $\hat{\boldsymbol x}(\boldsymbol y)$ that is positively correlated with the signal $\boldsymbol x$? We consider the case of Gaussian vectors $\boldsymbol a_i$. We prove that - in the high-dimensional limit - a sharp phase transition takes place, and we locate the threshold in the regime of vanishingly small noise. For $n\le d-o(d)$ no estimator can do significantly better than random and achieve a strictly positive correlation. For $n\ge d+o(d)$ a simple spectral estimator achieves a positive correlation. Surprisingly, numerical simulations with the same spectral estimator demonstrate promising performance with realistic sensing matrices. Spectral methods are used to initialize non-convex optimization algorithms in phase retrieval, and our approach can boost the performance in this setting as well. Our impossibility result is based on classical information-theory arguments. The spectral algorithm computes the leading eigenvector of a weighted empirical covariance matrix. We obtain a sharp characterization of the spectral properties of this random matrix using tools from free probability and generalizing a recent result by Lu and Li. Both the upper and lower bound generalize beyond phase retrieval to measurements $y_i$ produced according to a generalized linear model. As a byproduct of our analysis, we compare the threshold of the proposed spectral method with that of a message passing algorithm.
Inference in Graphical Models via Semidefinite Programming Hierarchies
Erdogdu, Murat A., Deshpande, Yash, Montanari, Andrea
Maximum A posteriori Probability (MAP) inference in graphical models amounts to solving a graph-structured combinatorial optimization problem. Popular inference algorithms such as belief propagation (BP) and generalized belief propagation (GBP) are intimately related to linear programming (LP) relaxation within the Sherali-Adams hierarchy. Despite the popularity of these algorithms, it is well understood that the Sum-of-Squares (SOS) hierarchy based on semidefinite programming (SDP) can provide superior guarantees. Unfortunately, SOS relaxations for a graph with $n$ vertices require solving an SDP with $n^{\Theta(d)}$ variables where $d$ is the degree in the hierarchy. In practice, for $d\ge 4$, this approach does not scale beyond a few tens of variables. In this paper, we propose binary SDP relaxations for MAP inference using the SOS hierarchy with two innovations focused on computational efficiency. Firstly, in analogy to BP and its variants, we only introduce decision variables corresponding to contiguous regions in the graphical model. Secondly, we solve the resulting SDP using a non-convex Burer-Monteiro style method, and develop a sequential rounding procedure. We demonstrate that the resulting algorithm can solve problems with tens of thousands of variables within minutes, and outperforms BP and GBP on practical problems such as image denoising and Ising spin glasses. Finally, for specific graph types, we establish a sufficient condition for the tightness of the proposed partial SOS relaxation.