Xu, Jiaming
Statistical-Computational Tradeoffs in Planted Problems and Submatrix Localization with a Growing Number of Clusters and Submatrices
Chen, Yudong, Xu, Jiaming
We consider two closely related problems: planted clustering and submatrix localization. The planted clustering problem assumes that a random graph is generated based on some underlying clusters of the nodes; the task is to recover these clusters given the graph. The submatrix localization problem concerns locating hidden submatrices with elevated means inside a large real-valued random matrix. Of particular interest is the setting where the number of clusters/submatrices is allowed to grow unbounded with the problem size. These formulations cover several classical models such as planted clique, planted densest subgraph, planted partition, planted coloring, and stochastic block model, which are widely used for studying community detection and clustering/bi-clustering. For both problems, we show that the space of the model parameters (cluster/submatrix size, cluster density, and submatrix mean) can be partitioned into four disjoint regions corresponding to decreasing statistical and computational complexities: (1) the \emph{impossible} regime, where all algorithms fail; (2) the \emph{hard} regime, where the computationally expensive Maximum Likelihood Estimator (MLE) succeeds; (3) the \emph{easy} regime, where the polynomial-time convexified MLE succeeds; (4) the \emph{simple} regime, where a simple counting/thresholding procedure succeeds. Moreover, we show that each of these algorithms provably fails in the previous harder regimes. Our theorems establish the minimax recovery limits, which are tight up to constants and hold with a growing number of clusters/submatrices, and provide stronger performance guarantees than previously known for polynomial-time algorithms. Our study demonstrates the tradeoffs between statistical and computational considerations, and suggests that the minimax recovery limit may not be achievable by polynomial-time algorithms.
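The "simple counting/thresholding procedure" mentioned above can be illustrated with a minimal sketch. It is not the procedure analyzed in the paper. It assumes a single planted cluster, known parameters, and a threshold placed midway between the two expected degrees. The names `planted_cluster_graph` and `threshold_degrees` and the parameter values are hypothetical.

```python
import random

def planted_cluster_graph(n, k, p, q, rng):
    """Sample adjacency sets; nodes 0..k-1 form the planted cluster (assumed layout)."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Edge probability p inside the cluster, q for every other pair.
            prob = p if (i < k and j < k) else q
            if rng.random() < prob:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def threshold_degrees(adj, k, p, q):
    """Declare a node a cluster member if its degree exceeds the midpoint threshold."""
    n = len(adj)
    expected_in = (k - 1) * p + (n - k) * q   # expected degree of a cluster node
    expected_out = (n - 1) * q                # expected degree of a non-cluster node
    tau = (expected_in + expected_out) / 2    # assumed threshold choice
    return {v for v in range(n) if len(adj[v]) > tau}

# Illustrative parameters, chosen so the two degree distributions are well separated.
rng = random.Random(0)
adj = planted_cluster_graph(200, 60, 0.9, 0.1, rng)
recovered = threshold_degrees(adj, 60, 0.9, 0.1)
```

With a large, dense cluster the two degree distributions barely overlap, so counting alone suffices. As the cluster shrinks or the densities approach each other, the degrees overlap and this kind of procedure stops working, which is the intuition behind the harder regimes.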
Computational Lower Bounds for Community Detection on Random Graphs
Hajek, Bruce, Wu, Yihong, Xu, Jiaming
This paper studies the problem of detecting the presence of a small dense community planted in a large Erd\H{o}s-R\'enyi random graph $\mathcal{G}(N,q)$, where the edge probability within the community exceeds $q$ by a constant factor. Assuming the hardness of the planted clique detection problem, we show that the computational complexity of detecting the community exhibits the following phase transition phenomenon: As the graph size $N$ grows and the graph becomes sparser according to $q=N^{-\alpha}$, there exists a critical value of $\alpha = \frac{2}{3}$, below which there exists a computationally intensive procedure that can detect far smaller communities than any computationally efficient procedure, and above which a linear-time procedure is statistically optimal. The results also lead to the average-case hardness results for recovering the dense community and approximating the densest $K$-subgraph.
Reconstruction in the Labeled Stochastic Block Model
Lelarge, Marc, Massoulié, Laurent, Xu, Jiaming
The labeled stochastic block model is a random graph model representing networks with community structure and interactions of multiple types. In its simplest form, it consists of two communities of approximately equal size, and the edges are drawn and labeled at random with probability depending on whether their two endpoints belong to the same community or not. It has been conjectured in \cite{Heimlicher12} that correlated reconstruction (i.e.\ identification of a partition correlated with the true partition into the underlying communities) would be feasible if and only if a model parameter exceeds a threshold. We prove one half of this conjecture, i.e., reconstruction is impossible when below the threshold. In the positive direction, we introduce a weighted graph to exploit the label information. With a suitable choice of weight function, we show that when above the threshold by a specific constant, reconstruction is achieved by (1) minimum bisection, (2) a semidefinite relaxation of minimum bisection, and (3) a spectral method combined with removal of edges incident to vertices of high degree. Furthermore, we show that hypothesis testing between the labeled stochastic block model and the labeled Erd\H{o}s-R\'enyi random graph model exhibits a phase transition at the conjectured reconstruction threshold.
Minimax-optimal Inference from Partial Rankings
Hajek, Bruce, Oh, Sewoong, Xu, Jiaming
This paper studies the problem of rank aggregation under the Plackett-Luce model. The goal is to infer a global ranking and related scores of the items, based on partial rankings provided by multiple users over multiple subsets of items. A question of particular interest is how to optimally assign items to users for ranking and how many item assignments are needed to achieve a target estimation error. Without any assumptions on how the items are assigned to users, we derive an oracle lower bound and the Cram\'er-Rao lower bound on the estimation error. We prove an upper bound on the estimation error achieved by the maximum likelihood estimator, and show that both the upper bound and the Cram\'er-Rao lower bound inversely depend on the spectral gap of the Laplacian of an appropriately defined comparison graph. Since random comparison graphs are known to have large spectral gaps, this suggests the use of random assignments when the assignment can be controlled. Precisely, the matching oracle lower bound and the upper bound on the estimation error imply that the maximum likelihood estimator together with a random assignment is minimax-optimal up to a logarithmic factor. We further analyze a popular rank-breaking scheme that decomposes partial rankings into pairwise comparisons. We show that even if one applies the mismatched maximum likelihood estimator that assumes independence (on pairwise comparisons that are now dependent due to rank-breaking), minimax optimal performance is still achieved up to a logarithmic factor.
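Rank-breaking, as described above, turns each partial ranking into pairwise comparisons. The sketch below assumes "full" breaking, where every ordered pair in a ranking becomes one comparison. The abstract does not say which breaking scheme is analyzed, and the function names `rank_break` and `win_counts` are hypothetical.

```python
from itertools import combinations

def rank_break(ranking):
    """Return (winner, loser) pairs implied by a ranking listed best-first."""
    # combinations preserves input order, so the first element of each pair ranks higher.
    return list(combinations(ranking, 2))

def win_counts(rankings, n_items):
    """Aggregate broken comparisons: wins[i][j] counts how often item i beat item j."""
    wins = [[0] * n_items for _ in range(n_items)]
    for ranking in rankings:
        for winner, loser in rank_break(ranking):
            wins[winner][loser] += 1
    return wins

# Three users rank different subsets of items {0, 1, 2}.
rankings = [[2, 0, 1], [0, 1], [2, 1]]
pairs = rank_break(rankings[0])
wins = win_counts(rankings, 3)
```

The pairs from one ranking are statistically dependent, because they all come from the same draw. This is why an estimator that treats them as independent is called mismatched in the abstract.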
Edge Label Inference in Generalized Stochastic Block Models: from Spectral Theory to Impossibility Results
Xu, Jiaming, Massoulié, Laurent, Lelarge, Marc
The classical setting of community detection consists of networks exhibiting a clustered structure. To more accurately model real systems we consider a class of networks (i) whose edges may carry labels and (ii) which may lack a clustered structure. Specifically we assume that nodes possess latent attributes drawn from a general compact space and edges between two nodes are randomly generated and labeled according to some unknown distribution as a function of their latent attributes. Our goal is then to infer the edge label distributions from a partially observed network. We propose a computationally efficient spectral algorithm and show it allows for asymptotically correct inference even when the average node degree is as low as logarithmic in the total number of nodes. Conversely, if the average node degree is below a specific constant threshold, we show that no algorithm can achieve better inference than guessing without using the observations. As a byproduct of our analysis, we show that our model provides a general procedure to construct random graph models with a spectrum asymptotic to a pre-specified eigenvalue distribution such as a power-law distribution.
Minimax-optimal Inference from Partial Rankings
Hajek, Bruce, Oh, Sewoong, Xu, Jiaming
This paper studies the problem of inferring a global preference based on the partial rankings provided by many users over different subsets of items according to the Plackett-Luce model. A question of particular interest is how to optimally assign items to users for ranking and how many item assignments are needed to achieve a target estimation error. For a given assignment of items to users, we first derive an oracle lower bound of the estimation error that holds even for the more general Thurstone models. Then we show that the Cram\'er-Rao lower bound and our upper bounds inversely depend on the spectral gap of the Laplacian of an appropriately defined comparison graph. When the system is allowed to choose the item assignment, we propose a random assignment scheme. Our oracle lower bound and upper bounds imply that it is minimax-optimal up to a logarithmic factor among all assignment schemes and the lower bound can be achieved by the maximum likelihood estimator as well as popular rank-breaking schemes that decompose partial rankings into pairwise comparisons. The numerical experiments corroborate our theoretical findings.
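For reference, the standard form of the Plackett-Luce model named above can be written as follows. The notation is an assumption and is not taken from the abstract: $\theta_i$ is the latent score of item $i$, and $\sigma$ is a user's ranking of $k$ assigned items, listed best-first.

```latex
\[
  \mathbb{P}_\theta(\sigma)
  \;=\;
  \prod_{j=1}^{k-1}
  \frac{\exp\!\big(\theta_{\sigma(j)}\big)}
       {\sum_{l=j}^{k} \exp\!\big(\theta_{\sigma(l)}\big)}
\]
```

Each factor is the probability that item $\sigma(j)$ is chosen first among the items not yet ranked. For $k = 2$ the expression reduces to a single pairwise comparison probability.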
Jointly Clustering Rows and Columns of Binary Matrices: Algorithms and Trade-offs
Xu, Jiaming, Wu, Rui, Zhu, Kai, Hajek, Bruce, Srikant, R., Ying, Lei
In standard clustering problems, data points are represented by vectors, and by stacking them together, one forms a data matrix with row or column cluster structure. In this paper, we consider a class of binary matrices, arising in many applications, which exhibit both row and column cluster structure, and our goal is to exactly recover the underlying row and column clusters by observing only a small fraction of noisy entries. We first derive a lower bound on the minimum number of observations needed for exact cluster recovery. Then, we propose three algorithms with different running times and compare the number of observations each needs for successful cluster recovery. Our analytical results show smooth time-data trade-offs: one can gradually reduce the computational complexity when increasingly more observations are available.