
Collaborating Authors

 Rinaldo, Alessandro


Statistical Inference for Cluster Trees

Neural Information Processing Systems

A cluster tree provides an intuitive summary of a density function that reveals essential structure about its high-density clusters. In practice, the true cluster tree must be estimated from a finite sample drawn from an unknown density. This paper addresses the basic question of quantifying our uncertainty by assessing the statistical significance of different features of an empirical cluster tree. We first study a variety of metrics that can be used to compare different trees, analyzing their properties and assessing their suitability for our inference task. We then propose methods to construct and summarize confidence sets for the unknown true cluster tree. We introduce a partial ordering on cluster trees which we use to prune some of the statistically insignificant features of the empirical tree, yielding interpretable and parsimonious cluster trees. Finally, we provide a variety of simulations to illustrate our proposed methods and further demonstrate their utility in the analysis of a Graft-versus-Host Disease (GvHD) data set.
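
As a toy illustration of the confidence-set construction described above (a minimal sketch of my own, not the authors' code: the mixture data, grid, bootstrap size, and 95% level are all arbitrary choices), one can bootstrap the sup-norm deviation of a kernel density estimate in one dimension and treat the resulting quantile as the radius of a confidence ball around the density behind the empirical tree:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
grid = np.linspace(-5, 5, 500)
f_hat = gaussian_kde(X)(grid)  # kernel density estimate of the unknown density

B, sup_devs = 200, []
for _ in range(B):
    Xb = rng.choice(X, size=X.size, replace=True)  # bootstrap resample
    sup_devs.append(np.max(np.abs(gaussian_kde(Xb)(grid) - f_hat)))

# Radius of an approximate 95% sup-norm confidence ball around f_hat; tree
# features at a smaller scale than this are statistically insignificant.
radius = np.quantile(sup_devs, 0.95)
print(f"confidence-ball radius: {radius:.4f}")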


Statistical Analysis of Persistence Intensity Functions

arXiv.org Machine Learning

Persistence diagrams are two-dimensional plots that summarize the topological features of functions and are an important part of topological data analysis. A problem that has received much attention is how to deal with sets of persistence diagrams: how do we summarize them, average them, or cluster them? One approach -- the persistence intensity function -- was introduced informally by Edelsbrunner, Ivanov, and Karasev (2012). Here we provide a modification and formalization of this approach. Using the persistence intensity function, we can visualize multiple diagrams, perform clustering and conduct two-sample tests.
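
A minimal sketch of one natural form such an intensity function can take (my own illustration; the Gaussian kernel, bandwidth h, and toy diagrams are assumptions, not the paper's exact definition): pool the (birth, death) points of several diagrams and kernel-smooth them on a grid:

import numpy as np

def intensity(diagrams, grid_b, grid_d, h=0.1):
    """Average Gaussian-kernel intensity over a list of persistence
    diagrams, each an (m, 2) array of (birth, death) pairs."""
    bb, dd = np.meshgrid(grid_b, grid_d, indexing="ij")
    out = np.zeros_like(bb)
    for dgm in diagrams:
        for b, d in dgm:
            out += np.exp(-((bb - b) ** 2 + (dd - d) ** 2) / (2 * h ** 2))
    return out / (2 * np.pi * h ** 2 * len(diagrams))  # normalize kernels, average over diagrams

# Toy usage: two small diagrams, intensity evaluated on the unit square.
dgms = [np.array([[0.1, 0.8], [0.2, 0.4]]), np.array([[0.15, 0.75]])]
g = np.linspace(0, 1, 50)
I = intensity(dgms, g, g)
print(I.shape, round(float(I.max()), 2))

Because the output is an ordinary function on a grid, averaging, clustering, and two-sample comparisons of diagram sets reduce to standard operations on these arrays.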


Consistency of spectral clustering in stochastic block models

arXiv.org Machine Learning

We analyze the performance of spectral clustering for community extraction in stochastic block models. We show that, under mild conditions, spectral clustering applied to the adjacency matrix of the network can consistently recover hidden communities even when the order of the maximum expected degree is as small as $\log n$, with $n$ the number of nodes. This result applies to some popular polynomial time spectral clustering algorithms and is further extended to degree corrected stochastic block models using a spherical $k$-median spectral clustering method. A key component of our analysis is a combinatorial bound on the spectrum of binary random matrices, which is sharper than the conventional matrix Bernstein inequality and may be of independent interest.
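
The basic pipeline analyzed here is easy to simulate. Below is a hedged sketch (not the paper's exact algorithm or constants; the block probabilities, network size, and use of scikit-learn's KMeans are my choices) of adjacency spectral clustering on a two-block model:

import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, K = 400, 2
z = np.repeat([0, 1], n // 2)                       # hidden community labels
P = np.where(z[:, None] == z[None, :], 0.10, 0.02)  # within/between-block edge probabilities
A = np.triu(rng.random((n, n)) < P, 1).astype(float)
A = A + A.T                                         # symmetric adjacency, no self-loops

_, vecs = eigsh(A, k=K, which="LA")                 # top-K eigenvectors of the adjacency matrix
labels = KMeans(n_clusters=K, n_init=10).fit_predict(vecs)
# For K = 2, account for label switching when scoring recovery.
print(f"fraction recovered: {max(np.mean(labels == z), np.mean(labels != z)):.3f}")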


Statistical Models for Degree Distributions of Networks

arXiv.org Machine Learning

We define and study statistical models in exponential family form whose sufficient statistics are the degree distributions and the bi-degree distributions of undirected labelled simple graphs. Graphs that are constrained by the joint degree distributions are called $dK$-graphs in the computer science literature, and this paper attempts to provide the first statistically grounded analysis of this type of model. In addition to formalizing these models, we provide some preliminary results for the parameter estimation and the asymptotic behaviour of the model for degree distribution, and discuss the parameter estimation for the model for bi-degree distribution.
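
For concreteness, a small sketch of the sufficient statistic in the degree-distribution model (my own illustration; the function name is hypothetical): the vector counting how many nodes have each possible degree:

import numpy as np

def degree_distribution(A):
    """Sufficient statistic for the degree-distribution model:
    entry k = number of nodes of degree k in the simple graph A."""
    degrees = A.sum(axis=1).astype(int)
    return np.bincount(degrees, minlength=A.shape[0])

# Toy 4-node graph: the path 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(degree_distribution(A))  # -> [0 2 2 0]: two nodes of degree 1, two of degree 2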


Cluster Trees on Manifolds

Neural Information Processing Systems

We investigate the problem of estimating the cluster tree for a density $f$ supported on or near a smooth $d$-dimensional manifold $M$ isometrically embedded in $\mathbb{R}^D$. We study a $k$-nearest neighbor based algorithm recently proposed by Chaudhuri and Dasgupta. Under mild assumptions on $f$ and $M$, we obtain rates of convergence that depend on $d$ only but not on the ambient dimension $D$. We also provide a sample complexity lower bound for a natural class of clustering algorithms that use $D$-dimensional neighborhoods.
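
A condensed sketch of the k-nearest-neighbor scheme in the style of Chaudhuri and Dasgupta's robust single linkage (my own rendering with assumed parameter names, not the exact algorithm analyzed in the paper): each point becomes active at scale r once its k-NN radius falls below r, active points within distance alpha * r are linked, and the nested components across scales trace out the cluster tree:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def knn_cluster_levels(X, k=5, alpha=np.sqrt(2)):
    D = squareform(pdist(X))
    r_k = np.sort(D, axis=1)[:, k]             # distance to the k-th nearest neighbor
    parent = list(range(len(X)))
    def find(i):                               # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    levels = []
    for r in np.sort(r_k):                     # sweep the scale parameter upward
        idx = np.where(r_k <= r)[0]            # points active at scale r
        for a in idx:
            for b in idx:
                if a < b and D[a, b] <= alpha * r:
                    parent[find(a)] = find(b)  # link nearby active points
        levels.append((r, len({find(i) for i in idx})))
    return levels                              # (scale, number of clusters) pairs

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in (-2, 2)])
for r, c in knn_cluster_levels(X)[::25]:
    print(f"r = {r:.3f}: {c} clusters")

The point of the paper's analysis is that for data on a d-dimensional manifold, the scales at which this sweep recovers the true clusters depend on d, not on the ambient dimension D.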


Estimating Undirected Graphs Under Weak Assumptions

arXiv.org Machine Learning

We consider the problem of providing nonparametric confidence guarantees for undirected graphs under weak assumptions. In particular, we do not assume sparsity, incoherence or Normality. We allow the dimension $D$ to increase with the sample size $n$. First, we prove lower bounds showing that if we want accurate inferences under weak assumptions, then there are limitations on the dimension as a function of sample size. When the dimension increases slowly with sample size, we show that methods based on Normal approximations and on the bootstrap lead to valid inferences, and we provide Berry-Esseen bounds on the accuracy of the Normal approximation. When the dimension is large relative to sample size, accurate inferences for graphs under weak assumptions are not possible. Instead we propose to estimate something less demanding than the entire partial correlation graph. In particular, we consider: cluster graphs, restricted partial correlation graphs and correlation graphs.
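
In the slowly-growing-dimension regime, the bootstrap procedure can be caricatured as follows (a simplification of my own: percentile intervals on raw partial correlations, without the centering and studentization behind the paper's Berry-Esseen analysis): keep the edges whose bootstrap confidence interval excludes zero:

import numpy as np

def partial_corr(X):
    Omega = np.linalg.inv(np.cov(X, rowvar=False))  # precision matrix
    d = np.sqrt(np.diag(Omega))
    return -Omega / np.outer(d, d)                  # partial correlation matrix

rng = np.random.default_rng(3)
n, D = 500, 5
X = rng.multivariate_normal(np.zeros(D), np.eye(D), size=n)
X[:, 1] += 0.5 * X[:, 0]                            # plant one true conditional dependence

B = 200
boots = np.array([partial_corr(X[rng.integers(0, n, n)]) for _ in range(B)])
lo, hi = np.quantile(boots, [0.025, 0.975], axis=0)
edges = (lo > 0) | (hi < 0)                         # keep edges whose CI excludes zero
np.fill_diagonal(edges, False)
print(np.argwhere(np.triu(edges)))                  # estimated edge set; the planted edge (0, 1) should appear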


DeBaCl: A Python Package for Interactive DEnsity-BAsed CLustering

arXiv.org Machine Learning

The level set tree approach of Hartigan (1975) provides a probabilistically based and highly interpretable encoding of the clustering behavior of a dataset. By representing the hierarchy of data modes as a dendrogram of the level sets of a density estimator, this approach offers many advantages for exploratory analysis and clustering, especially for complex and high-dimensional data. Several R packages exist for level set tree estimation, but their practical usefulness is limited by computational inefficiency, absence of interactive graphical capabilities and, from a theoretical perspective, reliance on asymptotic approximations. To make it easier for practitioners to capture the advantages of level set trees, we have written the Python package DeBaCl for DEnsity-BAsed CLustering. In this article we illustrate how DeBaCl's level set tree estimates can be used for difficult clustering tasks and interactive graphical data analysis. The package is intended to promote the practical use of level set trees through improvements in computational efficiency and a high degree of user customization. In addition, the flexible algorithms implemented in DeBaCl enjoy finite sample accuracy, as demonstrated in recent literature on density clustering. Finally, we show that the level set tree framework can be easily extended to deal with functional data.
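
Rather than guess at DeBaCl's API, here is a hand-rolled illustration of the level set tree idea it implements (all tuning choices, the crude k-NN density estimate, and the connectivity radius eps are mine): threshold a density estimate at increasing levels and count connected components of the surviving points:

import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def level_set_components(X, k=10, n_levels=5, eps=0.5):
    r_k = cKDTree(X).query(X, k=k + 1)[0][:, -1]
    density = k / (len(X) * r_k ** X.shape[1])  # crude k-NN density estimate (up to constants)
    for lam in np.quantile(density, np.linspace(0.1, 0.9, n_levels)):
        keep = np.where(density >= lam)[0]      # upper level set {density >= lam}
        pairs = cKDTree(X[keep]).query_pairs(eps, output_type="ndarray")
        adj = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                         shape=(len(keep), len(keep)))
        n_comp, _ = connected_components(adj, directed=False)
        print(f"level {lam:.3f}: {len(keep)} points, {n_comp} clusters")

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 0.4, (100, 2)), rng.normal(2, 0.4, (100, 2))])
level_set_components(X)

Stacking these per-level component counts, with the inclusion relations between components at successive levels, is exactly the dendrogram structure the package exposes.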


Tight Lower Bounds for Homology Inference

arXiv.org Machine Learning

The homology groups of a manifold are important topological invariants that provide an algebraic summary of the manifold. These groups contain rich topological information, for instance, about the connected components, holes, tunnels and sometimes the dimension of the manifold. In earlier work, we have considered the statistical problem of estimating the homology of a manifold from noiseless samples and from noisy samples under several different noise models. We derived upper and lower bounds on the minimax risk for this problem. In this note we revisit the noiseless case. In previous work we used Le Cam's lemma to establish a lower bound that differed from the upper bound of Niyogi, Smale and Weinberger by a polynomial factor in the condition number. In this note we use a different construction based on a direct analysis of the likelihood ratio test to show that the upper bound of Niyogi, Smale and Weinberger is in fact tight, thus establishing rate-optimal asymptotic minimax bounds for the problem. The techniques we use here extend in a straightforward way to the noisy settings considered in our earlier work.


Cluster Trees on Manifolds

arXiv.org Machine Learning

In this paper we investigate the problem of estimating the cluster tree for a density $f$ supported on or near a smooth $d$-dimensional manifold $M$ isometrically embedded in $\mathbb{R}^D$. We analyze a modified version of a $k$-nearest neighbor based algorithm recently proposed by Chaudhuri and Dasgupta. The main results of this paper show that under mild assumptions on $f$ and $M$, we obtain rates of convergence that depend on $d$ only but not on the ambient dimension $D$. We also show that similar (albeit non-algorithmic) results can be obtained for kernel density estimators. We sketch a construction of a sample complexity lower bound instance for a natural class of manifold-oblivious clustering algorithms. We further briefly consider the known-manifold case and show that in this case a spatially adaptive algorithm achieves better rates.


Recovering Block-structured Activations Using Compressive Measurements

arXiv.org Machine Learning

We consider the problems of detection and localization of a contiguous block of weak activation in a large matrix, from a small number of noisy, possibly adaptive, compressive (linear) measurements. This is closely related to the problem of compressed sensing, where the task is to estimate a sparse vector using a small number of linear measurements. Contrary to results in compressed sensing, where it has been shown that neither adaptivity nor contiguous structure help much, we show that for reliable localization the magnitude of the weakest signals is strongly influenced by both structure and the ability to choose measurements adaptively, whereas for detection neither adaptivity nor structure reduces the requirement on the magnitude of the signal. We characterize the precise tradeoffs between the various problem parameters, the signal strength and the number of measurements required to reliably detect and localize the block of activation. The sufficient conditions are complemented with information-theoretic lower bounds.
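
A toy simulation of the detection side of this problem (entirely my own sketch: Gaussian sensing matrices, an energy test statistic, and all parameter values are assumptions, not the paper's procedure) shows how a contiguous active block inflates the measurement energy:

import numpy as np

rng = np.random.default_rng(5)
n, m, mu = 64, 100, 0.8                    # matrix side, number of measurements, signal strength

def measure(M, sensing):
    # y_i = <A_i, M> + standard Gaussian noise (non-adaptive sensing)
    return np.array([(A * M).sum() for A in sensing]) + rng.normal(0, 1, len(sensing))

sensing = rng.normal(0, 1, (m, n, n))      # random Gaussian sensing matrices
M0 = np.zeros((n, n))                      # null: no activation
M1 = np.zeros((n, n))
M1[10:20, 30:40] = mu                      # alternative: one contiguous active block

y0, y1 = measure(M0, sensing), measure(M1, sensing)
# Under the null, sum(y**2) concentrates around m; the active block inflates it.
print(f"null energy: {np.sum(y0**2):.0f}, alternative energy: {np.sum(y1**2):.0f}")

Localization, by contrast, would require deciding which block is active, and this is where the paper shows that adaptive measurement design and the contiguity of the block substantially lower the required signal strength.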