Arias-Castro, Ery
Minimax Optimality of Classical Scaling Under General Noise Conditions
Vishwanath, Siddharth, Arias-Castro, Ery
We establish the consistency of classical scaling under a broad class of noise models, encompassing many commonly studied cases in the literature. Our approach requires only finite fourth moments of the noise, significantly weakening standard assumptions. We derive convergence rates for classical scaling and establish matching minimax lower bounds, demonstrating that classical scaling achieves minimax optimality in recovering the true configuration even when the input dissimilarities are corrupted by noise.
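For context, here is a minimal numpy sketch of classical scaling (the double-centering and eigendecomposition procedure the abstract refers to); the function and variable names are illustrative, not taken from the paper.

    import numpy as np

    def classical_scaling(D, d):
        """Embed n items in R^d from an n x n matrix D of pairwise dissimilarities."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                # double-centered squared dissimilarities
        vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
        top = np.argsort(vals)[::-1][:d]           # keep the d largest eigenpairs
        lam = np.clip(vals[top], 0, None)          # guard against small negative eigenvalues
        return vecs[:, top] * np.sqrt(lam)         # n x d point configuration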
Graph Max Shift: A Hill-Climbing Method for Graph Clustering
Arias-Castro, Ery, Coda, Elizabeth, Qiao, Wanli
A hill-climbing algorithm is typically understood as an algorithm that makes 'local' moves. In a sense, this class of procedures is the discrete analog of the class of gradient-based and higher-order methods in continuous optimization. Such algorithms have been proposed in the context of graph partitioning, sometimes as a refinement step, where the objective function is typically a notion of cut and local moves often take the form of swapping vertices in order to improve the value of the objective function. More specifically, consider an undirected graph with n nodes, which we take to be [n] := {1, ..., n} without loss of generality, and with adjacency matrix A = (a_{ij}).
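As a schematic illustration of the kind of 'local moves' described above (and not of the Graph Max Shift algorithm itself), here is a toy greedy routine that improves a two-way cut by moving one vertex at a time; all names are illustrative.

    import numpy as np

    def hill_climb_cut(A, labels, max_sweeps=100):
        """Greedy local search: flip a vertex to the other side of a two-way
        partition (labels in {0, 1}) whenever doing so strictly decreases the cut."""
        labels = labels.copy()
        for _ in range(max_sweeps):
            improved = False
            for i in range(len(labels)):
                same = A[i, labels == labels[i]].sum() - A[i, i]   # edges within i's side
                cross = A[i, labels != labels[i]].sum()            # edges crossing the cut
                if cross > same:                                   # moving i lowers the cut
                    labels[i] = 1 - labels[i]
                    improved = True
            if not improved:
                break
        return labels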
An Axiomatic Definition of Hierarchical Clustering
Arias-Castro, Ery, Coda, Elizabeth
Clustering, informally understood as the grouping of data, is a central task in statistics and computer science with broad applications. Modern clustering algorithms originated in the work of numerical taxonomists, who developed methods to identify hierarchical structures in the classification of plant and animal species. Since then, clustering has been used in disciplines such as medicine, astronomy, anthropology, and economics, with aims such as exploratory analysis, data summarization, the identification of salient structures in data, and information organization. The notion of a "good" or "accurate" clustering varies between fields and applications. For example, to some computer scientists, the correct clustering of a dataset is often defined as the solution to an optimization problem (think K-means), and a good algorithm either solves or approximates a solution to this problem, ideally with some guarantees [14, 31].
On the Selection of Tuning Parameters for Patch-Stitching Embedding Methods
Arias-Castro, Ery, Chau, Phong Alain
In the general problem known as multidimensional scaling (MDS), the primary objective is to represent a set of items as points within a Euclidean space of a specified dimension. This representation should ideally preserve the given pairwise dissimilarities as accurately as possible, by ensuring that the Euclidean distances between these points mirror the original dissimilarities. MDS is an extensively researched problem arising in diverse fields such as psychometrics [16], mathematics and computer science [9, 14, 57], engineering (where it is also known as network localization) [61], statistics [3, 71], and machine learning [43, Ch 14]. Dimensionality reduction (DR) aims at embedding data points in a Euclidean space into a lower-dimensional Euclidean space while preserving, as much as possible, the geometry of the point cloud [36, 59]. When the data points are assumed to lie on or near a smooth submanifold (a variant of DR known as manifold learning), this typically means preserving the pairwise intrinsic distances as faithfully as possible.
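To make the objective concrete, here is a toy numpy sketch of metric MDS by direct minimization of the stress, i.e., the squared discrepancy between embedded distances and the given dissimilarities; this is a generic stand-in, not the patch-stitching methods studied in the paper.

    import numpy as np

    def stress(X, D):
        """Raw stress of an embedding X (n x d) against dissimilarities D (n x n)."""
        G = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        return np.sum((G - D) ** 2) / 2

    def mds_gradient_descent(D, d=2, steps=500, lr=1e-3, seed=0):
        """Crude gradient descent on the stress, starting from a random configuration."""
        rng = np.random.default_rng(seed)
        X = rng.normal(size=(D.shape[0], d))
        for _ in range(steps):
            diff = X[:, None, :] - X[None, :, :]
            G = np.linalg.norm(diff, axis=-1)
            np.fill_diagonal(G, 1.0)                 # avoid dividing by zero on the diagonal
            W = (G - D) / G
            np.fill_diagonal(W, 0.0)                 # the diagonal does not contribute
            X -= lr * 2 * (W[:, :, None] * diff).sum(axis=1)
        return X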
Level Sets or Gradient Lines? A Unifying View of Modal Clustering
Arias-Castro, Ery, Qiao, Wanli
Up until the 1970s, there were two main ways of clustering points in space. One of them, perhaps pioneered by Pearson [44], was to fit a (usually Gaussian) mixture to the data and, that being done, classify each data point -- as well as any other point available at a later date -- according to the most likely component in the mixture. The other was based on a direct partitioning of the space, most notably by minimization of the average minimum squared distance to a center: the K-means problem, whose computational difficulty led to a number of famous algorithms [22, 31, 36, 37, 39] and likely played a role in motivating the development of hierarchical clustering [21, 25, 54, 63]. In the 1970s, two decidedly nonparametric approaches to clustering were proposed, both based on the topography given by the population density. Of course, in practice, the density is estimated, often by some form of kernel density estimation.
Minimax Estimation of Distances on a Surface and Minimax Manifold Learning in the Isometric-to-Convex Setting
Arias-Castro, Ery, Chau, Phong Alain
The estimation of shortest paths and intrinsic distances on surfaces is a fundamental problem in computational geometry with wide-ranging applications. In motion planning, shortest paths represent resource-efficient sequences of actions to be undertaken by the agent in some given configuration space [57, 56]. In addition to the clear applications to robot locomotion and manipulation, this framework has borne fruit in the field of biology, wherein proteins and folding networks are of great interest [2, 76]. In cluster analysis, geodesic distances have found use as a similarity metric to create partitions that respect the underlying geometry [51, 65, 58]. In manifold learning, the Isometric Feature Mapping (Isomap) algorithm crucially depends on the approximation of geodesic distances on the underlying surface [75], as does another important algorithm, Maximum Variance Unfolding (MVU) [79], although in disguise [64, 12]. This is closely related to the estimation of distances for the purpose of embedding a (weighted) graph (aka multidimensional scaling with missing distances), with one of the first methods suggested for that purpose being the use of graph distances [55, 71, 70, 62].
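A minimal sketch of the graph-distance approach alluded to at the end of this paragraph (estimating geodesic distances by shortest paths in a neighborhood graph, as in Isomap); the epsilon-graph construction and the names are illustrative.

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.sparse.csgraph import shortest_path

    def graph_geodesics(X, eps):
        """Estimate intrinsic distances among the rows of X by shortest paths
        in the epsilon-neighborhood graph built from ambient distances."""
        D = cdist(X, X)                      # ambient Euclidean distances
        W = np.where(D <= eps, D, 0.0)       # zero entries are read as missing edges
        return shortest_path(W, method="D", directed=False)   # Dijkstra on the graph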
Perturbation Bounds for Procrustes, Classical Scaling, and Trilateration, with Applications to Manifold Learning
Arias-Castro, Ery, Javanmard, Adel, Pelletier, Bruno
One of the common tasks in unsupervised learning is dimensionality reduction, where the goal is to find meaningful low-dimensional structures hidden in high-dimensional data. Sometimes referred to as manifold learning, this problem is closely related to the problem of localization, which aims at embedding a weighted graph into a low-dimensional Euclidean space. Several methods have been proposed for localization, and also for manifold learning. Nonetheless, the robustness properties of most of them are little understood. In this paper, we obtain perturbation bounds for classical scaling and trilateration, which are then applied to derive performance bounds for Isomap, Landmark Isomap, and Maximum Variance Unfolding. A new perturbation bound for Procrustes analysis plays a key role.
A Simple Approach to Sparse Clustering
Arias-Castro, Ery, Pu, Xiao
Consider the problem of sparse clustering, where it is assumed that only a subset of the features are useful for clustering purposes. In the framework of the COSA method of Friedman and Meulman, subsequently improved in the form of the Sparse K-means method of Witten and Tibshirani, a natural and simpler hill-climbing approach is introduced. The new method is shown to be competitive with these two methods and others.
Community Detection in Sparse Random Networks
Arias-Castro, Ery, Verzelen, Nicolas
We consider the problem of detecting a tight community in a sparse random network. This is formalized as testing for the existence of a dense random subgraph in a random graph. Under the null hypothesis, the graph is a realization of an Erdős-Rényi graph on $N$ vertices and with connection probability $p_0$; under the alternative, there is an unknown subgraph on $n$ vertices where the connection probability is $p_1 > p_0$. In Arias-Castro and Verzelen (2012), we focused on the asymptotically dense regime where $p_0$ is large enough that $n p_0 > (n/N)^{o(1)}$. We consider here the asymptotically sparse regime where $p_0$ is small enough that $n p_0 < (n/N)^{c_0}$ for some $c_0 > 0$. As before, we derive information-theoretic lower bounds and also establish the performance of various tests. Compared to our previous work, the arguments for the lower bounds are based on the same technology but are substantially more technical in the details; also, the methods we study are different: besides a variant of the scan statistic, we study other statistics such as the size of the largest connected component, the number of triangles, the eigengap of the adjacency matrix, etc. Our detection bounds are sharp, except in the Poisson regime, where we were not able to fully characterize the constant arising in the bound.
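For illustration, here are toy versions of two of the statistics mentioned (the triangle count and the size of the largest connected component), computed from a 0/1 adjacency matrix; calibration of the corresponding tests is of course omitted.

    import numpy as np
    from scipy.sparse.csgraph import connected_components

    def triangle_count(A):
        """Number of triangles in an undirected simple graph: tr(A^3) / 6."""
        return int(round(np.trace(A @ A @ A) / 6))

    def largest_component_size(A):
        """Size of the largest connected component of the graph."""
        _, labels = connected_components(A, directed=False)
        return int(np.bincount(labels).max())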
On the convergence of maximum variance unfolding
Arias-Castro, Ery, Pelletier, Bruno
Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent.
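Maximum Variance Unfolding can be written as a small semidefinite program; below is a compact sketch using cvxpy (assumed available), maximizing the variance of a centered Gram matrix subject to local isometry constraints. The names, the neighbor-pair input, and the solver defaults are illustrative.

    import numpy as np
    import cvxpy as cp

    def mvu(D, neighbor_pairs, dim=2):
        """MVU: maximize tr(K) over centered PSD Gram matrices K that preserve
        the given distances D[i, j] on the neighboring pairs (i, j)."""
        n = D.shape[0]
        K = cp.Variable((n, n), PSD=True)
        constraints = [cp.sum(K) == 0]                                   # centering
        for i, j in neighbor_pairs:                                      # local isometry
            constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == D[i, j] ** 2)
        cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()
        vals, vecs = np.linalg.eigh(K.value)                             # spectral embedding
        order = np.argsort(vals)[::-1][:dim]
        return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))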