Clustering
Exact Recovery of Mangled Clusters with Same-Cluster Queries
Bressan, Marco, Cesa-Bianchi, Nicolò, Lattanzi, Silvio, Paudice, Andrea
We study the cluster recovery problem in the semi-supervised active clustering framework. Given a finite set of input points, and an oracle revealing whether any two points lie in the same cluster, our goal is to recover all clusters exactly using as few queries as possible. To this end, we relax the spherical $k$-means cluster assumption of Ashtiani et al.\ to allow for arbitrary ellipsoidal clusters with margin. This removes the assumption that the clustering is center-based (i.e., defined through an optimization problem), and includes all those cases where spherical clusters are individually transformed by any combination of rotations, axis scalings, and point deletions. We show that, even in this much more general setting, it is still possible to recover the latent clustering exactly using a number of queries that scales only logarithmically with the number of input points. More precisely, we design an algorithm that, given $n$ points to be partitioned into $k$ clusters, uses $O(k^3 \ln k \ln n)$ oracle queries and $\tilde{O}(kn + k^3)$ time to recover the clustering with zero misclassification error. The $O(\cdot)$ notation hides an exponential dependence on the dimensionality of the clusters, which we show to be necessary thus characterizing the query complexity of the problem. Our algorithm is simple, easy to implement, and can also learn the clusters using low-stretch separators, a class of ellipsoids with additional theoretical guarantees. Experiments on large synthetic datasets confirm that we can reconstruct clusterings exactly and efficiently.
Unsupervised Embedding of Hierarchical Structure in Euclidean Space
Zhao, Jinyu, Hao, Yi, Rashtchian, Cyrus
Hierarchical clustering aims to find an iterative grouping of increasingly large subsets of data, terminating when one cluster contains all of the data. This results in a tree structure, where each level determines a partition of the data. Organizing the data in such a way has many applications, including automated taxonomy construction and phylogeny reconstruction [6, 26, 54, 69, 87]. Motivated by these applications, we address the simultaneous goals of recovering an underlying clustering and building a hierarchical tree of the entire dataset. We imagine that we have access to many samples from each individual cluster (e.g., images of plants and animals), while we lack labels or any direct information about the hierarchical relationships. In this scenario, our objective is to correctly cluster the samples in each group and also build a dendrogram within and on top of the clusters that matches the natural supergroups. Despite significant effort, hierarchical clustering remains a challenging task, theoretically and empirically. Many objectives are NP-Hard [13, 14, 18, 19, 35, 59], and many theoretical algorithms may not be practical because they require solving a hard subproblem (e.g., sparsest cut) at each step [2, 4, 13, 14, 19, 51].
Why it's Super Hard to be an ML Researcher or Developer?
A few months ago, I came across this true story of a Harvard professor of political science, Gary King, who started working on the document clustering problem to give a Festschrift (a collection of writings published in honor of a scholar) to one of his colleagues, as the retirement gift. To do so, he asked his grad students to utilize every clustering algorithm ever invented. For those who know about this, clustering is a very old problem in the field of machine learning and statistics. Hence, there were plenty of methods available to apply in the literature, and they found around 250 algorithms. To compare the efficiency of all the algorithms, they coded an R package and what did they found?
Graph-based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles
Altuncu, M. Tarik, Yaliraki, Sophia N., Barahona, Mauricio
Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into `topics' that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent transformers based model Bert. This comparative work is showcased through an analysis of a corpus of US news coverage during the presidential election year of 2016.
Hypergraph Random Walks, Laplacians, and Clustering
Hayashi, Koby, Aksoy, Sinan G., Park, Cheong Hee, Park, Haesun
We propose a flexible framework for clustering hypergraph-structured data based on recently proposed random walks utilizing edge-dependent vertex weights. When incorporating edge-dependent vertex weights (EDVW), a weight is associated with each vertex-hyperedge pair, yielding a weighted incidence matrix of the hypergraph. Such weightings have been utilized in term-document representations of text data sets. We explain how random walks with EDVW serve to construct different hypergraph Laplacian matrices, and then develop a suite of clustering methods that use these incidence matrices and Laplacians for hypergraph clustering. Using several data sets from real-life applications, we compare the performance of these clustering algorithms experimentally against a variety of existing hypergraph clustering methods. We show that the proposed methods produce higher-quality clusters and conclude by highlighting avenues for future work.
SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets
Naghavi-Nozad, Sayyed-Ahmad, Haeri, Maryam Amir, Folino, Gianluigi
This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Differently from the well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the input data chunk-by-chunk within the confines of a limited memory buffer. At the first phase, a temporary clustering model is built, then it is incrementally updated by analyzing consecutive memory-loads of points. Subsequently, at the end of scalable clustering, the approximate structure of original clusters is obtained. Finally, by another scan of the entire dataset and using a suitable criterion, an outlying score is assigned to each object, which is called SDCOR (Scalable Density-based Clustering Outlierness Ratio). Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity and is more effective and efficient compared to best-known conventional density-based methods, which need to load all data into the memory; and also, to some fast distance-based methods, which can perform on data resident in the disk.
Community detection in sparse time-evolving graphs with a dynamical Bethe-Hessian
Dall'Amico, Lorenzo, Couillet, Romain, Tremblay, Nicolas
This article considers the problem of community detection in sparse dynamical graphs in which the community structure evolves over time. A fast spectral algorithm based on an extension of the Bethe-Hessian matrix is proposed, which benefits from the positive correlation in the class labels and in their temporal evolution and is designed to be applicable to any dynamical graph with a community structure. Under the dynamical degree-corrected stochastic block model, in the case of two classes of equal size, we demonstrate and support with extensive simulations that our proposed algorithm is capable of making non-trivial community reconstruction as soon as theoretically possible, thereby reaching the optimal detectability threshold and provably outperforming competing spectral methods.
Probing Task-Oriented Dialogue Representation from Language Models
Wu, Chien-Sheng, Xiong, Caiming
This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks. We approach the problem from two aspects: supervised classifier probe and unsupervised mutual information probe. We fine-tune a feed-forward layer as the classifier probe on top of a fixed pre-trained language model with annotated labels in a supervised way. Meanwhile, we propose an unsupervised mutual information probe to evaluate the mutual dependence between a real clustering and a representation clustering. The goals of this empirical paper are to 1) investigate probing techniques, especially from the unsupervised mutual information aspect, 2) provide guidelines of pre-trained language model selection for the dialogue research community, 3) find insights of pre-training factors for dialogue application that may be the key to success.
Using NumPy To Optimize Object Detection
This is Part 4 of our ongoing series on NumPy optimization. In Parts 1 and 2 we covered the concepts of vectorization and broadcasting, and how they can be applied to optimize an implementation of the K-Means clustering algorithm. Next in the cue, Part 3 covered important concepts like strides, reshape, and transpose in NumPy. In this post, Part 4, we'll cover the application of those concepts to speed up a deep learning-based object detector: YOLO. Here are the links to the earlier parts for your reference.
Quantizing Multiple Sources to a Common Cluster Center: An Asymptotic Analysis
We consider quantizing an $Ld$-dimensional sample, which is obtained by concatenating $L$ vectors from datasets of $d$-dimensional vectors, to a $d$-dimensional cluster center. The distortion measure is the weighted sum of $r$th powers of the distances between the cluster center and the samples. For $L=1$, one recovers the ordinary center based clustering formulation. The general case $L>1$ appears when one wishes to cluster a dataset through $L$ noisy observations of each of its members. We find a formula for the average distortion performance in the asymptotic regime where the number of cluster centers are large. We also provide an algorithm to numerically optimize the cluster centers and verify our analytical results on real and artificial datasets. In terms of faithfulness to the original (noiseless) dataset, our clustering approach outperforms the naive approach that relies on quantizing the $Ld$-dimensional noisy observation vectors to $Ld$-dimensional centers.