Goto

Collaborating Authors

 Clustering


Data-Driven Clustering via Parameterized Lloyd's Families

Neural Information Processing Systems

Algorithms for clustering points in metric spaces is a long-studied area of research. Clustering has seen a multitude of work both theoretically, in understanding the approximation guarantees possible for many objective functions such as k-median and k-means clustering, and experimentally, in finding the fastest algorithms and seeding procedures for Lloyd's algorithm. The performance of a given clustering algorithm depends on the specific application at hand, and this may not be known up front. For example, a "typical instance" may vary depending on the application, and different clustering heuristics perform differently depending on the instance. In this paper, we define an infinite family of algorithms generalizing Lloyd's algorithm, with one parameter controlling the the initialization procedure, and another parameter controlling the local search procedure.


Reviews: Hierarchical Clustering Beyond the Worst-Case

Neural Information Processing Systems

This paper studies the problem of hierarchical clustering in a beyond-the-worst-case setting, where the data is a random graph generated by a Hierarchical Stochastic Block Model (HSBM). It proposes an SVD linkage algorithm and proves that in certain regimes it returns a solution that is a constant approximation to the cost function of hierarchical clustering proposed by Dasgupta. It also considers an algorithm that combines the SDP relaxation approach and recursive sparsest cut approach in previous works to get a constant approximation in larger regimes. Roughly speaking, the HSBM model considered assumes some underlying tree T* over k leaves (each leaf corresponding to a bottom cluster containing a certain number of vertices). Each node in the tree has some weight in (0,1) and a node's weight should not be larger than its descendant's.


Reviews: A Practical Algorithm for Distributed Clustering and Outlier Detection

Neural Information Processing Systems

The paper addresses the problem of performing the distributed k-mean/median clustering in the presence of outliers, and at the same time identifying the outliers. Data are partitioned across multiple sites either adversarially or randomly, and the sites and a central coordinator work jointly by communications to get the data clustering and the outliers. The authors proposed a practical algorithm with bounded running time O(max{k,\log n} n), and bounded communication cost O(s(k\log n t)) and O(sk\log n t) for adversarial and random data partitioning respectively, for a dataset with n data points, k centers, t outliers, and partitioned across s sites. They used a traditional two-level clustering framework (Guha et al. 2017). If using a \gamma -approximation algorithm for (k,t)-mean/median as the second-level clustering algorithm, their distributed algorithm has a bounded O(\gamma) approximation factor. Extensive experimental studies were conducted to compare the performance of their algorithm with three baseline algorithms.


Reviews: Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search

Neural Information Processing Systems

The paper extends the work of Dasgupta towards defining a theoretical framework for evaluating hierarchical clustering. The paper defines an objective function that is equivalent to that of Dasgupta in the sense that an optimal solution for the cost function defined by Dasgupta will also be optimal for the one defined in this paper and vice versa. The behaviour of approximate solution will differ though because the cost function is of the form D(G) - C(T), where D is a constant (depending on the input G) and C(T) is the cost (defined by Dasgupta) and T is the hierarchical solution. The authors show a constant approximation guarantee w.r.t. They complement this approximation guarantee by showing that this approximation guarantee is almost tight.


Reviews: Clustering Billions of Reads for DNA Data Storage

Neural Information Processing Systems

The paper presents a solution to a new type of clustering problem that has emerged from studies of DNA-based storage. Information is encoded within DNA sequences and retrieved using short-read sequencing technology. The short-read sequencer will create multiple short overlapping sequence reads and these have to be clustered to establish whether they are from the same place in the original sequence. The characteristics of the clustering problem is that the clusters are pretty tight in terms of edit distance (25 max diameter here - that seems quite broad given current sequencing error rates) but well separated from each other (much larger distance between them than diameter). I thought this was an interesting and timely application.


Reviews: Inhomogeneous Hypergraph Clustering with Applications

Neural Information Processing Systems

This paper considers the hypergraph clustering problem in a more general setting where the cost of hyperedge cut depends on the partitioning of hyperedge (i.e., all cuts of the hyperedge are not treated the same). An algorithm is presented for minimizing the normalized cut in this general setting. The algorithm breaks down for general costs of the hyperedge cut; however the authors derive conditions under which the algorithm succeeds and has provable approximation guarantees. Detailed comments: The main contributions of the paper are Generalization of hypergraph partitioning to include inhomogeneous cut of the hyper edge; the motivation for this is clearly established. A novel technique to minimize the normalized cut for this problem.


Reviews: Fair Clustering Through Fairlets

Neural Information Processing Systems

While there is a growing body of work on fair supervised learning, here the authors present a first exploration of fair unsupervised learning by considering fair clustering. Each data point is labeled either red or blue, then an optimal k-clustering is sought which respects a given fairness criterion specified by a minimum threshold level of balance in each cluster. Balance lies in [0,1] with higher values corresponding to more equal numbers of red and blue points in every cluster. This is an interesting and natural problem formulation - though a more explicit example of exactly how this might be useful in practice would be helpful. For both k-center and k-median versions of the problem, it is neatly shown that fair clustering may be reduced to first finding a good'fairlet' decomposition and then solving the usual clustering problem on the centers of each fairlet.


Hierarchical Matrix Completion for the Prediction of Properties of Binary Mixtures

arXiv.org Artificial Intelligence

Predicting the thermodynamic properties of mixtures is crucial for process design and optimization in chemical engineering. Machine learning (ML) methods are gaining increasing attention in this field, but experimental data for training are often scarce, which hampers their application. In this work, we introduce a novel generic approach for improving data-driven models: inspired by the ancient rule "similia similibus solvuntur", we lump components that behave similarly into chemical classes and model them jointly in the first step of a hierarchical approach. While the information on class affiliations can stem in principle from any source, we demonstrate how classes can reproducibly be defined based on mixture data alone by agglomerative clustering. The information from this clustering step is then used as an informed prior for fitting the individual data. We demonstrate the benefits of this approach by applying it in connection with a matrix completion method (MCM) for predicting isothermal activity coefficients at infinite dilution in binary mixtures. Using clustering leads to significantly improved predictions compared to an MCM without clustering. Furthermore, the chemical classes learned from the clustering give exciting insights into what matters on the molecular level for modeling given mixture properties.


SHADE: Deep Density-based Clustering

arXiv.org Artificial Intelligence

Detecting arbitrarily shaped clusters in high-dimensional noisy data is challenging for current clustering methods. We introduce SHADE (Structure-preserving High-dimensional Analysis with Density-based Exploration), the first deep clustering algorithm that incorporates density-connectivity into its loss function. Similar to existing deep clustering algorithms, SHADE supports high-dimensional and large data sets with the expressive power of a deep autoencoder. In contrast to most existing deep clustering methods that rely on a centroid-based clustering objective, SHADE incorporates a novel loss function that captures density-connectivity. SHADE thereby learns a representation that enhances the separation of density-connected clusters. SHADE detects a stable clustering and noise points fully automatically without any user input. It outperforms existing methods in clustering quality, especially on data that contain non-Gaussian clusters, such as video data. Moreover, the embedded space of SHADE is suitable for visualization and interpretation of the clustering results as the individual shapes of the clusters are preserved.


A column generation algorithm with dynamic constraint aggregation for minimum sum-of-squares clustering

arXiv.org Artificial Intelligence

The minimum sum-of-squares clustering problem (MSSC), also known as $k$-means clustering, refers to the problem of partitioning $n$ data points into $k$ clusters, with the objective of minimizing the total sum of squared Euclidean distances between each point and the center of its assigned cluster. We propose an efficient algorithm for solving large-scale MSSC instances, which combines column generation (CG) with dynamic constraint aggregation (DCA) to effectively reduce the number of constraints considered in the CG master problem. DCA was originally conceived to reduce degeneracy in set partitioning problems by utilizing an aggregated restricted master problem obtained from a partition of the set partitioning constraints into disjoint clusters. In this work, we explore the use of DCA within a CG algorithm for MSSC exact solution. Our method is fine-tuned by a series of ablation studies on DCA design choices, and is demonstrated to significantly outperform existing state-of-the-art exact approaches available in the literature.